Tuesday, January 23, 2007
An even longer entry to scare away the new readers
Now, about that article … (hold on, it's a long one)
I'm talking about the limitations of programming which force the programmer to think like the computer rather than having the computer think more like the programmer. These are serious, deeply-ingrained limitations which will take a lot of effort to overcome. I'm not being pretentious when I say that this will be the next big paradigm shift in programming. We will need to completely redefine the way we write programs.
In this article, I present my view and my current work toward Language Oriented Programming (LOP). First I will show what is wrong with mainstream programming today, then I'll explain the concept of LOP by using the example of my existing implementation, the Meta Programming System (MPS). This article is intended to give you a bird's-eye-view of LOP, to spark interest in the idea, and hopefully to generate feedback and discussion.
Language Oriented Programming: The Next Programming Paradigm
The concept of using the best language suited to the problem at hand dates back to the first computer language devises, Fortran. The very name, Fortran, stands for “Formula Translation” and was a way to express mathematical equations in the more natural notation of
xn1 = ((A * yn) + B) * xn * (1.0 - xn) yn1 = ((C * xn) + D) * yn * (1.0 - yn)
and have the computer take the work of translating that to
fld [A] fmul [yn] fadd [B] fmul [xn] fld [xn] fld1 fsub fmul fst [xn1] fld [C] fmul [xn] fadd [D] fmul [yn] fld [yn] fld1 fsub fmul fst [yn1]
(the sample above assumes the instruction set supports floating point—if it doesn't, then you have a lot more code to call subroutines to do the work of adding, subtracting and multiplying floating point numbers; it's not pretty)
And even if a single language doesn't perfectly handle the entire problem domain, it has been possible for quite some time to do multilanague compilations, depending up platform. The most common method is to embed some routines written in Assembly to speed things up (since everything is converted to Assembly eventually), but as long as an operating system defines a standardized way to pass data between routines, it's not that hard to mix-n match routines from multiple languages.
I'm looking at the technical information for OS-9 (not to be confused with Mac OS 9), specifically for software modules:
A module need not be a complete program or even 6809 machine language. It may contain BASIC09 “I-code,” constants, single subroutines, and subroutine packages.
The module header contains a value specifying the language type, with values defined for data only, 6809 machine code, BASIC09-I-code, Pascal I-code and COBOL I-code. In theory then, you can construct, say, a pay roll system using a Fortran compiler to generate the math routines into 6809 code, COBOL to generate the business rules, and allow extentions to be written in BASIC. The language type is probably there for two reasons; 1) to know how to run the module, and 2) to figure out what parameter passing conventions to use (if there are any differences between the languages).
AmigaOS has a similar feature—you can create external libraries in any language you care to (say, in Pascal) and call them in programs written in any other language (say, in C—since the operating system itself is a library, any compiler on the Amiga had to support the paramter passing convention used in the system libraries, and while the only compiler I used on the Amiga was a C compiler, it had extentions to make external libraries, so I assume other compilers would have the same).
Even Microsoft Windows and IBM's OS/2 had a similar feature—DLLs. And from what I understand, it's not uncommon to have a “program” using DLLs written in C, C++ and Visual Basic at the same time.
But what Sergey Dmitriev is talking about is not multi-language programs per se but developing a domain specific languages to solve the problem. Or multiple domain specific languages if that's required.
And I have problems not only with that concept, but with his particular way of doing that as well. And no, it's not having to learn a million tiny computer languages (that's actually expected in this industry—when I started, the first language taught at FAU was Fortran, and while I was there, I saw the switch to Pascal and then C; after I left, it went to C++ and then Java).
That's not to say I have anything against DSLs—I think they're great, and I've written one or two in my time. When I worked at FAU, I ended up writing a DSL to generate programs to run Miller's Analogies (I worked as a programmer in the Math Department for a professor of Psychology—no, I don't fully understand how or why a psychologist ended up in the math department) which looked like:
mat 'CROWN' isto 'ROYAL' as 'prayer' 'crucifix' 'priesthood' 'bible' [sel] isto 'RELIGIOUS' 'b' answer mat 'SMALL' isto 'tiny' 'petite' 'little' 'diminutive' [sel] as 'LARGE' isto 'BIG' 'c' answer mat 'WORM' isto 'BIRD' as 'MOUSE' isto 'man' 'snake' 'rodent' 'lion' [sel] 'b' answer mat 'artist' 'description' 'narration' 'personality' [sel] isto 'CHARACTERIZATION' as 'PICTURE' isto 'PORTRAIT' 'b' answer mat 'orate' 'sing' 'mumble' 'speak' [sel] isto 'TALK' as 'SCRAWL' isto 'WRITE' 'c' answer
But there was one problem with this, and it's one of the problems I have with DSLs—the time it took me to write the DSL, which generated C code to handle the Miller's Analogies exceeded the time it would have taken me to write the C code initially! How much time? It took me maybe a month or so to write the DSL to the point where it would generate the necessary C code, maybe longer—it's been over ten years.
Now, I was lucky in that my job wasn't exactly fast paced, and I had the luxury of playing around with language design even though that wasn't my job. But as it was, I only did the one program with Miller's Analogies and from a purely economical standpoint, the work I did was wasteful (educational wise, it wasn't, as I was able to later use the work I did for the DSL as part of a class project, but sometimes it can be hard to justify research work in a non-research environment). So basically, unless the time it takes to write the DSL and the solution in said DSL is smaller than the time it would take to write the solution in a pre-existing language, or the DSL can be used more than once, it's not economically viable.
The other problem I have with DSL—what if you have the wrong mental model of the problem?
Going back to my hypothetical SNMP language extentions. The code as I wrote it looked like (bug fixed in this version):
IPAddress destination[]; IPAddress mask[]; IPAddress nexthop[]; int protocol[]; int age[]; int metric[]; int type[]; OID sys = SNMPv2-MIB::sysObjectID.0; if (sys == SNMPv2-SMI::enterprises.5567.1.1) /* riverstone */ { destination = IP-MIB::ip.24.4.1.1; mask = IP-MIB::ip.24.4.1.2; nexthop = IP-MIB::ip.24.4.1.4; protocol = IP-MIB::ip.24.4.1.7; age = IP-MIB::ip.24.4.1.8; metric = IP-MIB::ip.24.4.1.11; type = IP-MIB::ip.24.4.1.6; } else if (sys == SNMPv2-SMI::enterprises.9.1) /* cisco */ { destination = RFC1213-MIB::ipRouteDest; mask = RFC1213-MIB::ipRouteMask; nexthop = RFC1213-MIB::ipRouteNextHop; protocol = RFC1213-MIB::ipRouteProto; age = RFC1213-MIB::ipRouteAge; metric = RFC1213-MIB::ipRouteMetric1; type = RFC1213-MIB::ipRouteType; } /* skip the printing part */
Nice. Concise. And totally wrong!
Well, not totally—it'll work as is, but it does have a major
problem, and it stems from my not fully understanding the SNMP protocol (ah, gotta
love those leaky abstractions). SNMP is a network protocol, and each request for
data via SNMP
requires a round trip across the network. That's fine for single values,
like the request for SNMPv2-MIB::sysObjectID.0
. But the MIB RFC1213-MIB::ipRouteDest
is a “table” (an “array” in
SNMP-speak) and each
element requires a request (it's a bit more complicated than that, but it's
not germane to this discussion). The code above is basically doing the
following (and I'll only follow the Cisco branch since it's a lot of code
here):
for ( i = 0 , mib = RFC1213-MIB::ipRouteDest; mib != NULL ; destination[i++] = mib.result ) { send(mib); /* send the requst */ /*-------------------------------------------------- ; due to the nature of SNMP tables, you get back ; not only the value of the MIB you asked for, but ; the MIB of the next entry in the table. ; ; Hey, I didn't write SNMP ;-------------------------------------------------*/ mib = receive(); } for ( i = 0 , mib = RFC1213-MIB::ipRouteMask; mib != NULL; mask[i++] = mib.result ) { send(mib); mib = receive(); } for ( i = 0 , mib = RFC1213-MIB::ipRouteNextHop; mib != NULL; nexthop[i++] = mib.result ) { send(mib); mib = receive(); } for ( i = 0 , mib = RFC1213-MIB::ipRouteProto; mib != NULL; protocol[i++] = mib.result ) { send(mib); mib = receive(); } for ( i = 0 , mib = RFC1213-MIB::ipRouteAge; mib != NULL; age[i++] = mib.result ) { send(mib); mib = receive(); } for ( i = 0 , mib = RFC1213-MIB::ipRouteMetric1; mib != NULL; metric[i++] = mib.result ) { send(mib); mib = receive(); } for ( i = 0 , mib = RFC1213-MIB::ipRouteType; mib != NULL; metric[i++] = mib.result ) { send(mib); mib = receive(); }
But the original code in ospfquery
is doing the
following:
mib.dest = RFC1213-MIB::ipRouteDest; mib.mask = RFC1213-MIB::ipRouteMask; mib.nexthop = RFC1213-MIB::ipRouteNextHop; mib.protocol = RFC1213-MIB::ipRouteProtocol; mib.age = RFC1213-MIB::ipRouteAge; mib.metric = RFC1213-MIB::ipRouteMetric1; mib.type = RFC1213-MIB::ipRouteType; i = 0; while(mib.dest) { send( /* yes, all this is sent */ mib.dest, /* in one packet */ mib.mask, mib.nexthop, mib.protocol, mib.age, mib.metric, mib.type ); mib.dest, /* and received in one packet */ mib.mask, mib.nexthop, mib.protocol, mib.age, mib.metric, mib.type = receive(); destination[i] = mib.dest.result; mask[i] = mib.mask.result; nexthop[i] = mib.nexthop.result; protocol[i] = mib.protocol.result; age[i] = mib.protocol.result; metric[i] = mib.metric.result; type[i] = mib.type.result; i++; }
Now, I should mention that send()
causes a single SNMP packet to be sent,
and receive()
receives a single SNMP packet. With that, you can now see that the
code as I envisioned it sends a metric buttload of traffic compared
to the code as ospfquery
implements (about seven times the
traffic). The other major difference is that my code requires
all the data to be collected before it can be printed, whereas the
code in ospfquery
can print the results as it gets them (not
that it does currently, but it can). So even if the two methods take the
same amount of time, the second one seem faster (it's the
perception that it's faster that is important in this case).
The distinction is subtle, and on our routers it's mostly academic, what
with their two dozen or so routes. But I actually ran ospfquery
on the core router, with its 78,500 routes, and the difference is
no longer academic (ospfquery
does not print routes until it
collects all the data, and at first, I thought the program might have
crashed; nope, it just took a while).
And the problem stems from my misunderstanding of the problem, and my proposed DSL, while it works, is sub-optimal and in fact, may cause problems as it scales (in this particular case).
I've also seen what happens to code over time as multiple people maintain the code. I shudder to think what would happen to a language over time as multiple people maintain it (hmmm … that might help explain Perl—I'll have to think on this).
Now, getting back to Mr. Dmitiev's article:
When a compiler compiles source code, it parses the text into a tree-like graph structure called an abstract syntax tree. Programmers do essentially the same operation mentally when they read source code. We still have to think about the tree-like structure of the program. That's why we have brackets and braces and parentheses. It's also why we need to format and indent code and follow coding conventions, so that it is easier to read the source.
Why do we resort to text storage? Because currently, the most convenient and universal way to read and edit programs is with a text editor. But we pay a price because text representations of programs have big drawbacks, the most important of which is that text-based programming languages are very difficult to extend. If programs are stored as text, you need an unambiguous grammar to parse the program. As features are added to the language, it becomes increasingly difficult to add new extensions without making the language ambiguous. We would need to invent more types of brackets, operators, keywords, rules of ordering, nesting, etc. Language designers spend enormous amounts of time thinking about text syntax and trying to find new ways to extend it.
If we are going to make creating languages easy, we need to separate the representation and storage of the program from the program itself. We should store programs directly as a structured graph, since this allows us to make any extensions we like to the language. Sometimes, we wouldn't even need to consider text storage at all. A good example of this today is an Excel spreadsheet. Ninety-nine percent of people don't need to deal with the stored format at all, and there are always import and export features when the issue comes up. The only real reason we use text today is because we don't have any better editors than text editors. But we can change this.
Language Oriented Programming: The Next Programming Paradigm
A laudable idea, but not quite so simple as Mr. Dmitriev makes it out to be. Sure, the example I gave before:
xn1 = ((A * yn) + B) * xn * (1.0 - xn) yn1 = ((C * xn) + D) * yn * (1.0 - yn)
Can be made easily into a graph:
(Okay, so it's a graph of just the first statement)
I even color coded it—red for modified variables, green for variable references, blue for constants, and black for operations. But what about the following bit of code for a Connecton Machine?
(DEFUN PATH-LENGTH (A B G) α(SETF (LABEL •G) +INF) ; step 1 (SETF (LABEL A) 0) ; step 2 (LOOP UNTIL (< (LABEL B) +INF) ; step 3 DO α(SETF (LABEL •(REMOVE A G)) (1+ (βMIN α(LABEL •(NEIGHBORS •G)))))) (LABEL B)) ; step 4
This algorithm, which finds the length of the shortest path between
A
and B
in graph G
is as follows:
- Label all vertices with +∞.
- Label vertex
A
with 0. - Label every vertex, except
A
, with 1 plus the minimum of its neighbor's labels. Repeat this step until the label of vertexB
is finite. - Terminate. The label of
B
is the answer.
The notation is a bit weird I admit, but basically, “α” is similar to
Common Lisp's MAPCAR
, which maps a function to each element of
a list, so the first line could be translated as:
(MAPCAR #'(LAMBDA (V) (SETF (LABEL V) +INF)) G)
The “•” is what we're mapping over, and “β” is a reduce operation—it maps a function over each element of a function; this function accumulates a single value result. For instance:
(REDUCE #'+ 0 '(3 7 2 99 4 3))
will add all the elements in a list and return a sum, whereas:
(REDUCE #'MAX 0 '(3 7 2 99 4 3))
will return the smallest value in a list. “α,” “β” and “•” aren't
actually MAPCAR
, REDUCE
and the target we're
iterating over, but it's close enough.
Now, getting back to my argument here. Again, the code itself is pretty
easy to turn into a tree (it already is in tree format, but it's hard to see
the tree for all the parenthesis) but still, what does that give us? While I
think the Connection Machine notation is elegant, C doesn't exactly have the
concept of MAPCAR
& Co. It can be faked, sure, but it's not
an inherent part of the language.
But what it seems Mr. Dmitriev is working on is a type of automated template system, where you define your extensions, then create a templated implementation for your target language, only using a specialized editor. You do this for everything—from simple operations and assignments up through function calls and modules. And you need to define the ways these things all interact (hopefully to avoid some of the darker corners of C++, for instance).
Seems like quite a bit work, even if Mr. Dmitriev states a lot of this is automagically generated (and that's scary in and of itself—automagically generated stuff; I tend to distrust automagically generated stuff unless I understand exactly what's going on).
And from the looks of Mr. Dmitriev's editor (in his system, you don't manipulate textual source code, which is primitive stone age programming) it appears to target Java; I wonder how it would fare in trying to extend C with Lisp like features? Or Lisp with Forth like features? Or even C with my propsed SNMP like features?
Oh, and the reason we still use text to store source code? Because I can still load 40 year old source code into my text editor and modify it. What happens 40 years down the line with code written in Mr. Dmitriev's editor?