Thursday, June 06, 2013

Building a Lisp Interpreter from Scratch -- Part 10: Foreign Functions

(This is Part 10 of a series of posts on pLisp)

The foreign functions interface in pLisp is the gateway to access and use external code. It is defined by two special forms, LOAD-FOREIGN-LIBRARY and CALL-FOREIGN-FUNCTION. The first form takes a string (literal or otherwise) that represents the name of the shared library to load, and does a dlopen() on it. The successfully created handle to the library is stored internally by pLisp, and is used to service requests through the second form. The second form is the actual invocation of the library's exported function.

We use libffi for implementing these special forms.

The data types supported are integer, float, character and string, and pointers to these data types.

To illustrate with an example, if we have a shared library called libtest.so that exposes a function called fn_ret_float:


float fn_ret_float(int i, float f, char c, char *s)

{
  printf("%d\n",i);
  printf("%f\n",f);
  printf("%c\n",c);
  printf("%s\n",s);
  printf("%f\n", f + f);
  printf("exiting fn_ret_float\n");
  return f+f;
}

we invoke the function as:

$ ./plisp
Welcome to pLISP. Type 'quit' to exit.
USER> (load-foreign-library "libtest.so")
NIL
USER> (call-foreign-function "fn_ret_float" 'float
                             '((10 integer)
                             (6.5 float)
                             (#\a character)
                             ("abc" character-pointer)))
10
6.500000
a
abc
13.000000
exiting fn_ret_float
13.000000
USER>

The first parameter to CALL-FOREIGN-FUNCTION is the name of the to-be-invoked function, expressed as a string; the next parameter is the return type of the function; this is followed by a list containing the actual parameters and their respective data types. Passing the data types too may seem redundant, since pLisp can figure out the types on its own, but we still need to differentiate between pointers and non-pointers, so we have to live with this redundancy.

Another example:

int fn_arg_int_ptr(int *i)
{
  printf("passed value is %d\n", *i);
  *i = 100;
  return 0;
}

USER> (define i 10)
I
USER> (call-foreign-function "fn_arg_int_ptr" 'integer
                             '((i integer-pointer)))
passed value is 10
0
USER> i
100
USER>

This illustrates the above point about the redundancy -- if we do not mention the type of the argument (integer-pointer), pLisp will not know whether we need to pass the value of i or the address of i to the function.

A couple of points with respect to libffi:
  1. There seems to be some issues with using the ffi_arg struct for returning floats; I had to explicitly use a float variable for this.
  2. libffi is also used to implement the FORMAT special form. It handles the variable arguments requirement perfectly. More on FORMAT later.

Monday, May 27, 2013

May 27, 2013

(Obligatory 'Random thoughts on the IPL' post of the year)

1. If there is a single piece of imagery that succinctly captures the essence of this whole sordid IPL season -- no, make that the entire IPL -- it's the super slow motion footage (accompanied by faux operatic music, natch) of the scantily clad oxy blonde East European hooker cheerleader dancing with gay abandon, her face frozen in the rictus of feigned joy while holding an open Pepsi bottle that spews out its carbonated sugary poison all over the place, also in super slow motion. Lust, gluttony, greed, pride -- all there. Throw in  some footage of Sreesanth glaring balefully at a batsman while having a towel stuck in his waistband, and you've got wrath and sloth covered as well (brownie points if you figure out how sloth is involved).

2. In related news, the Oxford English Dictionary has decided to replace the present definitions of the words/phrases 'chutzpah', 'cojones', and 'conflict of interest' with just a single picture of Srinivasan. They claim that this change will save them about 0.13% in space, while adding 78% in clarity and simplicity to the definitions.

3. On a more serious note, the reaction of the BCCI president reminds me of Bill Clinton's impeachment troubles. The spin then was that it was a witch hunt that had to be resisted at any cost. What transpired subsequently, in terms of prosperity and economic growth ("It's the economy, stupid"), is purported to have vindicated this stand (ignoring the dotcom bust, and other downsides to bubble economics). Whether this is true in this case as well remains to be seen.

4. If the BCCI president has been able to stand up so far to the powerful forces arrayed against him, this either means that a) he has powerful backers in his corner supporting him behind the scenes or b) he has a few aces up his sleeve that serve as leverage. If you believe that the match/spot fixing scandal emerged as the result of genuine sleuthing and has nothing to do with the moves on the grand chessboard to control the riches offered by Indian cricket, I have lifetime season tickets for two for the hospitality box in all IPL matches to sell to you at a 90% discount.

5. The electronic media's coverage of the spot fixing scandal has been over the top as usual. While this is expected, Deccan Chronicle have also joined the party with their front-page attacks on Srinivasan. No doubt they're getting their revenge for some slight they suffered at his hands when they had a horse in the IPL race.

Friday, May 24, 2013

Building a Lisp Interpreter from Scratch -- Part 9: Serialization

(This is Part 9 of a series of posts on pLisp)

(Note: This post is quite out of sync with the code base; please see Part 13)

Serialization in pLisp is an all-or-nothing thing, i.e. not at the level of individual objects, but at the image level. You can invoke the CREATE-IMAGE special form at any point (at the top level, in the middle of execution, or while in debug mode) to dump the image to the hard disk to a file of your choosing. Loading the image is done by invoking pLisp with the file name as the argument.


$ ./plisp

Welcome to pLISP. Type 'quit' to exit.

USER> (defun fact (n)

        (apply * (range 1 n 1)))

FACT
USER> (fact 10)
3628800
USER> (create-image "test.image")
NIL
USER> q
Bye.

$ ./plisp test.image
Welcome to pLISP. Type 'quit' to exit.
USER> (fact 10)
3628800
USER> quit
Bye.

Nifty. An even more powerful aspect is the ability to do this in the middle of a debugging session (Hi, Smalltalk. It's been a long time, we should catch up sometime soon.):

$ ./plisp
Welcome to pLISP. Type 'quit' to exit.
USER> (let ((x 10))
        (break)
        (* 2 x))
BREAK
DEBUG> x
10
DEBUG> (create-image "debug.image")
NIL
DEBUG> (resume)
20
USER> q
Bye.

$ ./plisp debug.image
Welcome to pLISP. Type 'quit' to exit.
DEBUG> x
10
DEBUG> (resume)
20
USER> q
Bye.

Astute readers will observe that starting pLisp with the image takes us directly into the debugging session that was active when the image was created. The real utility of this feature is in allowing us to share the image file with somebody and enlist them in our dirty debugging efforts.

All these good things are possible courtesy of the TPL serialization library (thanks Troy, especially for help with the tricky bit involving packing arrays inside structs). The following variables capture the complete state of pLisp, and serializing/deserializing them is all it takes:

strings
top_level_env
free_list
last_segment
heap
current_package
gen_sym_count
packages
debug_mode
debug_continuation
debug_env
execution_stack
debug_execution_stack
reg_accumulator
reg_next_expression
reg_current_env
reg_current_value_rib
reg_current_stack

images.c is the place where all this action is.

Update: One feature not yet implemented is the serialization of references to foreign function libraries: if we load a bunch of .so's and then create an image, the handles to these libraries will be lost when we invoke pLisp with the image file. The way out is to serialize an array of strings containing the shared library names, and then do a dlopen() on these libraries afresh.

Monday, May 20, 2013

May 20, 2013

Quick, what's the first thing that comes to mind when you hear the word "shochaley"? If you answered "Vidya Balan", pat yourself on the back and stand in the line over there for your lollipop. I understand the importance of proper plumbing, the need for celebrities to lend their time to worthwhile causes, and so on, but if somebody told me that roping her in for this ad was a plot concocted by her rivals to damage her brand ("Jahan shochaley, vahan Vidya Balan"), I'll probably nod my head in thoughtful agreement.

Does anybody believe that spot-fixing in the IPL begins and ends with the three Rajasthan Royals players, and nobody else from the other franchises (the word 'franchise' itself reeks of the money-grubbing ethos of the whole thing) is involved? Please return the lollipop if you do. The convenient death of the person responsible for the tapping of the phones? Just a coincidence, please move along.

Must-read article about Ranbaxy.

Monday, May 06, 2013

Building a Lisp Interpreter from Scratch -- Part 8: Garbage Collection

(This is Part 8 of a series of posts on pLisp)

(Note: This post is somewhat out of sync with the code base; please see Part 13)

There are many algorithms available for garbage collection. We go with the simplest (well, the second simplest, since mark-and-sweep is simpler, but needs allocation of one bit in the object, something which we cannot do), i.e. tri-color marking.

Tri-color marking works like this: we have three sets of objects -- white, grey and black. The white set consists of all objects that can be garbage collected, while the black set contains objects that should not be (because somebody is holding on to references to these objects). The grey set contains objects that we're not sure about yet.

Before moving on, something about the when of garbage collection: we do GC after every execution of the REPL.

Whenever an object is created (through object_alloc()), it is added to the white set. If, after execution of the REPL, we find that it's still in the white set, it is GC'd. Objects save themselves by first getting themselves promoted to the grey set and then to the temporary haven that is the black set. Temporary because, as they say, you're only as good as your last hit, so beware the next call of gc() -- nobody may be holding your hand when the music stops.

The grey set starts off with all the root references. A root reference is an object that we're sure is reachable and hence cannot be GC'd. In the case of pLisp, the root references are all the top-level objects, conveniently captured in the top_level_env CONS object. Thus our grey set starts with this single object.

For each root reference, we do the following:
  1. Move it to the black set
  2. Move all the objects it directly references to the grey set.
The above algorithm is applied recursively till we end up with no more objects in the grey set.

Now GC is just a simple matter of freeing up RAW_PTRs corresponding to the objects remaining in the white set.

We use a binary search tree to store the contents of the three sets:

struct node
{
struct node *left;
struct node *right;
OBJECT_PTR key;
} ;

with the OBJECT_PTR values serving as the key. A slight wrinkle is needed to make sure that the BST remains balanced when we remove a node with both children present: we toss a coin to determine whether we go down the left or the right sub tree.

How to determine the objects referenced by a given object is pretty straightforward:
  1. If it's a CONS object, check its CAR and CDR
  2. If it's a closure or a macro object, it will reference three CONS objects: a parameter list, the enclosing environment, and the body of the closure/macro.
  3. For array objects, check all the array elements
  4. If it's a continuation object, the CONS object corresponding to the current call stack will be the referenced object (remember the trickery we resorted to to efficiently store the current call stack in the continuation object?)
  5. Objects of other types (character, integer, float, symbols, strings) do not reference other objects.
The memory needed to store symbols and strings does not come from the heap, but from a dynamic (char **) array called 'strings'. pLisp does not do any GC for this yet; maybe in the future.

Integers and floats, while not referencing other objects, are actually allocated on the heap. Lopping off the tag from objects of these two types yields a RAW_PTR that indexes into the heap; the relevant location contains 32 bits that is the integer/float value of the object. This is very inefficient, actually, since the same number will be stored multiple times on the heap, but this is required if we want to leverage the full 32 bits to store the integer (only 28 bits were being used earlier [see Part 3]). Maybe we can use the flyweight pattern or something.

Friday, May 03, 2013

May 3, 2013

Patriotism is the last refuge of the scoundrel.

Watching Musharraf's travails after his return to Pakistan reminds me of a Tom and Jerry cartoon: the one where Tom sneaks furtively into a yard, only to be ambushed by the bulldog, his tail firmly in the grip of said bulldog's jaws, with periodic 'boing, boing' noises emanating whenever Tom's head makes contact with the hard ground. Not that I have any sympathy for the f*cker. Musharraf, that is; Tom I dearly love.

Speaking of bulldogs, Surendra's recent cartoon about the CBI and the Supreme Court is a classic.

Tuesday, April 23, 2013

Building a Lisp Interpreter from Scratch -- Part 7: Continuations

(This is Part 7 of a series of posts on pLisp)

Ah, continuations. The feature that needed so much wracking of the brain, so much rewrite of the code to implement.

All this frustration stemmed from a pretty innocuous thing: my literal interpretation of the definition of a continuation. A continuation, in the context of an expression being evaluated, is defined as a function of one parameter which, if invoked with the result of the sub expression, would execute the rest of the computation. Sounds straightforward, once you wrap your head around the confusing and poorly worded previous sentence. You look at the examples from Paul Graham's On Lisp and go 'Yeah, that sounds OK, I get it'.

Big mistake. You do not get it.

It's quite easy to form the lambda expression equivalent to the current continuation for a given expression/sub-expression. What is not easy is to figure out how to save the state of the partial computation till that point (some of the arguments may have been evaluated, some may not; if the continuation is invoked later, the context till that point has to be abandoned, and replaced with that of the continuation; and so on), when your interpreter is implemented as nothing more than a call to a function named eval(), which calls itself multiple times (we're implicitly riding on the coattails of the C call stack, basically).

Long story short, Kent Dvybig's PhD thesis to the rescue, continuations trivial to implement if the interpreter is a proper virtual machine with registers, call stack, etc.

Our continuations are modelled on those of Scheme. The relevant primitive is CALL-CC (CALL-WITH-CURRENT-CONTINUATION has only 30 characters, so I went with CALL-CC). We invoke CALL-CC with a lambda form which is passed the current continuation object by pLisp. Textbook Scheme.

$ ./plisp
Welcome to pLISP. Type 'quit' to exit.
USER> (define x 0)
X
USER> (define cont)
CONT
USER> (progn (call-cc (lambda (cc) (set cont cc)))
             (incf x)
             x)
1
USER> x
1
USER> (cont '())
2
USER> x
2
USER>

As I mentioned earlier, we resort to some trickery when we store continuation objects. The only thing a continuation object captures is the current call stack, which is a CONS object (a list of call frames). We lop off the tag bits from this CONS object and simply retag the bits 1010 -- CONTINUATION_TAG -- to it, and call it a continuation object. Cheap trick, but efficient.

Monday, April 22, 2013

Building a Lisp Interpreter from Scratch -- Part 6: Macros

(This is Part 6 of a series of posts on pLisp)

Macros are where pLisp's affinity to Common Lisp is most clearly manifest. The macro definition syntax -- use of backquotes, comma, and comma-at -- pretty much mirrors that of CL:

(defmacro first (lst)
  `(car ,lst))

I guess I don't have much else to say in this post, so maybe I'll ruminate on macros in general.

One key lesson I learned while doing macros in pLisp is that macros and closures are identical (as evidenced by the identical code in vm.c that handles both), except for when they're invoked (compile time versus run time). There is nothing special about the 'special' operators used by macros, i.e., comma and comma-at: you can make calls to these operators from closures and derive the same expected behaviour from them as you would when calling them from a macro:


$ ./plisp

Welcome to pLISP. Type 'quit' to exit.

USER> (defun test (lst)

        `(first ,lst))
TEST
USER> (test '(1 2 3))
(FIRST (1 2 3))
USER> Huh.

Another aspect of pLisp macros is that they are top-level only, i.e., there is no MACROLET or its equivalent. Shouldn't be too difficult to add this, but I'm not (yet) convinced that their absence is that critical.

Tuesday, April 16, 2013

Building a Lisp Interpreter from Scratch -- Part 5: Compiler, Virtual Machine

(This is Part 5 of a series of posts on pLisp)

Using a compiler and a virtual machine for an interpreter may seem somewhat of an overkill, but makes sense for two reasons:
  1. Incorporating continuations in pLisp showed me that a VM-based approach (hence a compiler) is a straightforward way to do so -- virtual registers, call stack, the whole works -- rather than mucking around with code walkers and the like.
  2. Writing a compiler would also be useful when we want to run pLisp code natively.
Before we proceed further, some attributions are in order: Kent Dybvig's PhD thesis is pretty much the reference for much of this post.

The last time we spoke about the interpreter per se, we had just converted an expression_t struct into a native pLisp object (OBJECT_PTR). We are now ready to do further things with this pLisp object:
  1. Compile it
  2. Interpret it
(who could have seen that coming?)

The compiler converts a pLisp source form into a simplified Lisp form that is understood by the VM/interpreter. The following are the constructs in the simplified Lisp:

HALTHalts the virtual machine
REFERLoads a variable reference
CONSTANTLoads a constant
CLOSECreates a closure
MACROCreates a macro
TESTImplements IF/ELSE
ASSIGNStores a value in a variable
DEFINECreates a variable binding in the top level
CONTICreates a continuation
NUATEExecutes a continuation by replacing the current stack with that of the continuation
FRAMEStores the register contents in a new frame and pushes the frame on to the call stack
ARGUMENTAdds the results of the last evaluation to the value rib
APPLYClosure/Macro/Continuation application
RETURNPops a frame, restores registers
BACKQUOTEProcesses a backquote (can't be done by the compiler because we need to do some compiling at run time; see below) 

These constructs are predicated on the existence of the following, for want of a better term, ISA elements:

//register that stores the 
//results of the last computation
OBJECT_PTR reg_accumulator;

//the next expression to be 
//evaluated by the VM
OBJECT_PTR reg_next_expression;

//the environment in which the
//current expression is to be evaluated
OBJECT_PTR reg_current_env;

//holds the arguments evaluated
//till now (for closure applications)
OBJECT_PTR reg_current_value_rib;

//the call stack
OBJECT_PTR reg_current_stack;

The VM interprets the form pointed to by reg_next_expression, by suitably manipulating the registers, and sets the next expression to be evaluated, thereby perpetuating the computation, till it runs out of expressions to evaluate.

The astute reader will observe that all these elements are first-class pLisp objects. This is imperative especially for the current stack, since continuations encapsulate the call stack (although, to be fair, there are ways to reify the stack using native mechanisms and still end up with a first-class continuation object).

The call stack itself is a CONS object made up of frames, themselves pLisp arrays containing the next expression, environment, and the current value rib.

The working of the compiler is somewhat peculiar; the simplified Lisp that it produces from a pLisp source form is something like a Russian-doll-within-doll toy. An example will make this clear:

The source form

(define x 10)

gets compiled into

(CONSTANT 10 (DEFINE X (HALT)))

i.e., we have three 'instructions' to be interpreted by the VM (CONSTANT, DEFINE and HALT), and the first instruction contains the second, the second contains the third, and so on. Looks a bit disconcerting at first, but it works.

The compiler and interpreter work pretty independently of each other, except when it comes to macros and EVAL. If the compiler detects that it's dealing with a macro, it sets things up so that a call is made to the interpreter, and receives the results of the interpreting for compilation. The interpreter enlists the help of the compiler to process BACKQUOTE, COMMA, COMMA-AT and EVAL.

Thursday, April 11, 2013

Building a Lisp Interpreter from Scratch -- Part 4: Memory System

(This is Part 4 of a series of posts on pLisp)

(Note: This post is quite out of sync with the code base; please see Part 13)

As I mentioned briefly in Part 3, memory in pLisp is allocated from a fixed-size heap and objects are referred via indexes (called RAW_PTRs) into this heap.

Memory is addressed in pLisp not at the byte level, but at the word (32 bit) level, as indicated by the typedef for OBJECT_PTR (unsigned int). This is inefficient when we're dealing which characters, for example, but efficiency is not the primary concern for us.

The interface to the memory system is through three calls:

RAW_PTR object_alloc(int);
void set_heap(RAW_PTR, OBJECT_PTR);
OBJECT_PTR get_heap(RAW_PTR);
  1. object_alloc() is the equivalent of C's malloc(). It takes as argument the number of four-byte words to be allocated, and returns a RAW_PTR that points to the newly allocated space in the heap.
  2. set_heap() sets the pointed-to memory location in the heap to the given OBJECT_PTR value.
  3. get_heap() is the read counterpart of set_heap().
There is no deallocation procedure because we have GC that kicks in right after a form has been evaluated and its results reported to the top level. GC is a topic for another post.

The memory system is a straightforward text book implementation. We maintain a list of available chunks/segments of memory in the form of a linked list (initially there is just one big segment as big as the full heap; this gets chopped up as allocation requests are serviced).

The first word of a segment contains its size; the next word points to the next available segment. The subsequent words constitute the space available in that segment:



We maintain two variables called 'free_list' and 'last_segment' to store the start and end of the linked list respectively. To allocate memory of a certain size, we walk the free list and stop when we encounter the first segment that is large enough for our needs. The segment is removed from the list and its space is given to pLisp; whatever is left over is packaged off into a new (smaller) segment. An obvious improvement would be to look for the segment whose size most closely matches the amount of memory requested, rather than settle for the first segment that meets our needs. When it's time to free memory -- through GC -- the segment to be deallocated is simply attached to the free list at the end.

To illustrate this with an example, here's the implementation of the CONS operator:

OBJECT_PTR cons(OBJECT_PTR car, OBJECT_PTR cdr)
{
  log_function_entry("cons");

  RAW_PTR ptr = object_alloc(2);

  set_heap(ptr, car);
  set_heap(ptr+1, cdr);

  insert_node(&white, create_node((ptr << CONS_SHIFT) + CONS_TAG));

  log_function_exit("cons");

  return (ptr << CONS_SHIFT) + CONS_TAG;
}

Ignore the call to insert_node() for now; this is for GC and doesn't have a bearing on what we've discussed so far. Creation of objects of other types in pLisp follows the same template:
  1. A call to object_alloc() to allocate the memory
  2. Call(s) to set_heap() to populate both the type-specific information (e.g. length for array objects) and the object content itself (array elements, for example)
  3. Decorating the RAW_PTR with the relevant object tag and returning it.
And we're done with memory.

Tuesday, April 09, 2013

April 9, 2013

Netherlands is the next crisis candidate to fall (via Mish):
The Netherlands is still one of the most competitive countries in the European Union, but now that the real estate bubble has burst, it threatens to take down the entire economy with it. Unemployment is on the rise, consumption is down and growth has come to a standstill. Despite tough austerity measures, this year the government in The Hague will violate the EU deficit criterion, which forbid new borrowing of more than 3 percent of gross domestic product (GDP).
It's a heavy burden, especially for Dutch Finance Minister Jeroen Dijsselbloem, who is also the new head of the Euro Group, and now finds himself in the unexpected role of being both a watchdog for the monetary union and a crisis candidate.
Is this the same joker who got into trouble over his 'what's happening in Cyprus is a template' comment?  What do you know, it is. If you swing your leg too far back to kick someone, watch out, it could be your own ass that's gets the whupping.

Shouldn't speak ill of the dead, but this is hilarious:
"How should we honor her? Let’s privatize her funeral. Put it out to competitive tender and accept the cheapest bid. It’s what she would have wanted." said left-wing film director Ken Loach, according to Yahoo Movies. The funeral at St Paul's Cathedral is expected to cost around £8 million.

Friday, April 05, 2013

Building a Lisp Interpreter from Scratch -- Part 3: The Object System

(This is Part 3 of a series of posts on pLisp)

(Note: This post is quite out of sync with the code base; please see Part 13)

As I mentioned briefly at the end of Part 1, objects in pLisp are denoted by OBJECT_PTROBJECT_PTR is nothing but a fancy name for unsigned 32-bit integers:

typedef unsigned int OBJECT_PTR;

The 32 bits in an OBJECT_PTR do two things:

1. Specify what type of object we're looking at
2. Depending on the type of the object, either store its value, or point to where the value is stored.

#1 is accomplished by the last four bits, called the tag bits, and the remaining 28 bits take care of #2. Before dissecting OBJECT_PTR, a brief digression is in order: we need to understand how memory is handled in pLisp. I plan to do a separate post on the memory model, so I'll stop with a few key pieces of information that are relevant here.

Memory is allocated out of a heap, whose size is specified in the code (TODO: pass this as a command line parameter):

#define HEAP_SIZE 8388608

This memory is in the form of an array indexed by RAW_PTR values which are, surprise surprise, unsigned integers in drag:

typedef unsigned int RAW_PTR;

Digression over.

The table below lists the pLisp object types and how an OBJECT_PTR value is deconstructed to yield them (click to see full size):




Most of this is pretty self-explanatory, however a couple of explanations are in order:


First, implementing the full IEEE floating point format is too painful (I had done this for Vajra), so I enlisted the help of the C runtime for this. A float value is stored in a union:


union float_and_uint
{
  unsigned int i;
  float f;
};

When we want to extract the float value from an OBJECT_PTR, we simply extract the first 28 bits from it, index into the heap using these bits,  and store the heap content at that index into the 'i' value of the union, and voila, the 'f' value automagically contains the correct float representation. The same process in reverse works for creating the OBJECT_PTR value.

Second, we resort to some hackery for continuation objects as well. The only piece of information a continuation object has to store is the stack object active at the time of the creation of the continuation, so it seems wasteful to use the heap for this. Instead, we simply convert the stack object (a CONS) into a continuation object by manipulating the last four tag bits.

Thursday, April 04, 2013

Building a Lisp Interpreter from Scratch -- Part 2: Core Forms

(This is Part 2 of a series of posts on pLisp)

Before looking at the object model of pLisp, we turn our attention to the core forms in our interpreter. Core forms are the primitive expressions that need to be handled by the interpreter; they cannot be palmed off to the libraries, where they could be implemented in pLisp itself.

Here's a list of the core forms. Quite a few of them are there because of necessity (see above), while some of them are more for convenience, i.e, life as we know it would not end if they were absent. Also, if you want to do anything practical, say arithmetic, strings, arrays, and so on, and not end up with an interpreter that just manipulates symbols, you need things like +, -, ARRAY-GET, and ARRAY-SET (not to mention PRINT for output).

'Core' core forms QUOTE, LAMBDA, IF, SET, CALL-CC, DEFINE, CONS, EQ, ATOM, CAR, CDR
Basic real-world stuff+, -, *, /, GT, PRINT, ERROR
Intermediate real-world stuffSTRING, MAKE-ARRAY, ARRAY-GET, ARRAY-SET, SUB-ARRAY, ARRAY-LENGTH, PRINT-STRING
Advanced real-world stuffCREATE-PACKAGE, IN-PACKAGE, CREATE-IMAGE, LOAD-FOREIGN-LIBRARY, CALL-FOREIGN-FUNCTION
ConveniencePROGN, SETCAR, SETCDR, LISTP, SYMBOL-VALUE
MacrosCOMMA, COMMA-AT, BACKQUOTE, MACRO, GENSYM
DebuggingENV, BREAK, RESUME, BACKTRACE, EXPAND-MACRO
Up to elevenEVAL

A few things:

1. EQ tests for equality of content.

2. QUOTE, BACKQUOTE, COMMA and COMMA-AT are not the operators' real symbols; they are indicated so to avoid confusion with punctuation (ditto for GT, but this has more to do with my laziness in working with HTML tags).

3. PROGN can actually be implemented as a macro using a series of LAMBDAs; I just went with a native implementation. Macros involving LAMBDAs are used for implementing WHILE and LET, however (the LET macro uses the MAP function in turn):

(defmacro while (condition &rest body)
  `(((lambda (f) (set f (lambda ()
                          (if ,condition
                              (progn ,@body (f)))))) '())))

(defun map (f lst)
  (if (null lst)
      nil
    (cons (f (car lst)) (map f (cdr lst)))))

(defmacro let (specs &rest body)
  `((lambda ,(map car specs) ,@body) ,@(map cadr specs)))

4. The other comparison operators are implemented using macros. 

5. LISTP can be implemented as a macro as well, through ATOM.

6. Strings are implemented as arrays, the corresponding GET and SET are again taken care of by macros.

7. DEFUN and DEFMACRO are macros that build on LAMBDA and MACRO respectively.

8. Did I mention that macros are awesome?

Astute readers may be wondering by now what kind of a bastard Lisp they're looking at. I initially started out with Common Lisp in mind (though building pLisp as a Lisp-1), then made a detour into Scheme -- mainly for the continuations -- before I decided to persist with the veneer of CL (maybe not just the veneer; the macro system is more or less full CL; I didn't venture into Scheme syntax rules territory at all). Anyway, to a large extent, one dialect of Lisp can be made to look like another by liberal and injudicious usage of macros, so it doesn't matter that much for our purposes, I guess.

Tuesday, April 02, 2013

Building a Lisp Interpreter from Scratch -- Part 1: Syntax and Parsing

(This is Part 1 of a series of posts on pLisp)

OK, first things first. Before we can start interpreting our lisp code, we need to parse it. Conventional wisdom says that we don't need to use tools like Flex and Bison for this, since Lisp syntax is simplicity itself, and we can roll out our own parsing and scanning functionality. We're not bowing to conventional wisdom; at least I didn't, so Flex and Bison it is. I can hear somebody in the back benches muttering "so much for the 'from scratch' bit": I assure you, this is the only place where we don't roll out our own (actually there are two other pieces of functionality where we make use of third party code, but those are orthogonal to or not directly related what we're trying to achieve, so they don't count. Sue me).

The tokens we're interested in are:
  • Symbols (abc, x, ...)
  • Integers
  • Floating point numbers
  • String literals ("Hello world")
  • Character literals (#\a)
  • Left and right parentheses
  • Quote
  • Back quote
  • Comma
  • Comma-At
Inquiring minds can now mosey over to GitHub for a look at plisp.lex.

In addition to these, we also handle single- and multi-line comments, ignore white spaces and arrow key presses.

A Lisp form begins and ends with parentheses. When all the currently open parentheses are closed, it means the interpreter has something to start interpreting. Needless to say, if you pass the interpreter a parenthesis-less form, i.e., a symbol, a literal or a number constant, this will also be interpreted.

Before we feed the interpreter the form in a, well, form suitable for interpretation, we need to discern its structure. This structure is represented by a typedef (oh, by the way, now would be as good a time as any to mention this: we're doing the whole thing in C):

typedef struct expression
{
  int type;
  char *package_name;
  char *atom_value;
  int integer_value;
  float float_value;
  char char_value;
  int nof_elements;
  struct expression **elements;
} expression_t;

The type member indicates what is the type of the expression, captured in these #define's:

#define SYMBOL 1
#define LIST 2
#define INTEGER 3
#define STRING_LITERAL 4
#define CHARACTER 5
#define FLOAT 6

(Source)

The rest of the fields store the value of the expression, depending on its type (e.g. atom_value will be for the case where the expression is a symbol, char_value for when it is a single character, and so on). The elements member stores the elements in case the expression is a list. An added wrinkle is the package name, to handle cases where the symbol name is preceded by the package name, as in 'math:power').

The conversion of the token stream into an expression_t structure is done in plisp.y, the scanner. The distilled logic is as below:

expression: atom | list;
atom: integer | float | string_literal | character_literal | symbol;
list: expressions_in_parens | quoted_expression | backquoted_expression | comma_expression | comma_at_expression;
expressions_in_parens: left_paren expressions right_parens;
quoted_expression: quote expression;
backquoted_expression: backquote expression;
comma_expression: comma expression;
comma_at_expression: comma_at expression;
expressions: /* empty */ | expressions expression;

On a side note, by suitable manipulation of the yyin variable, we can feed the interpreter input from stdin, or from a file of our choosing (e.g. to load the pLisp library).

Once yyparse() has had its way with the input, the interpreter is all set to begin. But we'll delay things a bit more, for a very good reason: in Lisp, code and data are the same (the fancy word for this is homoiconicity), so we get a lot of bang for our buck if we convert an expression_t object to a pLisp object (denoted by OBJECT_PTR) before giving the interpreter the go-ahead. The pLisp object model is the subject of another post.