Tuesday, April 23, 2013

Building a Lisp Interpreter from Scratch -- Part 7: Continuations

(This is Part 7 of a series of posts on pLisp)

Ah, continuations. The feature that needed so much wracking of the brain, so much rewrite of the code to implement.

All this frustration stemmed from a pretty innocuous thing: my literal interpretation of the definition of a continuation. A continuation, in the context of an expression being evaluated, is defined as a function of one parameter which, if invoked with the result of the sub expression, would execute the rest of the computation. Sounds straightforward, once you wrap your head around the confusing and poorly worded previous sentence. You look at the examples from Paul Graham's On Lisp and go 'Yeah, that sounds OK, I get it'.

Big mistake. You do not get it.

It's quite easy to form the lambda expression equivalent to the current continuation for a given expression/sub-expression. What is not easy is to figure out how to save the state of the partial computation till that point (some of the arguments may have been evaluated, some may not; if the continuation is invoked later, the context till that point has to be abandoned, and replaced with that of the continuation; and so on), when your interpreter is implemented as nothing more than a call to a function named eval(), which calls itself multiple times (we're implicitly riding on the coattails of the C call stack, basically).

Long story short, Kent Dvybig's PhD thesis to the rescue, continuations trivial to implement if the interpreter is a proper virtual machine with registers, call stack, etc.

Our continuations are modelled on those of Scheme. The relevant primitive is CALL-CC (CALL-WITH-CURRENT-CONTINUATION has only 30 characters, so I went with CALL-CC). We invoke CALL-CC with a lambda form which is passed the current continuation object by pLisp. Textbook Scheme.

$ ./plisp
Welcome to pLISP. Type 'quit' to exit.
USER> (define x 0)
X
USER> (define cont)
CONT
USER> (progn (call-cc (lambda (cc) (set cont cc)))
             (incf x)
             x)
1
USER> x
1
USER> (cont '())
2
USER> x
2
USER>

As I mentioned earlier, we resort to some trickery when we store continuation objects. The only thing a continuation object captures is the current call stack, which is a CONS object (a list of call frames). We lop off the tag bits from this CONS object and simply retag the bits 1010 -- CONTINUATION_TAG -- to it, and call it a continuation object. Cheap trick, but efficient.

Monday, April 22, 2013

Building a Lisp Interpreter from Scratch -- Part 6: Macros

(This is Part 6 of a series of posts on pLisp)

Macros are where pLisp's affinity to Common Lisp is most clearly manifest. The macro definition syntax -- use of backquotes, comma, and comma-at -- pretty much mirrors that of CL:

(defmacro first (lst)
  `(car ,lst))

I guess I don't have much else to say in this post, so maybe I'll ruminate on macros in general.

One key lesson I learned while doing macros in pLisp is that macros and closures are identical (as evidenced by the identical code in vm.c that handles both), except for when they're invoked (compile time versus run time). There is nothing special about the 'special' operators used by macros, i.e., comma and comma-at: you can make calls to these operators from closures and derive the same expected behaviour from them as you would when calling them from a macro:


$ ./plisp

Welcome to pLISP. Type 'quit' to exit.

USER> (defun test (lst)

        `(first ,lst))
TEST
USER> (test '(1 2 3))
(FIRST (1 2 3))
USER> Huh.

Another aspect of pLisp macros is that they are top-level only, i.e., there is no MACROLET or its equivalent. Shouldn't be too difficult to add this, but I'm not (yet) convinced that their absence is that critical.

Tuesday, April 16, 2013

Building a Lisp Interpreter from Scratch -- Part 5: Compiler, Virtual Machine

(This is Part 5 of a series of posts on pLisp)

Using a compiler and a virtual machine for an interpreter may seem somewhat of an overkill, but makes sense for two reasons:
  1. Incorporating continuations in pLisp showed me that a VM-based approach (hence a compiler) is a straightforward way to do so -- virtual registers, call stack, the whole works -- rather than mucking around with code walkers and the like.
  2. Writing a compiler would also be useful when we want to run pLisp code natively.
Before we proceed further, some attributions are in order: Kent Dybvig's PhD thesis is pretty much the reference for much of this post.

The last time we spoke about the interpreter per se, we had just converted an expression_t struct into a native pLisp object (OBJECT_PTR). We are now ready to do further things with this pLisp object:
  1. Compile it
  2. Interpret it
(who could have seen that coming?)

The compiler converts a pLisp source form into a simplified Lisp form that is understood by the VM/interpreter. The following are the constructs in the simplified Lisp:

HALTHalts the virtual machine
REFERLoads a variable reference
CONSTANTLoads a constant
CLOSECreates a closure
MACROCreates a macro
TESTImplements IF/ELSE
ASSIGNStores a value in a variable
DEFINECreates a variable binding in the top level
CONTICreates a continuation
NUATEExecutes a continuation by replacing the current stack with that of the continuation
FRAMEStores the register contents in a new frame and pushes the frame on to the call stack
ARGUMENTAdds the results of the last evaluation to the value rib
APPLYClosure/Macro/Continuation application
RETURNPops a frame, restores registers
BACKQUOTEProcesses a backquote (can't be done by the compiler because we need to do some compiling at run time; see below) 

These constructs are predicated on the existence of the following, for want of a better term, ISA elements:

//register that stores the 
//results of the last computation
OBJECT_PTR reg_accumulator;

//the next expression to be 
//evaluated by the VM
OBJECT_PTR reg_next_expression;

//the environment in which the
//current expression is to be evaluated
OBJECT_PTR reg_current_env;

//holds the arguments evaluated
//till now (for closure applications)
OBJECT_PTR reg_current_value_rib;

//the call stack
OBJECT_PTR reg_current_stack;

The VM interprets the form pointed to by reg_next_expression, by suitably manipulating the registers, and sets the next expression to be evaluated, thereby perpetuating the computation, till it runs out of expressions to evaluate.

The astute reader will observe that all these elements are first-class pLisp objects. This is imperative especially for the current stack, since continuations encapsulate the call stack (although, to be fair, there are ways to reify the stack using native mechanisms and still end up with a first-class continuation object).

The call stack itself is a CONS object made up of frames, themselves pLisp arrays containing the next expression, environment, and the current value rib.

The working of the compiler is somewhat peculiar; the simplified Lisp that it produces from a pLisp source form is something like a Russian-doll-within-doll toy. An example will make this clear:

The source form

(define x 10)

gets compiled into

(CONSTANT 10 (DEFINE X (HALT)))

i.e., we have three 'instructions' to be interpreted by the VM (CONSTANT, DEFINE and HALT), and the first instruction contains the second, the second contains the third, and so on. Looks a bit disconcerting at first, but it works.

The compiler and interpreter work pretty independently of each other, except when it comes to macros and EVAL. If the compiler detects that it's dealing with a macro, it sets things up so that a call is made to the interpreter, and receives the results of the interpreting for compilation. The interpreter enlists the help of the compiler to process BACKQUOTE, COMMA, COMMA-AT and EVAL.

Thursday, April 11, 2013

Building a Lisp Interpreter from Scratch -- Part 4: Memory System

(This is Part 4 of a series of posts on pLisp)

(Note: This post is quite out of sync with the code base; please see Part 13)

As I mentioned briefly in Part 3, memory in pLisp is allocated from a fixed-size heap and objects are referred via indexes (called RAW_PTRs) into this heap.

Memory is addressed in pLisp not at the byte level, but at the word (32 bit) level, as indicated by the typedef for OBJECT_PTR (unsigned int). This is inefficient when we're dealing which characters, for example, but efficiency is not the primary concern for us.

The interface to the memory system is through three calls:

RAW_PTR object_alloc(int);
void set_heap(RAW_PTR, OBJECT_PTR);
OBJECT_PTR get_heap(RAW_PTR);
  1. object_alloc() is the equivalent of C's malloc(). It takes as argument the number of four-byte words to be allocated, and returns a RAW_PTR that points to the newly allocated space in the heap.
  2. set_heap() sets the pointed-to memory location in the heap to the given OBJECT_PTR value.
  3. get_heap() is the read counterpart of set_heap().
There is no deallocation procedure because we have GC that kicks in right after a form has been evaluated and its results reported to the top level. GC is a topic for another post.

The memory system is a straightforward text book implementation. We maintain a list of available chunks/segments of memory in the form of a linked list (initially there is just one big segment as big as the full heap; this gets chopped up as allocation requests are serviced).

The first word of a segment contains its size; the next word points to the next available segment. The subsequent words constitute the space available in that segment:



We maintain two variables called 'free_list' and 'last_segment' to store the start and end of the linked list respectively. To allocate memory of a certain size, we walk the free list and stop when we encounter the first segment that is large enough for our needs. The segment is removed from the list and its space is given to pLisp; whatever is left over is packaged off into a new (smaller) segment. An obvious improvement would be to look for the segment whose size most closely matches the amount of memory requested, rather than settle for the first segment that meets our needs. When it's time to free memory -- through GC -- the segment to be deallocated is simply attached to the free list at the end.

To illustrate this with an example, here's the implementation of the CONS operator:

OBJECT_PTR cons(OBJECT_PTR car, OBJECT_PTR cdr)
{
  log_function_entry("cons");

  RAW_PTR ptr = object_alloc(2);

  set_heap(ptr, car);
  set_heap(ptr+1, cdr);

  insert_node(&white, create_node((ptr << CONS_SHIFT) + CONS_TAG));

  log_function_exit("cons");

  return (ptr << CONS_SHIFT) + CONS_TAG;
}

Ignore the call to insert_node() for now; this is for GC and doesn't have a bearing on what we've discussed so far. Creation of objects of other types in pLisp follows the same template:
  1. A call to object_alloc() to allocate the memory
  2. Call(s) to set_heap() to populate both the type-specific information (e.g. length for array objects) and the object content itself (array elements, for example)
  3. Decorating the RAW_PTR with the relevant object tag and returning it.
And we're done with memory.

Tuesday, April 09, 2013

April 9, 2013

Netherlands is the next crisis candidate to fall (via Mish):
The Netherlands is still one of the most competitive countries in the European Union, but now that the real estate bubble has burst, it threatens to take down the entire economy with it. Unemployment is on the rise, consumption is down and growth has come to a standstill. Despite tough austerity measures, this year the government in The Hague will violate the EU deficit criterion, which forbid new borrowing of more than 3 percent of gross domestic product (GDP).
It's a heavy burden, especially for Dutch Finance Minister Jeroen Dijsselbloem, who is also the new head of the Euro Group, and now finds himself in the unexpected role of being both a watchdog for the monetary union and a crisis candidate.
Is this the same joker who got into trouble over his 'what's happening in Cyprus is a template' comment?  What do you know, it is. If you swing your leg too far back to kick someone, watch out, it could be your own ass that's gets the whupping.

Shouldn't speak ill of the dead, but this is hilarious:
"How should we honor her? Let’s privatize her funeral. Put it out to competitive tender and accept the cheapest bid. It’s what she would have wanted." said left-wing film director Ken Loach, according to Yahoo Movies. The funeral at St Paul's Cathedral is expected to cost around £8 million.

Friday, April 05, 2013

Building a Lisp Interpreter from Scratch -- Part 3: The Object System

(This is Part 3 of a series of posts on pLisp)

(Note: This post is quite out of sync with the code base; please see Part 13)

As I mentioned briefly at the end of Part 1, objects in pLisp are denoted by OBJECT_PTROBJECT_PTR is nothing but a fancy name for unsigned 32-bit integers:

typedef unsigned int OBJECT_PTR;

The 32 bits in an OBJECT_PTR do two things:

1. Specify what type of object we're looking at
2. Depending on the type of the object, either store its value, or point to where the value is stored.

#1 is accomplished by the last four bits, called the tag bits, and the remaining 28 bits take care of #2. Before dissecting OBJECT_PTR, a brief digression is in order: we need to understand how memory is handled in pLisp. I plan to do a separate post on the memory model, so I'll stop with a few key pieces of information that are relevant here.

Memory is allocated out of a heap, whose size is specified in the code (TODO: pass this as a command line parameter):

#define HEAP_SIZE 8388608

This memory is in the form of an array indexed by RAW_PTR values which are, surprise surprise, unsigned integers in drag:

typedef unsigned int RAW_PTR;

Digression over.

The table below lists the pLisp object types and how an OBJECT_PTR value is deconstructed to yield them (click to see full size):




Most of this is pretty self-explanatory, however a couple of explanations are in order:


First, implementing the full IEEE floating point format is too painful (I had done this for Vajra), so I enlisted the help of the C runtime for this. A float value is stored in a union:


union float_and_uint
{
  unsigned int i;
  float f;
};

When we want to extract the float value from an OBJECT_PTR, we simply extract the first 28 bits from it, index into the heap using these bits,  and store the heap content at that index into the 'i' value of the union, and voila, the 'f' value automagically contains the correct float representation. The same process in reverse works for creating the OBJECT_PTR value.

Second, we resort to some hackery for continuation objects as well. The only piece of information a continuation object has to store is the stack object active at the time of the creation of the continuation, so it seems wasteful to use the heap for this. Instead, we simply convert the stack object (a CONS) into a continuation object by manipulating the last four tag bits.

Thursday, April 04, 2013

Building a Lisp Interpreter from Scratch -- Part 2: Core Forms

(This is Part 2 of a series of posts on pLisp)

Before looking at the object model of pLisp, we turn our attention to the core forms in our interpreter. Core forms are the primitive expressions that need to be handled by the interpreter; they cannot be palmed off to the libraries, where they could be implemented in pLisp itself.

Here's a list of the core forms. Quite a few of them are there because of necessity (see above), while some of them are more for convenience, i.e, life as we know it would not end if they were absent. Also, if you want to do anything practical, say arithmetic, strings, arrays, and so on, and not end up with an interpreter that just manipulates symbols, you need things like +, -, ARRAY-GET, and ARRAY-SET (not to mention PRINT for output).

'Core' core forms QUOTE, LAMBDA, IF, SET, CALL-CC, DEFINE, CONS, EQ, ATOM, CAR, CDR
Basic real-world stuff+, -, *, /, GT, PRINT, ERROR
Intermediate real-world stuffSTRING, MAKE-ARRAY, ARRAY-GET, ARRAY-SET, SUB-ARRAY, ARRAY-LENGTH, PRINT-STRING
Advanced real-world stuffCREATE-PACKAGE, IN-PACKAGE, CREATE-IMAGE, LOAD-FOREIGN-LIBRARY, CALL-FOREIGN-FUNCTION
ConveniencePROGN, SETCAR, SETCDR, LISTP, SYMBOL-VALUE
MacrosCOMMA, COMMA-AT, BACKQUOTE, MACRO, GENSYM
DebuggingENV, BREAK, RESUME, BACKTRACE, EXPAND-MACRO
Up to elevenEVAL

A few things:

1. EQ tests for equality of content.

2. QUOTE, BACKQUOTE, COMMA and COMMA-AT are not the operators' real symbols; they are indicated so to avoid confusion with punctuation (ditto for GT, but this has more to do with my laziness in working with HTML tags).

3. PROGN can actually be implemented as a macro using a series of LAMBDAs; I just went with a native implementation. Macros involving LAMBDAs are used for implementing WHILE and LET, however (the LET macro uses the MAP function in turn):

(defmacro while (condition &rest body)
  `(((lambda (f) (set f (lambda ()
                          (if ,condition
                              (progn ,@body (f)))))) '())))

(defun map (f lst)
  (if (null lst)
      nil
    (cons (f (car lst)) (map f (cdr lst)))))

(defmacro let (specs &rest body)
  `((lambda ,(map car specs) ,@body) ,@(map cadr specs)))

4. The other comparison operators are implemented using macros. 

5. LISTP can be implemented as a macro as well, through ATOM.

6. Strings are implemented as arrays, the corresponding GET and SET are again taken care of by macros.

7. DEFUN and DEFMACRO are macros that build on LAMBDA and MACRO respectively.

8. Did I mention that macros are awesome?

Astute readers may be wondering by now what kind of a bastard Lisp they're looking at. I initially started out with Common Lisp in mind (though building pLisp as a Lisp-1), then made a detour into Scheme -- mainly for the continuations -- before I decided to persist with the veneer of CL (maybe not just the veneer; the macro system is more or less full CL; I didn't venture into Scheme syntax rules territory at all). Anyway, to a large extent, one dialect of Lisp can be made to look like another by liberal and injudicious usage of macros, so it doesn't matter that much for our purposes, I guess.

Tuesday, April 02, 2013

Building a Lisp Interpreter from Scratch -- Part 1: Syntax and Parsing

(This is Part 1 of a series of posts on pLisp)

OK, first things first. Before we can start interpreting our lisp code, we need to parse it. Conventional wisdom says that we don't need to use tools like Flex and Bison for this, since Lisp syntax is simplicity itself, and we can roll out our own parsing and scanning functionality. We're not bowing to conventional wisdom; at least I didn't, so Flex and Bison it is. I can hear somebody in the back benches muttering "so much for the 'from scratch' bit": I assure you, this is the only place where we don't roll out our own (actually there are two other pieces of functionality where we make use of third party code, but those are orthogonal to or not directly related what we're trying to achieve, so they don't count. Sue me).

The tokens we're interested in are:
  • Symbols (abc, x, ...)
  • Integers
  • Floating point numbers
  • String literals ("Hello world")
  • Character literals (#\a)
  • Left and right parentheses
  • Quote
  • Back quote
  • Comma
  • Comma-At
Inquiring minds can now mosey over to GitHub for a look at plisp.lex.

In addition to these, we also handle single- and multi-line comments, ignore white spaces and arrow key presses.

A Lisp form begins and ends with parentheses. When all the currently open parentheses are closed, it means the interpreter has something to start interpreting. Needless to say, if you pass the interpreter a parenthesis-less form, i.e., a symbol, a literal or a number constant, this will also be interpreted.

Before we feed the interpreter the form in a, well, form suitable for interpretation, we need to discern its structure. This structure is represented by a typedef (oh, by the way, now would be as good a time as any to mention this: we're doing the whole thing in C):

typedef struct expression
{
  int type;
  char *package_name;
  char *atom_value;
  int integer_value;
  float float_value;
  char char_value;
  int nof_elements;
  struct expression **elements;
} expression_t;

The type member indicates what is the type of the expression, captured in these #define's:

#define SYMBOL 1
#define LIST 2
#define INTEGER 3
#define STRING_LITERAL 4
#define CHARACTER 5
#define FLOAT 6

(Source)

The rest of the fields store the value of the expression, depending on its type (e.g. atom_value will be for the case where the expression is a symbol, char_value for when it is a single character, and so on). The elements member stores the elements in case the expression is a list. An added wrinkle is the package name, to handle cases where the symbol name is preceded by the package name, as in 'math:power').

The conversion of the token stream into an expression_t structure is done in plisp.y, the scanner. The distilled logic is as below:

expression: atom | list;
atom: integer | float | string_literal | character_literal | symbol;
list: expressions_in_parens | quoted_expression | backquoted_expression | comma_expression | comma_at_expression;
expressions_in_parens: left_paren expressions right_parens;
quoted_expression: quote expression;
backquoted_expression: backquote expression;
comma_expression: comma expression;
comma_at_expression: comma_at expression;
expressions: /* empty */ | expressions expression;

On a side note, by suitable manipulation of the yyin variable, we can feed the interpreter input from stdin, or from a file of our choosing (e.g. to load the pLisp library).

Once yyparse() has had its way with the input, the interpreter is all set to begin. But we'll delay things a bit more, for a very good reason: in Lisp, code and data are the same (the fancy word for this is homoiconicity), so we get a lot of bang for our buck if we convert an expression_t object to a pLisp object (denoted by OBJECT_PTR) before giving the interpreter the go-ahead. The pLisp object model is the subject of another post.

Building a Lisp Interpreter from Scratch


pLisp now has continuations. It took a lot of effort, slogging through the relevant literature, and a significant rewrite of the code, but it's finally done (I have not updated the GitHub repository yet; plan to do it shortly [Update: Done]).

Before proceeding further with other features, I thought I'd do a series of how-to posts on pLisp. These will serve as a sort of de facto documentation of pLisp (beats the 'You want documentation? Go to The Source, my friend' approach), and also provide pointers to folks looking to do their own projects (there's plenty of material out there on building/embedding an interpreter within a Lisp implementation; in my opinion, while this is good, it's a cop-out; such interpreters are meant mainly for illustrating concepts, trade-offs in design choices, etc. We still don't pierce the veil, so to speak. Not to mention the obvious hit in performance).

Here's a tentative list of posts:
Stay tuned.