Tuesday, April 02, 2013

Building a Lisp Interpreter from Scratch -- Part 1: Syntax and Parsing

(This is Part 1 of a series of posts on pLisp)

OK, first things first. Before we can start interpreting our lisp code, we need to parse it. Conventional wisdom says that we don't need to use tools like Flex and Bison for this, since Lisp syntax is simplicity itself, and we can roll out our own parsing and scanning functionality. We're not bowing to conventional wisdom; at least I didn't, so Flex and Bison it is. I can hear somebody in the back benches muttering "so much for the 'from scratch' bit": I assure you, this is the only place where we don't roll out our own (actually there are two other pieces of functionality where we make use of third party code, but those are orthogonal to or not directly related what we're trying to achieve, so they don't count. Sue me).

The tokens we're interested in are:
  • Symbols (abc, x, ...)
  • Integers
  • Floating point numbers
  • String literals ("Hello world")
  • Character literals (#\a)
  • Left and right parentheses
  • Quote
  • Back quote
  • Comma
  • Comma-At
Inquiring minds can now mosey over to GitHub for a look at plisp.lex.

In addition to these, we also handle single- and multi-line comments, ignore white spaces and arrow key presses.

A Lisp form begins and ends with parentheses. When all the currently open parentheses are closed, it means the interpreter has something to start interpreting. Needless to say, if you pass the interpreter a parenthesis-less form, i.e., a symbol, a literal or a number constant, this will also be interpreted.

Before we feed the interpreter the form in a, well, form suitable for interpretation, we need to discern its structure. This structure is represented by a typedef (oh, by the way, now would be as good a time as any to mention this: we're doing the whole thing in C):

typedef struct expression
{
  int type;
  char *package_name;
  char *atom_value;
  int integer_value;
  float float_value;
  char char_value;
  int nof_elements;
  struct expression **elements;
} expression_t;

The type member indicates what is the type of the expression, captured in these #define's:

#define SYMBOL 1
#define LIST 2
#define INTEGER 3
#define STRING_LITERAL 4
#define CHARACTER 5
#define FLOAT 6

(Source)

The rest of the fields store the value of the expression, depending on its type (e.g. atom_value will be for the case where the expression is a symbol, char_value for when it is a single character, and so on). The elements member stores the elements in case the expression is a list. An added wrinkle is the package name, to handle cases where the symbol name is preceded by the package name, as in 'math:power').

The conversion of the token stream into an expression_t structure is done in plisp.y, the scanner. The distilled logic is as below:

expression: atom | list;
atom: integer | float | string_literal | character_literal | symbol;
list: expressions_in_parens | quoted_expression | backquoted_expression | comma_expression | comma_at_expression;
expressions_in_parens: left_paren expressions right_parens;
quoted_expression: quote expression;
backquoted_expression: backquote expression;
comma_expression: comma expression;
comma_at_expression: comma_at expression;
expressions: /* empty */ | expressions expression;

On a side note, by suitable manipulation of the yyin variable, we can feed the interpreter input from stdin, or from a file of our choosing (e.g. to load the pLisp library).

Once yyparse() has had its way with the input, the interpreter is all set to begin. But we'll delay things a bit more, for a very good reason: in Lisp, code and data are the same (the fancy word for this is homoiconicity), so we get a lot of bang for our buck if we convert an expression_t object to a pLisp object (denoted by OBJECT_PTR) before giving the interpreter the go-ahead. The pLisp object model is the subject of another post.