Monday, May 06, 2013

Building a Lisp Interpreter from Scratch -- Part 8: Garbage Collection

(This is Part 8 of a series of posts on pLisp)

(Note: This post is somewhat out of sync with the code base; please see Part 13)

There are many garbage collection algorithms to choose from. We go with tri-color marking, which is the second simplest: mark-and-sweep is simpler, but it needs one bit set aside in every object, which we cannot do.

Tri-color marking works like this: we have three sets of objects -- white, grey and black. The white set consists of all objects that can be garbage collected, while the black set contains objects that should not be (because somebody is holding on to references to these objects). The grey set contains objects that we're not sure about yet.

Before moving on, a word about the when of garbage collection: we do GC after every read-eval-print cycle of the REPL.

Whenever an object is created (through object_alloc()), it is added to the white set. If, after execution of the REPL, we find that it's still in the white set, it is GC'd. Objects save themselves by first getting themselves promoted to the grey set and then to the temporary haven that is the black set. Temporary because, as they say, you're only as good as your last hit, so beware the next call of gc() -- nobody may be holding your hand when the music stops.

The grey set starts off with all the root references. A root reference is an object that we're sure is reachable and hence cannot be GC'd. In the case of pLisp, the root references are all the top-level objects, conveniently captured in the top_level_env CONS object. Thus our grey set starts with this single object.

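For each object in the grey set (initially just the root references), we do the following:
  1. Move it to the black set
  2. Move all the objects it directly references that are still in the white set to the grey set.
The above steps are repeated till we end up with no more objects in the grey set. Objects that are already grey or black are left alone in step 2; otherwise a cycle of references would keep the loop going forever.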
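
The marking loop can be sketched in a few lines of C. This is only an illustration, not pLisp's actual code: the toy_object struct, the set_of side table and the mark() function are made up for this example, objects are plain array indices, and a colour-per-object array stands in for the three sets.

```c
#include <assert.h>

#define MAX_OBJECTS 16
#define MAX_REFS    2
#define NO_REF      (-1)

enum color { WHITE, GREY, BLACK };

/* Hypothetical toy object: only the indices of the objects it references. */
struct toy_object {
    int refs[MAX_REFS];
};

/* Side table standing in for the white, grey and black sets. */
static enum color set_of[MAX_OBJECTS];

/* Marks everything reachable from root as BLACK and returns the number of
   objects left in the white set, i.e. the garbage. */
int mark(const struct toy_object *heap, int n_objects, int root)
{
    int grey[MAX_OBJECTS];   /* worklist holding the grey objects */
    int n_grey = 0;
    int n_white = n_objects;

    assert(n_objects <= MAX_OBJECTS && root >= 0 && root < n_objects);

    for (int i = 0; i < n_objects; i++)
        set_of[i] = WHITE;   /* every object starts out white */

    set_of[root] = GREY;     /* the grey set starts with the root */
    grey[n_grey++] = root;
    n_white--;

    while (n_grey > 0) {
        int obj = grey[--n_grey];
        set_of[obj] = BLACK;                     /* step 1 */
        for (int k = 0; k < MAX_REFS; k++) {     /* step 2 */
            int ref = heap[obj].refs[k];
            if (ref != NO_REF && set_of[ref] == WHITE) {
                set_of[ref] = GREY;
                grey[n_grey++] = ref;
                n_white--;
            }
        }
    }
    return n_white;
}
```

Since an object enters the worklist only when it leaves the white set, it is pushed at most once, so the loop terminates even when objects reference each other in a cycle.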

Now GC is just a simple matter of freeing up RAW_PTRs corresponding to the objects remaining in the white set.

We use a binary search tree to store the contents of the three sets:

struct node
{
  OBJECT_PTR key;      /* field name assumed; see the code base for the exact definition */
  struct node *left;
  struct node *right;
} ;

with the OBJECT_PTR values serving as the key. A slight wrinkle is needed to keep the BST reasonably balanced when we remove a node with both children present: we toss a coin to determine whether we go down the left or the right subtree.
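
Here is a minimal sketch of such a removal, to make the coin toss concrete. It is written for this post and is not lifted from the pLisp sources: the OBJECT_PTR typedef, the key field name and the bst_insert/bst_contains/bst_remove functions are all assumptions, and "going down" a subtree is interpreted as picking the replacement key from it (the in-order predecessor on the left, the in-order successor on the right).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t OBJECT_PTR;   /* assumption: pLisp's real typedef may differ */

struct node {
    OBJECT_PTR key;             /* assumed field name */
    struct node *left;
    struct node *right;
};

struct node *bst_insert(struct node *root, OBJECT_PTR key)
{
    if (root == NULL) {
        struct node *n = malloc(sizeof *n);
        assert(n != NULL);
        n->key = key;
        n->left = n->right = NULL;
        return n;
    }
    if (key < root->key)
        root->left = bst_insert(root->left, key);
    else if (key > root->key)
        root->right = bst_insert(root->right, key);
    return root;                /* duplicate keys are ignored */
}

int bst_contains(const struct node *root, OBJECT_PTR key)
{
    while (root != NULL) {
        if (key == root->key)
            return 1;
        root = key < root->key ? root->left : root->right;
    }
    return 0;
}

struct node *bst_remove(struct node *root, OBJECT_PTR key)
{
    if (root == NULL)
        return NULL;
    if (key < root->key) {
        root->left = bst_remove(root->left, key);
    } else if (key > root->key) {
        root->right = bst_remove(root->right, key);
    } else if (root->left == NULL || root->right == NULL) {
        /* zero or one child: splice the node out */
        struct node *child = root->left ? root->left : root->right;
        free(root);
        return child;
    } else if (rand() % 2 == 0) {
        /* heads: take the largest key of the left subtree */
        struct node *p = root->left;
        while (p->right != NULL)
            p = p->right;
        root->key = p->key;
        root->left = bst_remove(root->left, p->key);
    } else {
        /* tails: take the smallest key of the right subtree */
        struct node *s = root->right;
        while (s->left != NULL)
            s = s->left;
        root->key = s->key;
        root->right = bst_remove(root->right, s->key);
    }
    return root;
}
```

Always taking the replacement from the same side would slowly thin out that side over many removals; the coin toss spreads the damage evenly. It does not guarantee a balanced tree the way an AVL or red-black tree would.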

How to determine the objects referenced by a given object is pretty straightforward:
  1. If it's a CONS object, check its CAR and CDR
  2. If it's a closure or a macro object, it will reference three CONS objects: a parameter list, the enclosing environment, and the body of the closure/macro.
  3. For array objects, check all the array elements
  4. If it's a continuation object, the CONS object corresponding to the current call stack will be the referenced object (remember the trickery we used to store the current call stack efficiently in the continuation object?)
  5. Objects of other types (character, integer, float, symbols, strings) do not reference other objects.
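
The five rules above boil down to a switch on the object's type. The sketch below uses a made-up object model (a struct obj with an explicit type field and real pointers) purely to keep the example readable; pLisp itself works with tagged OBJECT_PTR values, and none of the names here (struct obj, the T_* constants, referenced_objects) come from its code base.

```c
#include <assert.h>
#include <stddef.h>

enum obj_type { T_CONS, T_CLOSURE, T_MACRO, T_ARRAY, T_CONTINUATION,
                T_INTEGER, T_FLOAT, T_CHARACTER, T_SYMBOL, T_STRING };

/* Hypothetical object layout, for illustration only. */
struct obj {
    enum obj_type type;
    union {
        struct { struct obj *car, *cdr; } cons;
        struct { struct obj *params, *env, *body; } fn;   /* closure or macro */
        struct { size_t length; struct obj **elems; } array;
        struct { struct obj *call_stack; } cont;
    } u;
};

/* Writes the objects directly referenced by o into out (capacity max)
   and returns how many there are. */
size_t referenced_objects(const struct obj *o, struct obj **out, size_t max)
{
    size_t n = 0;

    switch (o->type) {
    case T_CONS:                       /* rule 1: CAR and CDR */
        assert(max >= 2);
        out[n++] = o->u.cons.car;
        out[n++] = o->u.cons.cdr;
        break;
    case T_CLOSURE:                    /* rule 2: params, environment, body */
    case T_MACRO:
        assert(max >= 3);
        out[n++] = o->u.fn.params;
        out[n++] = o->u.fn.env;
        out[n++] = o->u.fn.body;
        break;
    case T_ARRAY:                      /* rule 3: every element */
        assert(max >= o->u.array.length);
        for (size_t i = 0; i < o->u.array.length; i++)
            out[n++] = o->u.array.elems[i];
        break;
    case T_CONTINUATION:               /* rule 4: the saved call stack */
        assert(max >= 1);
        out[n++] = o->u.cont.call_stack;
        break;
    default:                           /* rule 5: nothing referenced */
        break;
    }
    return n;
}
```

The marking phase would call something like this for each grey object and move whatever comes back from the white set to the grey set.
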
The memory needed to store symbols and strings does not come from the heap, but from a dynamic (char **) array called 'strings'. pLisp does not do any GC for this yet; maybe in the future.

Integers and floats, while not referencing other objects, are actually allocated on the heap. Lopping off the tag from objects of these two types yields a RAW_PTR that indexes into the heap; the relevant location holds the 32 bits that make up the integer/float value of the object. This is quite inefficient, since the same number may be stored multiple times on the heap, but it is required if we want to leverage the full 32 bits to store the integer (only 28 bits were being used earlier [see Part 3]). Maybe we can use the flyweight pattern or something.
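
To make the boxing scheme concrete, here is a rough sketch for integers. Several details are guesses made for the sake of the example: a 4-bit tag in the low bits (inferred from 32 minus 28), the tag value itself, "lopping off" meaning a right shift, the 32-bit typedefs, and a bump allocator over a fixed array in place of the real heap. The function names box_integer and unbox_integer are likewise invented.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TAG_BITS    4       /* assumption: 32 - 28 */
#define TAG_MASK    0xFu
#define INTEGER_TAG 0x7u    /* hypothetical tag value */
#define HEAP_WORDS  1024

typedef uint32_t OBJECT_PTR;   /* assumed widths */
typedef uint32_t RAW_PTR;

static uint32_t heap[HEAP_WORDS];
static RAW_PTR next_free = 0;  /* toy bump allocator */

/* Stores all 32 bits of value in a fresh heap word and returns a
   tagged reference to that word. */
OBJECT_PTR box_integer(int32_t value)
{
    assert(next_free < HEAP_WORDS);
    RAW_PTR p = next_free++;
    memcpy(&heap[p], &value, sizeof value);
    return (p << TAG_BITS) | INTEGER_TAG;
}

/* Lops off the tag to get the RAW_PTR, then reads the value back. */
int32_t unbox_integer(OBJECT_PTR obj)
{
    int32_t value;
    assert((obj & TAG_MASK) == INTEGER_TAG);
    memcpy(&value, &heap[obj >> TAG_BITS], sizeof value);
    return value;
}
```

Boxing the same number twice yields two different OBJECT_PTRs pointing at two heap words, which is exactly the duplication lamented above; a flyweight-style table of already-boxed values would hand back the existing object instead.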