Monday, June 23, 2008

Would you like some cream with that?

Just a quick update: The memory manager now organizes small objects of the same memory type in blocks; a client needs to register a memory type before the first allocation; for larger objects the memory type is stored right in front of the object itself (similar to Nickle's scheme). This allows me to store just the addresses instead of pairs of object and memory-type addresses. Here is brief documentation of the current API.
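
To give a flavor of it, here is a minimal sketch of how such an interface might look; the identifiers are illustrative, not necessarily the actual names in the source:

    #include <stddef.h>

    typedef int mt_t;                     /* handle for a registered memory type */
    typedef void clear_func_t(void *);    /* zero out a fresh object             */
    typedef void mark_func_t(void *);     /* mark the objects it references      */

    /* A memory type must be registered before the first allocation. */
    mt_t mm_regtype(const char *name, size_t size,
                    clear_func_t *clear, mark_func_t *mark);

    /* Small objects of the same memory type come from shared blocks;
       for larger objects the memory type goes into a header stored
       right in front of the object. */
    void *mm_alloc(mt_t mt);              /* fixed-size, block-allocated  */
    void *mm_malloc(mt_t mt, size_t n);   /* variable-size, with header   */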

The next steps are (1) extending the API with macros for tracking transient objects, (2) tearing open and stitching back together the Lush interpreter, (3) debugging, debugging, debugging, and (4) implementing asynchronous collection.
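
For step (1), the rough idea is a stack of transient roots that the collector scans. A sketch of what such macros could look like (these names are hypothetical, since this part is not written yet):

    extern void mm_push_root(void *p);    /* assumed runtime support */
    extern void mm_pop_roots(int n);

    #define MM_ENTER        int __mm_nroots = 0
    #define MM_PROTECT(p)   (mm_push_root(p), __mm_nroots++)
    #define MM_EXIT         mm_pop_roots(__mm_nroots)

A function would open with MM_ENTER, protect each freshly allocated object with MM_PROTECT, and pop all of its roots at once with MM_EXIT.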

Since there are no special functions for dealing with ATs anymore, I have a feeling this could become a library...

Tuesday, June 17, 2008

Making ATs tiny

As I almost said in my last post, ATs are used almost everywhere in the interpreter. They are a kind of abstract reference that carries runtime type information and a pointer to a Lisp object. Small and ubiquitous objects like numbers and conses, though, are stored in the AT itself. In other words, when an AT references a number, the AT does not contain a pointer to a number object but the number itself, saving the cost of one level of indirection.

Another important role of ATs is that they are the handles to all objects for the current memory management system, which is based on reference counting. An AT contains a reference count to keep track of the number of references from other objects to the object the AT represents. Doing away with reference counting is one of the goals of this project, and so ATs will become smaller because we no longer need to store reference counts. On a 32-bit platform the current size of an AT is 16 bytes (4 words), which will drop to 12 bytes (3 words) without the reference count.
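
Roughly, the 32-bit layout looks like this (a sketch with field names of my own choosing; flag bits and other details elided):

    struct class_s;              /* runtime type structure */

    struct at {
       int             count;    /* reference count: goes away with the new GC  */
       struct class_s *cl;       /* pointer to the runtime type structure       */
       union {
          double       number;   /* numbers stored directly (2 words on 32-bit) */
          struct {
             struct at *car;
             struct at *cdr;
          } cons;                /* conses stored directly as well              */
          void        *p;        /* pointer to any other kind of object         */
       } payload;
    };                           /* 4 words now, 3 words without count          */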

Could ATs be made smaller still? If we were pure and did not treat small objects specially, then an AT could be just a pair of pointers---one pointer to the data representing an object and one pointer to a structure representing the object's type. An AT would be just 2 words in size (in fact, with 2 words, numbers and gptrs could still be represented directly). Would redefining ATs in this way be a good idea? Or would the level of indirection introduced for conses result in a noticeable performance hit?

What's the structure of a cons? A cons is just a pair of pointers (pointers to two other ATs). Note that this is the same structure as the "micro AT" just outlined. This leads me to wonder whether it might be possible to represent conses and ATs using the same 2-word structure. An AT representing a cons would just be the cons itself rather than a reference to it. But how, then, could I distinguish ATs that are conses from ATs that are references to other kinds of objects?
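
One classical trick that could answer this (just a sketch of the idea, with my own names, under the assumption that heap pointers are at least 4-byte aligned): steal a low bit of the second word as a cons tag.

    #include <stdint.h>

    struct class_s;                     /* runtime type structure          */

    typedef struct at {
       struct at *word0;                /* data pointer, or car            */
       uintptr_t  word1;                /* type pointer, or tagged cdr     */
    } at_t;

    #define CONS_TAG ((uintptr_t)1)     /* low bit is free due to alignment */

    static inline int  at_is_cons(const at_t *a) { return (a->word1 & CONS_TAG) != 0; }
    static inline at_t *at_car(const at_t *a)    { return a->word0; }
    static inline at_t *at_cdr(const at_t *a)    { return (at_t *)(a->word1 & ~CONS_TAG); }
    static inline struct class_s *at_class(const at_t *a)
    {
       return (struct class_s *)a->word1;   /* valid only when !at_is_cons(a) */
    }

The price would be a mask operation on every cdr access and every type lookup.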

I will give that some thought. But today I will start with simplifying ATs by throwing out flag bits that are currently used for fast runtime type checking but are not really necessary because the type is also encoded by a pointer to the type structure. My plan is to simplify the AT structure as much as possible and then switch to the new memory manager that I have been growing in a new source file but which is not linked in yet.

Wednesday, June 11, 2008

ATs are special

The most commonly used (dynamically allocated) object in the interpreter is the AT structure, which represents conses, numbers, and pointers. ATs may well make up 70% or more of all objects the interpreter has to handle (my guesstimate).

Since they come in such large numbers, I treat ATs specially and allocate them in a dedicated list of blocks (Lush currently does this, btw). This helps alleviate the increased resource demand that comes with the new garbage collection scheme. First, I do not need to store address-memtype associations for AT objects. This saves some space and decreases the n in the O(log n) lookup time for all other addresses. Second, the mark bits for ATs are stored in the blocks themselves, and marking an AT can be done in essentially the same way as in Nickle. That is, constant-time marking for most managed objects!
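
Constant-time marking falls out of the block layout. A sketch, assuming blocks are aligned to their (power-of-two) size and keep a mark bitmap in their header (names and sizes are mine):

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE 4096                       /* assumed power of two   */
    #define AT_SIZE    16                         /* current 32-bit AT size */

    typedef struct block {
       uint8_t marks[BLOCK_SIZE / AT_SIZE / 8];   /* one mark bit per slot  */
       /* ... the AT slots follow ... */
    } block_t;

    static inline void mark_at(void *p)
    {
       /* mask down to the block header, then index by slot number */
       block_t *b = (block_t *)((uintptr_t)p & ~((uintptr_t)BLOCK_SIZE - 1));
       size_t   i = (size_t)((uintptr_t)p - (uintptr_t)b) / AT_SIZE;
       b->marks[i / 8] |= (uint8_t)(1u << (i % 8));
    }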

Monday, June 2, 2008

Lessons from Week One

Summer of Code has been on for a week now. I have spent it reading and thinking more than coding, but I am now much clearer about the design for Lush's new memory manager. As I said in my project proposal, Nickle's garbage collector is my primary model, but a few extra requirements force me to do some things differently for Lush.

The first requirement is that the collector should utilize a second processor. Keith Packard suggested forking a child process to do the collection. The child would do the mark & sweep and send the addresses of non-live heap objects back to the parent process via a pipe. The parent then does the actual reclamation (rebuilding the free lists) in an incremental fashion (e.g., within the interpreter's idle function or as part of alloc). Forking creates a lazy, copy-on-write copy of the parent's address space, so the cost is just that of duplicating the parent's page tables and creating the child's task structure.
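
In outline, the idea looks something like this (a minimal sketch; mark_and_sweep and reclaim are placeholders, not actual Lush functions):

    #include <unistd.h>
    #include <sys/types.h>

    extern void mark_and_sweep(void (*report)(void *));
    extern void reclaim(void *addr);        /* rebuild free lists, one object */

    static int gc_pipe[2];

    static void report_dead(void *addr)
    {
       write(gc_pipe[1], &addr, sizeof(addr));
    }

    int start_collect(void)
    {
       if (pipe(gc_pipe) == -1)
          return -1;
       if (fork() == 0) {                   /* child sees a frozen heap (COW) */
          close(gc_pipe[0]);
          mark_and_sweep(report_dead);
          close(gc_pipe[1]);
          _exit(0);
       }
       close(gc_pipe[1]);
       return gc_pipe[0];                   /* parent reads from this end     */
    }

    /* Called from the idle function: reclaim at most 'budget' objects;
       the read end is assumed to be set to non-blocking elsewhere. */
    void reclaim_some(int fd, int budget)
    {
       void *addr;
       while (budget-- > 0 && read(fd, &addr, sizeof(addr)) == (ssize_t)sizeof(addr))
          reclaim(addr);
    }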

So far so good. Now Nickle's GC stores the mark bit right next to the memory object itself. This means that during mark and sweep, the GC needs to do memory writes for all live objects, which are scattered all over the heap. In the child process, every one of those writes would trigger a copy-on-write fault, effectively copying most of the heap and throwing fork's low-cost benefit out of the window. I therefore need to store the mark bits somewhere else, and compactly so in memory. Further, the GC needs to find the mark bit for a given object (address) quickly, or the mark phase will take forever. My solution is to store object address--GC data associations in a dense array, sort the array at the beginning of a collection cycle, and do binary search for the lookup.
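
A sketch of that lookup structure (names are mine; entries are assumed to be appended to the array at allocation time):

    #include <stdlib.h>

    typedef struct {
       void    *addr;        /* object address                 */
       int      memtype;     /* registered memory type         */
       unsigned marked;      /* GC mark, kept out of the heap  */
    } gc_entry_t;

    static gc_entry_t *entries;
    static size_t      n_entries;

    static int cmp_addr(const void *a, const void *b)
    {
       const gc_entry_t *ea = a, *eb = b;
       return (ea->addr > eb->addr) - (ea->addr < eb->addr);
    }

    void gc_begin_cycle(void)          /* sort once per collection cycle */
    {
       qsort(entries, n_entries, sizeof(gc_entry_t), cmp_addr);
    }

    gc_entry_t *gc_lookup(void *addr)  /* O(log n) binary search */
    {
       gc_entry_t key = { addr, 0, 0 };
       return bsearch(&key, entries, n_entries, sizeof(gc_entry_t), cmp_addr);
    }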

Nickle's GC scheme allows very fast (constant time) access to the mark bit. My solution requires O(log n) time per lookup in an array with n addresses. It is a little ironic that utilizing a second processor requires doing much more work. It is quite possible that Nickle's scheme, using only one processor, does actually perform better in practice than the two-processor solution (however, I do not feel inclined to do the experiment and implement both schemes).

More on this later.