6 / 3 / 2020

Experimenting with Memory Management for Basil


Hey folks!


First time writing one of these! I'd normally just put a development update on Twitter (@elucentdev), but this topic felt like it was worthy of some greater discussion.

Ever since I began work on the implementation of my programming language, Basil, the problem of dynamic memory has lurked in the background. It's crucial for the implementation of a number of language features, but there's no one answer that achieves everything I want. This post will discuss what my goals were for a memory system, some of the options I considered, details about the option I've currently chosen, and a few extremely-preliminary performance numbers and conclusions.

I'll be talking primarily about memory management today, so this post will assume a basic knowledge of memory structure (stack, heap, etc), virtual memory, and common functions to allocate and manipulate memory (malloc, free, memcpy, etc).

With that said, let's jump into it!

Background


Unless you've been following my development updates (or, quite likely, even if you have), chances are you're not super familiar with the Basil programming language. I've been working on the current iteration for close to six months now, but it's still quite far from being usable. So, here's a quick primer with a few examples:

Basil is stack-based. Expressions are evaluated as terms are pushed onto a stack.

    # computes 1 + 2
    1 2 +   
            
Most stack-based languages support the above expression, where the + function applies to the elements beneath it when pushed. Basil additionally allows a function to apply to an element pushed onto it. In conjunction with partial function application, this means Basil supports lots of different argument orders.

    # all three of these compute the exact same expression
    + 1 2
    1 + 2
    2 1 +
            
To help structure Basil programs, there are also a number of operators in Basil built into the syntax to aid readability. These are processed into a stack-based equivalent using built-in functions by the compiler.

    # two equally-acceptable ways of declaring functions
    even? x ->
        x % 2 == 0
    
    (lambda (odd? x) ((2 x %) 1 ==))

    # using our function, as well as the built-in "if" and "print"
    if 10 even?: print "cool!"
            
I won't go into too much more of the syntax for now (you'll see some more code later). But you should also know that Basil is strongly, statically typed. Type inference allows for the elision of explicit types in many places.

Additionally, Basil exhibits value semantics for all of its types.

Motivations


So, Basil has a very flexible syntax. But due to its static typing, I don't think that should have any real impact on its speed. There aren't too many sticking points in the language's core that would make Basil fundamentally less optimizable than other languages.

...but there are a few. Basil natively supports strings, which can vary in size. Basil also supports function closures, which are extremely important to the language since taking multiple arguments off the stack requires currying. These two types are massively more problematic than any type whose size is known at compile time. Canonically, both would require heap allocation.

Among my goals for Basil is to keep the language lightweight, both the compiler and the executables it generates. At the same time, I also think Basil should be expressive. An idiomatic Basil program should be close to the minimum amount of code necessary to express the problem. A complex runtime compromises the first goal; things like alternate types of functions or strings to avoid allocation compromise the second.

So ultimately, here's what I was looking for, with regards to memory management:

Candidate: Automatic Reference Counting


Reference counting was the first thing I tried, because it's fairly straightforward. I wrote a small general-purpose allocator (about 6kB of x86_64 on my system) in C and included it in a core Basil library. After generating Basil's low-level IR, the compiler would insert reference-count increments and decrements at various points in the program.

There were a few problems with this, though.

First, it's costly! Attaching a reference count to every GC'd object, adding branching operations to increment and decrement counts...it ended up being quite a lot.

Second, it's hard to support refcounted and non-refcounted values with the same type. I tried including a unique refcount value for values like string constants, to signal that they didn't need to be freed - which mostly worked, but only made the aforementioned increment and decrement more costly.

Overall, this was enough for me to put a pin in reference counting, and explore alternatives.

Candidate: Tracing Garbage Collection


Tracing GC is perhaps the canonical best answer to dynamic memory management within a programming language. So I don't mean to knock GC when I say I didn't really consider this that strongly.

There's one simple reason: complexity. Currently Basil binaries require fairly little beyond their actual assembled code. The small core library I mentioned earlier is the only add-on. I won't claim to be an expert on tracing GC, but I'd expect a decent GC implementation to take tens of kilobytes at least, if not substantially more.

But there's another aspect of GC that I'm not a huge fan of, namely that it makes it much more complicated to reason about your memory use. It's harder to predict when objects will be freed, how large individual objects are, without in-depth knowledge of the language's runtime.

I'd like the average programmer to be able to try Basil and understand what it's doing under the hood reasonably well. So I decided not to explore GC, even if there are plenty of potential advantages.

A Different Approach


So far I've discussed two different strategies for creating and destroying objects on the heap. In general, I think a lot of the trade-offs involved with memory management have to do with the nature of the heap itself.

In particular, requesting more memory from the heap is costly, requiring the help of the operating system. Allocators do their best to avoid requesting this memory, so they employ carefully-managed data structures. Managing these data structures requires more code and more processing power, but the overhead is well worth it compared to requesting memory more frequently.

What if we were to avoid the heap entirely?

Well, what other memory do we have available? Our program's data and code sections are usually full of important information, so we can't tamper with them. But our program does get a fairly large stack.

While most programs interact with the stack in a pretty consistent manner, the bounds of the stack are just registers! As long as we don't go beyond the stack limit of our operating system, we can move them wherever we want.

Of course, the stack is generally used for function activation records. Basil has functions, so it would be nice to continue to use the stack for them. What we're looking for is a way to incorporate dynamically-sized objects into the language's ABI, and to do that, I came up with four things we generally need to be able to do:

Passing Arguments


We want to pass an argument to a function. But, we don't know how much memory that object is going to take up. We want a way to represent that object that isn't going to change depending on how large it is.

A pointer would do the trick! We can pass a pointer to the object, no matter how large it is, and access any of its data by reading from the pointer at different offsets. This is pretty straightforward, but remember: Basil exhibits value semantics for all its types. I'll talk about how we preserve those later.


Reading Values


If we pass objects by pointer, reading their values can be done by reading through the pointer. But, how do we know if a read will be safe? What if we read out of bounds of the object?

How about we also keep track of how large the object is? This is important for the remaining two operations as well, but it's important for most read-only operations too - for example, printing a string. Every dynamically-sized object gets a "size tag" in front of its data, which must always be equal to the size of its data in bytes.


This admittedly introduces overhead, but I think it's reasonable. I think it's a simple enough layout that the average programmer could work with it, and an extra size_t of memory per object isn't too bad considering these objects can be arbitrarily large.

Writing to Values


We have a pointer to some object, and we want to modify the object. This is where value semantics need to be preserved: if we modify a value through the pointer naively, we might modify a different value as well. Additionally, it's been legal thus far to use this methodology even for constants! But we can't write to a constant string.

This is where we need the stack. We implement copy-on-write. Whenever we modify an object, we copy it first and rereference our pointer to the new copy. We put this copy on the stack.

The procedure is actually quite simple. We know how large the object is from its size tag, so we increase the size of the stack frame to accommodate a copy of that object.


Then, a hand-optimized memcpy implementation the compiler emits is run to copy over the data.


Now we have a local copy of the object that we can modify independently of its source! Finally, we perform our modification.


Returning a Value


We have a pointer to some object. It may or may not have been copied to a local slot on the stack, and we don't know the size of the object ahead of time. This is where alloca would fail in C: how do we take our object and keep it alive as we leave the stack frame?

This procedure is fairly complex. Instead of returning, we take the return address and previous base pointer and save them to registers.


Next, we copy the object from its previous position to the edge of the previous stack frame. Its memory is now adjacent to the previous function's.


Next, we set the stack pointer to the address of our copied object. This shrinks the current stack frame so that it only contains the copied object.


Finally, we use the registers we wrote to earlier to restore the return address and base pointer. We've returned to the previous stack frame, but expanded it to accommodate our new object!


This might seem complicated - and relatively, it is - but it can be done fairly quickly. Because it happens at the end of the function, we're free to use any caller-saved register. There are more than enough registers to implement this whole procedure without any extraneous memory use or function calls.


So, ultimately, this is what I went with. By using the stack we're provided when our program starts, we avoid ever having to ask the operating system to map more memory for us. The concept of the call stack is common to most microarchitectures, so it should be fairly portable. And far and away the most expensive operation involved is the copy, something that would be necessary anyway to preserve value semantics when writing to an object.

Improvements


Of course, there are still some trade-offs, and the naive implementation I described above has a number of issues. Let's talk about some of them!

Avoiding Unnecessary Copies


As described, whenever we write to a reference, we make a copy. This is completely overkill. We can avoid loads of unnecessary copies with some simple analysis.

When do we actually run the risk of violating value semantics? Well, we don't want to modify something that has multiple references. This could be a constant or a different variable. Specifically, we don't want to modify its data, because then another reference could read that data and see that it's been changed somewhere else.

    x = "hello"
    x[0] = 'y'
    y = x

    # x needs to copy again here to avoid modifying y
    x[0] = 'm'
            
Only when there's a potential conflict like this is a copy truly necessary.

Reducing Memory Usage


Probably the biggest present issue with this stack-based allocation system is its high memory usage. While any form of garbage collection would be able to clean up values, the stack is linear. Within a stack frame, temporary values pile up more and more, until either the function returns or the stack overflows.

The high memory usage isn't a huge problem in itself, as the stack never gets particularly fragmented, and the operating system doesn't have to map more heap memory for our program on the fly. But in recursive functions, larger stack frames can be serious limiters on the possible depth of the function.

Even if we ignore recursive functions, there's still some particularly egregious examples of stack overflows occurring right now where they really shouldn't:

    # this loop will allocate one temporary per iteration and 
    # end up overflowing the stack well before it ends

    s = "a"
    i = 0
    while i < 1000000:
        s = s + "a"
        i = i + 1
            
It's apparent that the previous value of s can be discarded at the end of each iteration. But there's no easy way for the stack to deallocate it while keeping the new value intact.

I don't actually have a great answer to this at the moment. It's mitigated somewhat by lower overall memory usage, so the previous improvement helps here. But there are still cases that easily overflow. I have a few ideas but none detailed enough to write here at the moment.

Containers of Pointers


Basil supports arrays. What happens if we return an array of strings? As described, the array would be returned, copying over the pointers...which might point into the closed stack frame. Not good.

There's a clear solution here, but it's a headache, Essentially, Basil needs to support deep copy functionality for every composite type that contains dynamically-sized objects. If I have the aforementioned array of strings, Basil needs to traverse through it and copy each string, then coalesce them together to be added to the previous stack frame. I'm not sure I can think of a shortcut, so this is not something I'm looking forward to implementing...

Performance


On a more positive note, I did implement all of this behavior in the current Basil compiler. And while it's a bit too early to make serious assertions about it's behavior, I couldn't help seeing how stack-based allocation performed compared to other languages.

So, I implemented the following microbenchmark in Basil:

    f() ->
        a = "hello world"
        b = "lovely weather we're having"
        i = 0
        while i < 100000:
            t = a
            a = b
            b = t
            i = i + 1
        a
    
    j = 0
    while j < 1000:
        f() print
        j = j + 1
            
f() swaps two strings 100,000 times, then returns the first. The program calls f() 1,000 times and prints its result. I turned off all ownership analysis in Basil, so every assignment copies a string to a temporary and points the variable at that temporary. I basically just wanted to compare copying and creating strings, so this program is admittedly a bit contrived.

Anyway, I implemented the same program in C++:

    #include <string>
    #include <cstdio>
    
    using std::string;
    
    string f() {
        string a = "hello world";
        string b = "lovely weather we're having";
        int i = 0;
        while (i < 100000) {
            string t = a;
            a = b;
            b = t;
            i = i + 1;
        }
        return a;
    }
    
    int main(int argc, char** argv) {
        for (int i = 0; i < 1000; i ++) {
            printf(f().c_str());
        }
    }
            
...and in OCaml:

    open String

    let f () =
            let a = ref "hello world" in
            let b = ref "lovely weather we're having" in
            let i = ref 1 in
            while !i < 100000 do
                    let t = String.copy !a in
                    a := String.copy !b;
                    b := String.copy t;
                    i := !i + 1
            done;
            !a
    
    let () =
            let j = ref 1 in
            while !j < 1000 do
                    print_string (f ());
                    j := !j + 1
            
Again, these are contrived programs, and I know this isn't the best way to implement this in either language. But I wanted to specifically test allocation and copying here. C++ uses RAII and a built-in allocator to manage string memory, while OCaml has a tracing garbage collector. So I figure they're reasonably good examples of two alternative memory management schemes.

I compiled the C++ program with -O3, and the OCaml program with ocamlopt. Then, I timed all three executables on my laptop:


Pretty good!...if I do say so myself. As much as I'd love to brag about Basil being four times faster than -O3 C++, none of these languages are really being used correctly. But, this does demonstrate to me that there is a large potential performance boost in using the stack over an allocator.

Just for kicks, I popped both the Basil and C++ binaries in perf. Here's the C++ results:


At a glance, it looks like about 14% of the execution time was spent copying memory. Whereas malloc and free appear to have taken up about 35%! In comparison, Basil's results were just about what you'd expect:


70% of runtime spent copying memory, 30% in the inner function (which includes actually allocating the stack memory), and nothing in the main loop. Copying the strings dominates by far. Interestingly, the absolute time spent copying strings in both programs was about the same - roughly a half second. Which also isn't surprising - memcpy is pretty simple to implement effectively - but it's nice to see it reflected in some numbers.

Conclusion


Jeez. I wrote a lot more than I initially intended. If you made it to the end, thanks so much! I hope it was at least kind of interesting.

So, what does this all mean? What's the point of me writing this post? Well, I'm still in the very early stages of implementing this whole procedure, and there's still a bunch of holes. But I'm not familiar with any other language using this method for memory management. And despite some limitations, there are some clear cases where it results in a very large speedup compared to a traditional allocator.

I think the main point I'd like to demonstrate here is that first-class dynamically-sized objects can be entirely managed using stack memory. Whether or not it's better than using the heap for X or Y is a difficult question to answer. But I think there's potential in this technique. It could be used alongside a traditional allocator, perhaps as a "fast path" for small, single-allocation objects. Or it could be used in lieu of an allocator on an embedded system, potentially saving kilobytes of program memory.

For Basil, I'm just glad I have a system that works at the moment. Being able to easily handle dynamic objects is the first step towards implementing things like closures. So I'm going to be sticking with just stack-based allocation at the moment as I work on implementing those essential features. The language isn't at the stage of being "polished" or "practical" yet, so for now I'm just trying to get things working.

If you enjoyed this post, awesome! Let me know if you'd like to see more in the future, I could go in-depth on a number of PL-related topics that have come up while working on Basil.

If you're interested in the project, feel free to gawk at the horrible coding practices on the Basil GitHub page. Or follow me on Twitter! I post a lot of smaller development updates there.

Have a nice day!