home blog portfolio Ian Fisher

Type-safe generic data structures in C

7 June 2020 c 4 programming-languages 4
Subscribe

The rise of a new generation of low-level programming languages like Rust, Go and Zig has caused C and its primitive type system to fall into some disrepute. Nonetheless, with sufficient creativity it is possible to achieve surprisingly sophisticated results in C. One such result is generic data structures. This post reviews two techniques for implementing generic data structures in C: unsafely using raw memory and pointer casts, and safely using code generation through macros.1

Note: This post is intended for educational purposes only; use of these techniques in production C code is discouraged. For a more robust implementation, see tylov/STC on GitHub.

Warm-up: an int stack

The generic data structure we will be implementing is a stack. As a warm-up, we'll write a regular, non-generic stack that only works for int values. Our minimalist stack will support only two operations, push and pop—not the most useful data structure, but enough to cover the fundamental challenges of data structure implementation.

Our stack's internal state consists of a length, a capacity, and a pointer to the stack's heap-allocated data. The length is the number of elements in the stack while the capacity is the maximum number of elements that the allocated memory could hold (not the size of the allocated memory in bytes, which is capacity * sizeof(int)).

typedef struct {
    size_t len, capacity;
    int* data;
} IntStack;

Aside: You can find the full code for this post here.

We'll start with the push operation. First, the stack is resized if there is not enough capacity for an additional element. Then, the stack's length is incremented and the value is written to the end of the data array:2

void IntStack_push(IntStack* stck, int value) {
    if (!stck) {
        return;
    }

    if (stck->len + 1 > stck->capacity) {
        /* TODO: Handle arithmetic overflow. */
        size_t new_capacity = stck->capacity * 2;
        int* new_data = realloc(stck->data, new_capacity * sizeof(int));

        if (!new_data) {
            /* TODO: Handle memory error. */
            return;
        }

        stck->capacity = new_capacity;
        stck->data = new_data;
    }

    stck->len++;
    stck->data[stck->len - 1] = value;
}

Note that realloc may return a null pointer if it cannot re-allocate the memory, so it's not safe to assign the return value to stck->data without checking if it is null first. Also note that the use of realloc assumes that stck->data has been previously allocated with malloc, an assumption that will be upheld by the constructor that we'll write later.

The pop operation decrements the length field and returns the former last element. The return value is wrapped in an IntResult data structure, since in the case that stck is null or empty, there is no popped value to return.3

IntResult IntStack_pop(IntStack* stck) {
    if (!stck || stck->len == 0) {
        return IntResult_error();
    }

    stck->len--;
    return IntResult_of(stck->data[stck->len]);
}

IntResult and its constructors are defined as follows:

typedef struct {
    bool error;
    int result;
} IntResult;

IntResult IntResult_of(int v) {
    IntResult r = { .error = false, .result = v };
    return r;
}

IntResult IntResult_error() {
    IntResult r = { .error = true };
    return r;
}

We'll also provide a constructor for the convenient creation of IntStack objects. It initializes the stack with a length of 0 and a small initial capacity of heap-allocated memory.

IntStack IntStack_new() {
    size_t capacity = 8;
    int* data = malloc(capacity * sizeof(int));
    if (!data) {
        /* TODO: Handle memory error. */
    }
    IntStack stck = { .len = 0, .capacity = capacity, .data = data };
    return stck;
}

Since the constructor allocates memory from the heap, we need a corresponding destructor to free it:

void IntStack_free(IntStack* stck) {
    if (stck) {
        free(stck->data);
    }
}

And with that, we have a minimal but complete stack class:

IntStack int_stack = IntStack_new();
IntStack_push(&int_stack, 1);
IntStack_push(&int_stack, 2);
IntResult r = IntStack_pop(&int_stack);
assert(!r.error);
assert(r.result == 2);
IntStack_free(&int_stack);

Unsafe generic stack

IntStack only works for int values. If you wanted a char stack, you would have to write a another stack implementation. And if you did so, you would find that the code for CharStack is nearly identical to the code for IntStack, because the stack, like most container data structures, doesn't manipulate its elements in any way, it just stores them. The only thing it needs to know about them is how much memory each one of them occupies. So rather than writing different stacks for every element type, let's try writing a stack that works for any element type.

The critical insight is that, even though at compile time neither the type nor the size of the elements is known, the elements of the stack can still be stored as unstructured binary data as long as we keep track of how much memory each element occupies.

In C, binary data can be stored by in an array of chars. To give an example, suppose we define a Point type:

typedef struct {
    int x, y;
} Point;

We can store a Point object in a char array like this, using memcpy to copy the bytes into the array:

// Initialize an array with more than enough capacity.
char data[100];

// Initialize a Point object.
Point p = { .x = 42, .y = 43 };

// Copy the Point object into the array.
memcpy(data, &p, sizeof(Point));

// Print the bytes of the array.
for (size_t i = 0; i < sizeof(Point); i++) {
    printf("data[%ld] = %d\n", i, data[i]);
}

On my system, the for loop prints

data[0] = 42
data[1] = 0
data[2] = 0
data[3] = 0
data[4] = 43
data[5] = 0
data[6] = 0
data[7] = 0

But the individual elements of the array must be treated as opaque because the memory layout of Point is up to the compiler.

This technique is the foundation of our generic stack type, UnsafeStack.4 UnsafeStack has a char* data field instead of int* data, and an additional objsize field to track how many bytes each object occupies:

typedef struct {
    size_t len, capacity;
    size_t objsize;
    char* data;
} UnsafeStack;

The logic of resizing the array in UnsafeStack_push is similar to that of IntStack_push. As we saw in the Point example, we have to use memcpy to copy the value to the end of the stack, rather than assigning it directly to an index of the array, because the value could be arbitrarily large. UnsafeStack_push accepts a value of type void* so that any objects of any type can be passed to it.

void UnsafeStack_push(UnsafeStack* stck, void* value) {
    if (!stck) {
        return;
    }

    if (stck->len + 1 > stck->capacity) {
        size_t new_capacity = stck->capacity * 2;
        char* new_data = realloc(stck->data, new_capacity * stck->objsize);

        if (!new_data) {
            /* TODO: Handle memory error. */
            return;
        }

        stck->capacity = new_capacity;
        stck->data = new_data;
    }

    memcpy(stck->data + (stck->len * stck->objsize), value, stck->objsize);
    stck->len++;
}

Similarly, UnsafeStack_pop returns a pointer of type void*, because the type of the objects in the data structure isn't known at compile time. The pointer is an offset into the array calculated as stck->len * stck->objsize:

void* UnsafeStack_pop(UnsafeStack* stck) {
    if (!stck || stck->len == 0) {
        return NULL;
    }

    stck->len--;
    return stck->data + (stck->len * stck->objsize);
}

Errors can be signalled by returning a null pointer, so we don't need a Result object for UnsafeStack.

UnsafeStack_new and UnsafeStack_free are very similar to before:

UnsafeStack UnsafeStack_new(size_t objsize) {
    size_t capacity = 8;
    char* data = malloc(capacity * objsize);
    if (!data) {}
    UnsafeStack stck = { .len = 0, .capacity = capacity, .objsize = objsize, .data = data };
    return stck;
}

void UnsafeStack_free(UnsafeStack* stck) {
    if (stck) {
        free(stck->data);
    }
}

UnsafeStack's API is a bit different from IntStack's. When pushing a value onto the stack, we provide a pointer to it rather than the element itself, and when popping a value, we cast the return value and then de-reference it.

UnsafeStack_push(&unsafe_stack, &i);
int v = *(int*)UnsafeStack_pop(&unsafe_stack);

Since UnsafeStack_push accepts a pointer argument, we cannot pass integer literals to it.

Safe generic stack using macros

Unlike casts in Java and type assertions in Go, casts in C are entirely unsafe. The compiler will not complain at compile time and the program will not throw an exception at runtime if the cast is invalid. Rather, it will silently continue on to do strange and terrible things, like accessing uninitialized memory or overwriting memory belonging to other variables. Since C is a statically typed language, we would prefer to avoid these possible errors.

Let's return to the fundamental problem. If we want type-safe stacks, then we have to write a different, though virtually identical, implementation for each type. It's essentially a problem of code duplication.

Fortunately, C has a mechanism for dealing with code duplication: macros. Rather than writing out a full stack implementation for each type, we can use a macro to generate it for us from a template. We will take our IntStack implementation, replace all uses of int with a type parameter, and then wrap the whole implementation in a macro that is parameterized on type. Whenever we want to use a stack for a new type, we'll call the macro to generate the code for the implementation. As far as the C compiler is concerned, it's as if we wrote a separate implementation for each type with the concrete types in the code so that the compiler can type-check the code properly.

The macro code is a little hard to read, not the least because each line must end with a backslash to continue the macro on to the next line, but it is recognizably almost the same code as for IntStack. The syntax typename##_new, typename##_free, etc., tells the preprocessor to glue typename, a macro parameter, to literal strings like _new or _free, yielding results like FloatStack_new or StringStack_free. The macro parameter type is substituted wherever we had int in the original IntStack code.

#define DECL_STACK(typename, type) \
    typedef struct { \
        size_t len, capacity; \
        type* data; \
    } typename; \
    \
    typedef struct { \
        bool error; \
        type result; \
    } typename##Result; \
 \
    typename typename##_new() { \
        size_t capacity = 8; \
        type* data = malloc(capacity * sizeof(type)); \
        if (!data) {} \
        typename stck = { .len = 0, .capacity = capacity, .data = data }; \
        return stck; \
    } \
 \
    void typename##_free(typename* stck) { \
        if (stck) { \
            free(stck->data); \
        } \
    } \
 \
    size_t typename##_length(typename* stck) { \
        return stck ? stck->len : 0; \
    } \
 \
    void typename##_push(typename* stck, type value) { \
        if (!stck) { \
            return; \
        } \
 \
        if (stck->len + 1 > stck->capacity) { \
            size_t new_capacity = stck->capacity * 2; \
            type* new_data = realloc(stck->data, new_capacity * sizeof(type)); \
 \
            if (!new_data) { \
                return; \
            } \
 \
            stck->capacity = new_capacity; \
            stck->data = new_data; \
        } \
 \
        stck->len++; \
        stck->data[stck->len - 1] = value; \
    } \
 \
    typename##Result typename##_pop(typename* stck) { \
        if (!stck || stck->len == 0) { \
            typename##Result errorval = { .error = true }; \
            return errorval; \
        } \
 \
        type value = stck->data[stck->len - 1]; \
        stck->len--; \
        typename##Result r = { .error = false, .result = value }; \
        return r; \
    }

We then call DECL_STACK to declare a new stack type, either in a header file or at the top level of a program:

DECL_STACK(SafeIntStack, int)

The resultant API achieves the safety and convenience of IntStack and the generality of UnsafeStack:

SafeIntStack safe_int_stack = SafeIntStack_new();
SafeIntStack_push(&safe_int_stack, 1);
SafeIntStack_push(&safe_int_stack, 2);
SafeIntStackResult r = SafeIntStack_pop(&safe_int_stack);
assert(!r.error);
assert(r.result == 2);
SafeIntStack_free(&safe_int_stack);

Note that SafeIntStack is still only as safe as C's type system, which will, for example, allow you to use a string literal where an int is expected, with only a compiler warning. This is a fundamental limitation that cannot be worked-around.

The safe stack data structure has no overhead above hand-written code, except that each new declaration increases the program size by a constant amount (unlike the unsafe stack, which uses the same code for all data structures). Incidentally, this code generation technique is essentially how templates are implemented behind the scenes in C++.

That it is possible to write type-safe generic data structures in a language whose type system does not natively support them speaks to C's flexibility.5 But the techniques we used—unchecked pointer casts and unsanitary lexical macros—are themselves quite unsafe if used improperly. Flexibility, simplicity and safety: a language can attain at most two out of three. Rust and Haskell choose flexibility and safety. Go chooses safety and simplicity.6 C chooses flexibility and simplicity. Each choice has its trade-offs. The tendency for modern languages to prefer safety is a direct consequence of the innumerable bugs found in C code that could have been prevented by stronger compile-time guarantees. Nonetheless, as this post has shown, flexibility and simplicity are a powerful combination. ∎


  1. I assume the reader is proficient in C. The classic text on C is The C Programming Language by Brian Kernighan and Dennis Ritchie, universally known as K&R after the author's initials. Look for the second edition. Though K&R has aged remarkably well, best practices have been refined and some language features have changed since 1988. Modern C by Jens Gustedt is a great re-introduction to the modern language. 

  2. Here and throughout I've added TODOs to mark where a more robust implementation would need to handle an edge case. 

  3. Traditionally, error handling in C is done either by returning a special error value, or "out-of-band" by setting the global errno variable. I find the latter approach inelegant, and the former is impossible because any int value is a possible legal return value of IntStack_pop

  4. GLib, the utility library for the GNOME desktop environment, uses this technique for its garray generic array type. 

  5. For more evidence, see Daniel Holden's Cello, a framework for high-level programming in C. 

  6. Debatable, perhaps. Go does allow some unsafe constructs like interface{}, but it largely abandons the reckless permissiveness of C.