Design paper drafts

Project Tricarbon: Cimple Compiler Components

To get a better name when the project itself sucks less

IR Format

Pre-draft

Overview

The tricarbon v0 IR is an SSAT (Single Static Assignment Tree) form.

While the name "IRTree" is used to reference the tree that contains IR within this document, the name is purely a convention of the documentation. Implementations may refer to the tree in any manner they desire. If this happens to look like a copy+paste of the AST layout, that’s because it is.

packed struct IRTree{

     // version of the protocol the IR Tree targets. This MUST be zero.

    u32 version_min

    // Number of nodes - read 4 bytes * node_count to fill in the nodes array.

    // Note that this includes children!

    u32 node_count

    // The nodes themselves. Entries are either nodes, or indices of other nodes, or indices of strings.

    // This is an array, sized at node_count * 4.

    u32 nodes[]

    // Number of strings - read 8 bytes * string_count to fill in the strings array

    u32 string_count

    // The strings themselves. Strings are encoded as indices into the chars array plus lengths.

    // index-based slices, effectively.

    packed structure string {

        u32 index

        u32 len

    } strings[]

    // The number of characters

    // Read this number of bytes after char_count itself to populate the chars array

    u32 char_count

    // The raw character buffer. Strings in this buffer are NOT zero terminated.

    // Components MAY zero-terminate them, but MUST NOT assume that they are zero-terminated.

    char chars[]

}

packed struct node{

    // The node type

    u16 tag

    // The number of children of this node

    u16 kid_count

}

In the initial implementation, a hack was used: the first entry in nodes was the index of the root node, and not itself a node. This saved 12 bytes (the nodes array at the time was 8+4N, whereas now it’s 4+4N). I am not tolerating such stupidity from myself anymore: 12 bytes is not worth the special casing (chances are that the increase in code size offset it anyways!).

Now, the root node MUST be at index zero. It MUST have one child, and that child MUST be an IR_MODULE.

The current implementation also merges the AST and the IR nodes. As long as the AST nodes are not IN ANY WAY accessible during traversal, this is not an issue.

While nodes are identical to those used in the AST - and the overall layout is identical - the tags are not. A shared structure can be used for the AST and the IR layouts, so long as the tags - and traversal - are handled correctly, and so long as no attempt is made to use a tree of one type as the other.

Tags are currently specified as "whatever the reference implementation says." Traversal is currently specified as "whatever the reference implementation says." It may be possible to naïvely traverse the tree by printing tags and recursing downwards, but it might not; it depends on whether non-nodes are used as children.

The tree nature of the v0 IR means that the SSA register result of an instruction is its index. To avoid globally visible registers, this is offset by the first instruction in a block. SSA registers are only visible in the block containing them. Branches must explicitly pass all registers they use.

TODO: measure whether 8-bit tag and 24-bit kid count has a performance cost, and evaluate whether it’s worthwhile (a small performance hit might be acceptable). Alternately, maybe try to find a way to use a struct of arrays instead of an array of structs? e.g. struct{

    u8 tags[]

    u24 kid_counts[]

    u32 indices[]

}

The issue then is, of course, how to handle children indices. One option is to replace kid_count with first_child, and use null-termination. This isn’t a priority.

TODO: make "reference implementation" a link.