Design paper drafts

Project Tricarbon: Cimple Compiler Components

To get a better name when the project itself sucks less

Architectural Design Document

pre-draft

Acknowledgements

In no particular order, a huge thanks to DarkUranium, floatcomplex, Eleanor, g_w1, and ikskuh for providing invaluable feedback which has COMPLETELY RUINED MY VISION^W^W^W^made the project a lot better. Obviously, all the mistakes are their fault and everything else is mine. Wait. Scratch that. Reverse it. Thank you.

Each of them has provided invaluable feedback which changed the very nature of the project. Thanks, fellow humans! Thanks for making me remember that I don’t have time to make the perfect project before I keel over dead, and I should settle for "eh, at least it’s not GCC."

(and thanks, ori, for inspiring me to up the snark level.)

Abstract

Project Tricarbon - C sub 3 , or "Cimple Compiler Components" - is yet another attempt at a simpler compiler infrastructure. Tricarbon defines a few simple protocols for communication between different components.

Notes

This document assumes familiarity with basic compilation concepts: you are expected to know basic terms such as AST, parsing, machine code, and so on. Function signatures are given in C syntax.

Overview

There are a few classes of components, separated by their inputs and outputs. Parsers read in arbitrary inputs and produce ASTs. AST Processors read in ASTs and produce further ASTs. IR Generators read in ASTs and produce IR. IR Processors read in IR and produce further IR. Code generators read in IR and produce target-specific assembly, or machine code. Linkers take in target-specific code and produce an object file. As a special case, composites are components which combine multiple steps into one. For instance, a frontend comprises parsing, ast processors, and an IR generator. A backend consists of code generation and linking. Generally, the more general term composite is used.

For version zero of the tricarbon protocol, the IR is an ad-hoc SSAT form, defined as "whatever the reference implementation says". No, this isn’t a good solution; it’s a temporary measure to avoid constantly needing to tweak the documentation every time I throw out the design and start fresh. The AST is much more well-defined, but much of the details are left deliberately underspecified for now, either to leave room to make decisions later, or to simply avoid wasting too much time on documentation at such an early stage of the project. During implementation of the reference library, more revision and detail will be worked out.

All components should be usable as a single function exported over the native C ABI, whose signature varies in accordance with the component type. In addition, all components should be usable in a piped sequence, where applicable, such that a full pipeline can be constructed by piping arbitrary components serially.

Parser

A parser is any component which reads in arbitrary inputs and produces an AST (TODO forward-reference to AST definition). The input must be mappable to a c3 AST to be valid (i.e. some form of programming language). The signature for a parser is int(char *, AST*).

Component-functions must return one on success and zero on failure. The first argument must be an absolute path (resolved by the driver) containing the file to read. The second argument must be the address of a memory location to which to write the produced TIRE. The reader must NOT write to the address containing the path.

AST Processor

An AST processor is any component which reads in an AST and outputs another AST. AST Processors can instrument function calls with profiling information, perform optimizations, perform analysis, and so on. A given AST processor may depend on analysis information being present; such processors must simply be invoked after the requisite analysis. Convoluted dependency tracking is out of scope; just don’t be a moron (if your component needs information: CHECK if it’s there and gracefully error out if it’s not). The signature for an AST processor is int(AST*, nonnull AST**).

Note that the second parameter is a pointer-to-pointer-to-AST. The second parameter MUST NOT be NULL; it is the caller’s burden to not pass NULL, the component need not check. The resultant AST is written to the address specified in the second parameter. The output AST MAY be the same tree as the first parameter; it is the caller’s responsibility to test for this case and handle it correctly.

packed structure AST {

     // version of the protocol the AST targets. This MUST be zero.

    u32 version_min

    // Number of nodes - read 4 bytes * node_count to fill in the nodes array.

    // Note that this includes children!

    u32 node_count

    // The nodes themselves. Entries are either nodes, or indices of other nodes, or indices of strings.

    // This is an array, sized at node_count * 4.

    u32 nodes[]

    // Number of strings - read 8 bytes * string_count to fill in the strings array

    u32 string_count

    // The strings themselves. Strings are encoded as indices into the chars array plus lengths.

    // index-based slices, effectively.

    packed structure string {

        u32 index

        u32 len

    } strings[]

    // The number of characters

    // Read this number of bytes after char_count itself to populate the chars array

    u32 char_count

    // The raw character buffer. Strings in this buffer are NOT zero terminated.

    // Components MAY zero-terminate them, but MUST NOT assume that they are zero-terminated.

    char chars[]

}

packed struct node{

    // The node type

    u16 tag

    // The number of children of this node

    u16 kid_count

}

The AST encodes a tree built over a list of nodes. Every AST MUST contain a root node. The node tag indicates which kind of node is encoded. kid_count contains the number of children of the given node. A node’s children are included directly after it in the nodes array. Each child is a 32-bit unsigned integer representing an absolute index of the child in the nodes array, or of a string.

Bad = 0

Comment = 1

Contextual = 2

Type = 3

Identifier = 4

Export = 5

Public = 6

Block = 7

Comments

Comments appear as nodes. This is to facilitate, for instance, semantic analysis of commented-out code by static analyers. After the first parse failure within a comment, no more nodes will be added. Comment tokens refer only to the comments themselves. Any descendants of a comment node have their token index replaced with a string handle.

Contextuals

A contextual node is any node whose tag is obvious from context, and which does not already have a tag designed. Their main purpose is to reduce the number of tags needed, and to avoid needing to set tags which will be ignored.

Types

Identifiers

An identifier is any token representing a user-provided identifier. Identifiers have one child, which is an index into the strings array; that string is the identifier.

Exports

Export nodes must contain exactly four children. First, the ABI. Typically, this is "c". Secondly, the exported name, which is typically the name of an exported function. Third, the type of the exported value, and lastly the value itself.

Public

Public nodes have a single child. Effectively, a public node "tags" another node as public.

Other

All other node types are left undefined for this draft. It is expected that large swathes of this specification will be discarded for the second draft, having been informed by the first components written against this draft. More details will be added then.