Cimple Compiler Components
Project Tricarbon, or c3, is a compiler toolchain I am working on. The toolchain prioritizes correctness, simplicity, and readable code above all else, though compiler performance and runtime performance are secondary priorities.
This draft incorporates ideas from C17, Zig, $redacted, and Coyote (in no particular order), and is intended to be able to process all four (see the "language support" documentation for more information). The only public frontend at the moment is for Zig, so it is used for code samples in this draft.
Language-specific parsers produce one tree per input file, which is annotated with scoping information for declarative scopes. For Zig and $redacted, this includes a tree for every referenced file; whether an import is needed is unknown until analysis. Since language semantics may differ for similar instructions - for instance, unsigned addition may wrap, error, and so on - the tree built by the parser must be annotated with some semantic information for usage by analysis.
Tree resolution builds up a dependency graph, resolves identifiers and field lookups, and generates scope information for imperative scopes. For languages like Zig and $redacted, which use the normal source files as "headers", "headerization" is performed, using scoping information to replace references to imported decls with external values. This results in self-contained trees.
Next, the comptime engine analyzes all comptime calls - which may include runtime code which is detected as comptime-known - and patches the tree with the results. Type analysis checks for bad code, such as const a: u32 = -1; , resolves inferred types, and so on. Finally, expression analysis deciphers expression semantics and generates IR, patching expressions in the original tree with references to their IR.
Detailed design of the actual implementation
This section contains the design for the actual implementation, and as such it includes implementation details. This is not exact - any issues which come up during the course of implementation will be fixed, and updating this document is not a priority.
The parser is responsible for generating a scope for each decl. This information is used throughout tree resolution. For Coyote, which supports more conventional generics, the parser is additionally responsible for translating e.g. foo<T> into the equivalent of a zig function fn foo(T: type) type , and specializations into calls to that function, such as foo<u32> into foo(u32). For $redacted, use foo; is treated as a decl named foo whose value is treated as an import of the module in question. For Zig, imported files are added to the compilation queue. For $redacted, one invocation of the compiler processes one module, and imported modules are only headerized. C includes are treated as imports as well, which happen to not be assigned an identifier.
$redacted modules are analyzed at the file level, not the module level. After analysis, module trees are stitched together and cleaned up to form a single translation unit per module. Similarly, Zig is compiled on a per-file basis, and all files are stitched together into a single object after analysis.
Semantic analysis is managed by a small analysis driver which tracks trees and optionally manages per-tree worker subprocesses if parallelization is desired. At most one subprocess can be used for a given file, as multiple workers acting on a single tree would be mostly pointless, and would increase the complexity of the compiler significantly, infecting whole swaths of the design.
When all work is done by the main process, and references to other trees are found, analysis for the external tree is invoked directly. When workers are being used, the driver tracks work which needs to be done and launches a worker for a given tree whenever it decides, which must be when no other worker for the tree in question is active.
Semantic analysis is broken down into two overall stages: tree resolution and semantic extraction. Tree resolution focuses on understanding the tree itself, and consists of usage analysis and scoping resolution. Semantic extraction uses the understanding of the tree obtained during tree resolution to produce typed IR, and comprises type analysis, expression analysis, and the comptime engine.
Usage analysis creates a (per-tree) runtime call graph and a compile-time dependency graph. Runtime dependencies initially are identifiers, field lookups (e.g. "a.b"), and so on. Compile time calls are identical, but they also can contain a list of arguments. Compile-time calls with different arguments are treated as separate dependencies. Any given dependency is only noted once, no matter how many times it is used. The comptime graph tracks all compile time dependencies, not just calls, providing information which is crucial to semantic extraction.
Usage analysis recursively analyzes nodes, depending on context. For declarative scopes, such as structures, it analyzes any exported nodes. For functions, it invokes scope resolution for identifiers and field lookups. Scope resolution patches the tree to replace references with their associated decl, and invokes usage analysis for the target.
Usage analysis tracks nodes which have been analyzed, and never analyzes the same node twice. Scoping resolution maintains a cache mapping identifiers and members to resolved decls. This information is used to both avoid performing redundant work and to prevent infinite recursion.
For imports, usage analysis replaces references to the import with "headerized" structures instead - essentially, it inlines the imported file, eliding runtime function bodies (note: inline functions are never elided). When scoping analysis encounters lookups in a headerized structure, it notifies the analysis driver. If running in parallel, the driver will wait for the job acting on the external tree to finish (if one is running), and spawn a new process resolving any decls that have been discovered as needed.
Next, semantic extraction climbs the dependency graph from the bottom-up, analyzing each node found. Comptime calls are run through type analysis, expression analysis, and executed by the comptime engine. Other nodes are simply run through type analysis and, for functions, expression analysis. Leaves are always analyzed first, and non-terminating recursive chains are detected and rejected.
The type analyzer determines the type of every expression, decl, and so on; it verifies that types are correct, and resolves types. Expression analysis inspects every expression and generates IR, using type information obtained from type analysis. The comptime engine evaluates generated IR, patching the tree with the results.
The analysis manager MAY pass the final trees through a cleanup phase to shrink them. When analysis is finished, the analysis manager stitches trees together if necessary (a single tree is produced for Zig, and one tree per module for $redacted) and hands control back to the driver, which will e.g. run codegen.