Design paper drafts

Project Tricarbon: Cimple Compiler Components

To get a better name when the project itself sucks less

The Hardest Problem

First v0 Draft ; Informal Rationale

A formal draft that focuses just on explaining how names are treated in tricarbon, without the backstory, will be made available before v1.

Overview

It is well known that naming is the hardest problem in computer science. You have probably heard that dozens of times. Most people think that the adage refers to the task of determining what name to give a project - such as, say, Tricarbon. That’s wrong. I will explain what it really means. More seriously, this document outlines the method used to give names to functions in Project Tricarbon v1.

Functions have names. This isn’t news - or, at least, it shouldn’t be. Function names are a bit trickier than they initially seem, however. This is especially true in Zig, but is a lesser concern even with C. Here are some examples:

foo.c:

static void

foo(void)

{

    ...

}

bar.c:

static void

foo(void)

{

    ...

}

In C, this is perfectly valid. Both of these functions exist only in the files containing them, and since it’s impossible for the identifier foo to ever be ambiguous, there’s no conflict. That is, there’s no conflict for the programmer. For a compiler which produces assembly, however, things aren’t necessarily so clear-cut.

Now, to be clear, this particular problem is not very difficult to solve - in fact, with most assemblers, the solution is just "punt the problem off to the assembler." Every good assembler understands the difference between module-internal procedures and globals, and will be able to handle them correctly. This isn’t even a bad solution; the compiler’s output must by necessity depend to some degree on the assembler anyways, and this functionality is crucial for those of us who program in assembly anyways, so it’s perfectly reasonable. Better yet, in C, functions can only really exist at the root scope of a file. You might well name the functions foo.c:foo and bar.c:foo if you wanted an unambiguous reference to use in a debugger.

The primary source of practical issues is when a given function can have multiple names. C, however, doesn’t allow arbitrary aliases - at least, not at the semantic level. If you want to alias a function, you can abuse macros. Since there’s no situation in which real ambiguity is possible, any call foo() 

can almost certainly be turned into a call foo 

assembly instruction, leaving the final label resolution to the assembler (and, if it’s external, the linker).

Zig, on the other hand, treats functions a little bit differently, in a way which presents a greater challenge. In zig, functions are just normal values of a function type. For instance, the following is legal:

foo.zig:

pub const foo: fn()void = fn()void {};

pub const bar = struct{

    // Unambiguously references the top-level foo

    pub const baz = foo;

};

bar.zig:

pub const foo: fn()void = fn()void {};

pub const bar = foo;

Both are public - but not exported. Both can be referenced unambiguously:

const foo = @import("foo.zig");

const bar = @import("bar.zig");

const foo = fn()void {

    foo.foo();

    foo.bar.baz();

    bar.foo();

    bar.bar();

};

For more fun, bar.foo could just be @import("foo.zig").foo; the issue is identical either way. In the case above, things are a little bit trickier. Obviously, "foo" is not a suitable name for any of the functions that show up. If that third foo appears in the root source file, "foo" may well be a reasonable name. When it comes to packages, the package name can be used as a prefix. However, all three "foo"s here appear in the same package, so the situation requires more thought.

To give the example in asm, while we could easily just use the name foo in foo.asm and bar.asm, in the file that references both, the external label foo becomes ambiguous.

One immediate option is to name them "foo.foo" and "bar.foo", taking the prefixes from the imported decls. This has a subtle problem: it’s wrong. The prefixes there are coming from the call site, and are not actually properties of the functions themselves. This does hint at a better solution, though - we simply need prefixes that are properties of the function instead of the call site.

The most obvious way to do this, in my view, is to assign names to every intermediate scope. Of course, this again leads to a problem: how?

const foo = struct{

};

We might well name that scope "foo". But what happens when you alias it?

const bar = foo;

If I invoke a function indirectly through bar, as bar.baz() for instance, what name should be used when deciding on the fully-qualified function name? "bar.baz" or "foo.baz"? What if I use the function through foo and through bar? Should the name depend on the call site?

Actually, that is a very reasonable solution. Effectively, instead of assigning a fully qualified name to each function, we can assign a list of names. When generating assembly, we’d simply print them all!

foo.baz:

bar.baz:

Then, at each call site, we don’t check for a name in the target function, and instead determine a name based on how we arrive at the scope.

This solution has drawbacks. Most obviously, it increases memory requirements a bit, and has a minor performance impact: now, instead of a single lookup to find the name, we must compare against all existing names for each call site. If we embed the fully qualified names in the analysis information, this will also make the protocol a little bit more complex.

One potential drawback could come with debugging. Hypothetically, if you set a breakpoint on foo.baz, it might not effect calls to bar.baz - except, of course, that debuggers don’t determine the call based on the name at the call site, but on the position counter. In theory, such a debugger could in fact exist. In practice, it is almost certainly a non-issue. That said, without explicitly checking how a given debugger works, it’s entirely possible that this is may be a serious issue, so it should not simply be ignored. It is worth looking into gdb and acid, as the two most relevant debuggers to the tricarbon project, to ensure this will not be a problem.

Now, even if the drawbacks are relatively minor, a related question needs to be asked: are the benefits worthwhile? I think so. Here’s an example: let’s say you’re debugging an application, and you set a breakpoint on some function foo. When you get to that function, you inspect the code using the debugger and see that it’s calling a function bar.baz, so you tell the debugger to set a breakpoint at "bar.baz" - and it tells you no such function exists. If the compiler decides on a name that differs from what you see, it negatively impacts your debugging experience, since you now need to try and guess what name is used.

The benefit is clearly crucial to a positive debugging experience - except, one problem remains. THe naming scheme proposed uses the package name as the initial prefix. That is, for a function "foo" in the root package, the name "root.foo" will be given to it. There will never be ambiguity, since ambiguous names within a package aren’t possible and every package needs a unique name, thus guaranteeing a unique prefix. However, this also means that if you tell gdb to set a breakpoint at, say, foo, it will not know what you’re talking about - giving the exact same problem as before.

Okay, so what if we just drop the root prefix? Well, now a new problem arises. "foo.bar" can mean the root package’s "foo.bar", or the "foo" package’s "bar". A smart debugger could use the context of the position counter and analyze the code to decide what to do, leaving the question moot, but we can’t rely on such behavior.

I could keep doing this back and forth, but I think I’ve illustrated the problem enough to jump to the conclusion: there is no perfect solution. That might be a surprise, but not all problems have good answers. In this case, it’s a matter of tradeoffs. The primary benefit gained by using multiple names ends up being completely moot, so now a single name might be a good choice. We have to just lay out the options and pick one.

As always, Tricarbon’s top concern is simplicity. The only absolute constraints are that ambiguity is prohibited, and the name has to at least be meaningful to the programmer - that is, we can’t just use an incrementing counter. Within those constraints, whatever option is simplest is what we should use.

If we use a single name, there’s a few options for how to decide the name. We could the name of the file containing the file, followed by the scopes, followed by the function name. For instance, in a.zig in the root source file:

const foo = struct{

    const bar = fn() void{};

    const baz = bar;

};

"baz" here is simply an alias for "bar," so it has no impact on naming. Any usage via baz will be resolved to the function, and thus receive the proper name. We could call this function root.a.foo.bar, assuming that we require all files to end in .zig (and thus don’t need it to appear in the name). We can also special-case the root package by removing the name, so that this function becomes .a.foo.bar.

This is unambiguous and meaningful, but is not the only option that fits those constraints. However, this option immediately jumps out as being very simple. A call of

foo.baz();

produces the assembly

call .a.foo.bar

Since in the root source file, I’m likely using the import

const a = @import("a.zig");

, it is probable that the call would appear as a.foo.baz(); While ignoring the alias might surprise the programmer, it should also not be something that they notice. We can provide the alias in the debug information so that breakpoints work, even if we don’t use the name internally, as a purely optional bonus that gains us back the main advantage of using multiple names.

Since this meets the constraints, and is exceedingly simple, seeking further options is unnecessary. More importantly, if there are unexpected drawbacks, we can come back to this decision for v2 and improve it. Better a good solution today than to spend a long time bikeshedding this and aiming for a never-approaching perfection.