Zig backends for people who picked INT as their dump stat

Published 2020-07-31 on A Heroic Pixel

Hello again! This would be the second post this month, but after I decided not to publish an angry rant over how stupid everyone in America is being right now (largely because I decided that that would also be stupid), this is promoted to the first post in nearly three months.

Per the title, I’m going to be writing this under the assumption that you, the reader, have a good working knowledge of Zig, but have absolutely no knowledge of the compiler. I’m assuming you have a solid understanding of programming, and that you know the meaning of “compiler,” “assembler,” “register”, and so on, and that you can navigate Zig’s source code.

If you don’t know Zig, I recommend looking at the official documentation and / or ZigLearn. I don’t have any resources off the top of my head for learning about low-level programming, but if you need help finding some, let me know.

If you don’t understand something, you want me to elaborate, you spot mistakes, or you want help actually writing a backend, feel free to contact me in any relevant location (the best two options are to ping me as pixelherodev in #zig on Freenode, or email me at noam@pixelhero.dev).

The SPU II backend is not yet done, but people asked for advice for writing a backend, and a number of my patches for the SPU II backend are more generally useful (16-bit ELF support, improved test harness, etc). I’m going to be rushing to get those merged separately from the SPU II backend immediately after publishing this article, but they are not yet there. I highly recommend waiting a few days from the time of publication (7/31/20) so that you don’t accidentally redo work that’s already waiting to be finalized and merged.

This post is a sort of hybrid; it’s both a write-up on the new SPU II backend, and a guide to writing new backends.

This article will be going up on gemini://pixelhero.dev/ within a week.

Introduction

What is a Zig?

So, you want to build a Zig backend, but don’t know where to start. Or, maybe, you found this article randomly on the internet. Maybe you’re curious about the SPU II backend, because you find the emulator cool. Or you found this not-randomly and you’re stalking me (please don’t do that, it’s not very nice). Well, have no fear! By the time you’re done reading this, you’ll know how to build your own, custom Zig backend!

Credentials

“I Know What I’m Talking About, so Listen Up!”

I’ve been keeping an eye on Zig for about a year now. My first commit to the project was back in October, but I really started getting involved about two months ago. As of the time of writing, only Andrew (the creator and head programmer of Zig, founder of the Zig Software Foundation, and all around great guy) has more commits to the self-hosted compiler than I do (not that commit count is a remotely meaningful metric).

I wrote and de facto maintain the C backend (not that it’s remotely finished :P), I’ve done a lot of work on the testing harness, I helped organize the AMD64 codegen, and I wrote and de facto maintain the even less finished SPU Mark II backend. I definitely don’t know nearly as much as Andrew does about the compiler as a whole (not even close) but I know more than enough to not just write new backends myself, but also to help you do it.

Background Information

Zig is a programming language that seeks to supplant C, much as Rust seeks to supplant C++. As a new programming language, it has a pretty straightforward path to completion:

1) Design the language. This is arguably the single hardest part. There’s a lot of thought that needs to be put into nearly every decision (you can look through all the hundreds of proposals to get a glimpse at this.

2) Write a compiler for Zig in an existing language (in Zig’s case, C++).

3) Using the first compiler (known as “stage 1”), write a new compiler for Zig, this time in Zig. This is the stage 2 (or “self-hosted”) compiler. This… is the future.

The first step is largely done, even if it’s not quite there yet. It’s also worth noting that a lot of the design for the language has come about as a direct result of the work on the self-hosted compiler, and other projects using the language.

The second step — the stage 1 compiler — is also almost done, though there’s a number of features missing and low-priority bugs (plus some high-priority ones here and there — it’s extremely bad with memory usage, for instance). Currently, we’re mostly on the third step: writing the stage 2 compiler.

Basic Stage 2 Structure

Unlike the stage 1 compiler, which was written as the language was designed, the stage 2 compiler was designed specifically for Zig as it exists today, instead of growing organically alongside the language. As such, it has a very, very clean design.

First, the parser reads in Zig source code and builds an “Abstract Syntax Tree,” which is basically an in-memory representation of the raw source code in a more machine-friendly format.

The Zig analyzer iterates over the AST and converts it into an IR, or Intermediate Representation, called ZIR (Zig IR). Where AST matches the original source 1:1, the IR is much closer to machine code, and much easier (and faster) to work with. This IR is untyped.

The ZIR is then passed through the semantic analyzer, which checks types, processes compile-time control flow, inserts safety checks, and so on. The semantic analyzer produces typed IR.

Next, the analyzed IR is passed through to… drum roll… codegen! This is the stage that we care about for backends; it takes the machine-friendly IR and produces raw machine code (AVR, SPU II, WASM, etc). This is what we’ll be focusing on.

Adding a Backend

As a quick note, I’m assuming for the purposes of the article that the SPU II backend has been merged, as it also tweaks the structure of the compiler a bit to make it a bit more friendly.

Adding the Target

I’m going to be focusing on adding new architectures. Adding a new operating system is a somewhat similar process, but not the same. On that note, a minor disclaimer: this isn’t a tutorial. This isn’t a “copy this and you’ll have a backend!!!” situation. You will need to think about this. You will need to put in a lot of effort. Writing a compiler backend for Zig is simple, but simple does not mean easy.

Before you can add a backend, Zig needs to know that the target exists. All targets are placed into std.Target. Note: some targets may already be present in std.Target.Cpu.Arch, but not have backends implemented. If this is the case, you can skip to the next section.

If a target is not present in std.Target.Cpu.Arch, it must be added to the end of the enum. There should be a comment to the effect of Non-LLVM backends here. Anywhere after that comment is fine. This was, in fact, one of the issues that came up when adding the SPU II backend. When Zig’s stage 1 is built, some of the self-hosted code is actually used — specifically, std.Target, and the target parsing code. If you insert a new arch into the middle of the enum, it changes the value of the other architectures — and stage 1 passes the target value to LLVM. Since LLVM expects the definitions to have specific values, modifying any portion of the enumeration which is used by stage 1 will result in a broken stage 1 (and thus, CI failing).

Once the target is added to the enumeration, there’s a few more things that need to be changed. There are a few switches in std.Target that sift through architectures, such as ptrBitWidth (which gives the pointer size of the architecture — which is used for usize). Searching for switch (arch) in the file should find them pretty easily, and if you miss them, you’ll get a helpful error message when you try building the compiler.

If your target utilizes ELF, you’ll also need to make sure there’s a number for the machine type in std.elf.EM. If the target has an official ELF spec, and isn’t already in the enum, use the given number; if this is a custom target, make one up. If you have tooling for the target, the number should be consistent with that expected by the tooling (or you can expect issues).

    /// SPU Mark II. This is not a legitimate number. Don't rely on it.
    _SPU_2 = 13,

There’s also a switch in std.Target that finds the ELF machine number from a std.Target.Cpu.Arch. If your system doesn’t support ELF, you’ll want _NONE.

    .ve => ._NONE,
    .spu_2 => ._SPU_2

There’s yet another switch which detects the dynamic linker path for Linux. For architectures that don’t support Linux, this should be unreachable.

    // These architectures have been confirmed as not supporting Linux.
    .avr,
    .spu_2,
    => unreachable,

Otherwise, you’ll want something like this:

.arch => return copy(&result, "/lib/ld-linux.so.2"),

Congratulations, now Zig knows about another Target! This is the easy part, though. Now, for the actual work.

Adding Basic Codegen Information

Before you can work on codegen proper, you need to give codegen a bit more information about the target architecture. Create a file codegen/my_arch.zig that looks something like this:

// We need to tell codegen what registers are available for e.g. register
// allocation
pub const Register = enum {
    // There are no registers, but an enum requires at least one member, so we
    // have a dummy entry
    dummy,
    
    // Maps a register to an index in callee_preserved_regs. `null` indicates
    // that a register is not present in callee_preserved_regs.
    pub fn allocIndex(self: Register) ?u4 {
        return null;
    }
};
  
// A list of registers to be saved by called functions (as opposed to the
// caller)
pub const callee_preserved_regs = [_]Register{};

You’ll need to adjust it to match the target architecture. x86 for instance includes 24 registers, and defines callee_preserved_regs as

pub const callee_preserved_regs = [_]Register{ .eax, .ecx, .edx, .esi, .edi };

Next, you need to tell the general codegen where to find this information. There’s a switch at the end of codegen.Function(arch) which embeds it.

         usingnamespace switch (arch) {
             ...
             .spu_2 => @import("codegen/spu-mk2.zig"),
             ...

The last boilerplatey thing to do is go to codegen.generateSymbol’s per-architecture switch.

         .Fn => {
             switch (bin_file.options.target.cpu.arch) {
                 ...
                 .spu_2 => return Function(.spu_2).generateSymbol(bin_file, src, typed_value, code),
                 ...

This usage of Function(.spu_2) calls the generic Function with spu_2 as the target architecture, which triggers all of the switch(arch)s we modified earlier to include the correct values.

Now, we’re at the fun part!

Iterative Development

The best way to develop a backend without issues is to take advantage of the test harness that Andrew and I designed. It makes it very easy to test successively larger pieces of code and slowly get the full backend going.

Tests get sorted by test type within test/stage 2/. For now, we’ll focus on test/stage 2/compare_output.zig. Cases here are sorted by architecture. Look for a comment that says something like $ARCH tests here.

Tests look a little something like this:

     // SPU-II tests go here
     {
         var case = ctx.exe("SPU-II Basic Test", spu);
         case.addCompareOutput(
             \\export fn _start() noreturn {
             \\    asm volatile ("undefined0");
             \\    unreachable;
             \\}
         , "");
     }

Let’s break it down. First, we have ctx, the TestContext. The TestContext is the heart of the test harness. It manages the test cases, tracks progress, and runs all the individual tests. We’ll come back to it in a little bit, but for now, all we need to know is that it has a number of convenience functions to make our lives a lot easier.

         var case = ctx.exe("SPU-II Basic Test", spu);

ctx.exe, for instance, creates a test case named “SPU-II Basic Test”, which parses Zig source code and produces a binary for the specified target — in this case, spu. spu is defined near the top of the file.

const spu = std.zig.CrossTarget{ .cpu_arch = .spu_2, .os_tag = .freestanding };

Okay, so now we have a test case. We need to tell it what to actually test!

         case.addCompareOutput(
             \\export fn _start() noreturn {
             \\    asm volatile ("undefined0");
             \\    unreachable;
             \\}
         , "");

Cases consist of a series of updates, which are processed via incremental compilation. Updates can test for compiler errors, ensure that code produces the correct output, and more. CompareOutput updates build the provided source code, run it, and verify that the expected output is received.

In this case (pun intended), the expected output is an empty string — we’re not expecting any output! What this means is that the test is checking that it can compile and run the given code — in fact, if any output is received, the test will fail!

Let’s take a look at the example source.

export fn _start() noreturn {
    asm volatile("undefined0");
    unreachable;
}

This a single function, which does not ever return. It uses inline assembly (undefined0), and the unreachable afterwards asserts that the asm ends execution. What’s really going on here?

undefined0 is an SPU II instruction which, technically, does nothing. If you use it on an actual SPU II, there’s no telling what will happen. The test suite (ab)uses it as a “kill the emulator” marker.

This test is the best first choice because it is about the simplest test you can possibly write. To understand why, I need to admit something: I lied a bit earlier. The analyzed IR actualy gets to the codegen through the linker, which manages the output buffer directly.

We’re currently using the ELF linking-backend, which takes produced code and stores it in an ELF executable. All decls are routed through link.File.ELF.updateDecl, which is what calls codegen.generateSymbol. generateSymbol’s job is to put the machine code for a specific function into a provided buffer, and that’s it. We don’t have to care about export fn _start() noreturn ; that’s handled elsewhere. The linker sets aside space for it, and gives us a buffer to put the code into. All we have to care about is this:

asm volatile("undefined0");
unreachable;

This gives us two instructions, one assembly and one unreach. All instructions go through genFuncInst, which determines what we’re generating. For assembly, we hit this branch:

                 .assembly => return self.genAsm(inst.castTag(.assembly).?),

We’ll come back to this in just a second. For unreachable, however, we hit this branch:

               .unreach => return MCValue{ .unreach = {} },

We don’t have to care about unreachable behavior because it’s handled for us, elsewhere. Unreachables have the same behavior on every single platform.

Let’s get back to genAsm. genAsm’s implementation boils down to this:

          switch (arch) {
               ...
               .spu_2 => {
                   if (inst.inputs.len > 0 or inst.output != null) {
                       return self.fail(inst.base.src, "TODO implement inline asm inputs / outputs for SPU Mark II", .{});
                   }
                   if (mem.eql(u8, inst.asm_source, "undefined0")) {
                       try self.code.resize(self.code.items.len + 2);
                       var instr = Instruction{ .condition = .always, .input0 = .zero, .input1 = .zero, .modify_flags = false, .output = .discard, .com
                       mem.writeIntLittle(u16, self.code.items[self.code.items.len - 2 ..][0..2], @bitCast(u16, instr));
                       return MCValue.none;
                   } else {
                       return self.fail(inst.base.src, "TODO implement support for more SPU II assembly instructions", .{});
                   }
               },
               ...
           }

To add support for a given architecture, you just need to add a branch for it. Depending on the system, you may need to process inputs; since for SPU II we’re just using undefined0, which doesn’t need any, we can skip it for now and come back to it later.

Any given architecture’s block in the switch just needs to check what the instruction is, and add it to the buffer the linker gave us. For undefined0, we set aside two bytes (all instructions on SPU II are two bytes plus up to two two-byte arguments; we take no arguments, so it’s just two bytes). We use the convenient definitions I added to std.spu to write the instruction instead of writing it out by hand, and then use std.mem.writeIntLittle to ensure the instruction is correctly placed into the buffer, even if we’re on a big-endian system. undefined0 ends execution, and thus returns no value, so it returns a Machine Code Value (MCValue) of none.

That’s it! That gets the first test… building. For a target like the SPU II, for which we need to use an emulator to run tests, things get a bit more tricky, and intensely target specific. Remember how I mentioned that CompareOutput tests execute the binary? Well, if it’s not a native compilation, we can’t do that directly. Some targets may have QEMU support; you might want to integrate QEMU via std.ChildProcess. SPU II does not, but is an extremely simple architecture, so I simply wrote an interpreter (which will be exposed in std.spu.interpreter once the backend is a bit more well-tested).

While the implementation is target specific, I did put in some effort to make the process of integration simple and straightforward. We used CompareOutput as the type of the Case’s only Update, but despite exposing a number of different Update types, there are only actually three internally: Transformations, which take Zig or ZIR and transforms the input; Errors, which attempt to compile bad code and make sure that the expected errors are triggered; and Executions, which actually run the code.

If you look at TestContext.runOneCase, you’ll notice a very simple structure: it loops over all of a Case’s Updates, performs incremental compilation, ensures it didn’t fail unexpectedly, and then switches over the Update type. Since only one test type has to actually execute the produced code, we only have to deal with one code path here.

The Execution branch of the switch looks like this:

std.debug.assert(!case.cbe);
if (case.target.isNative() or (case.target.cpu_arch.? == std.builtin.cpu.arch and case.target.os_tag.? == std.builtin.os.tag)) {
	...
}
else {
	switch (case.target.cpu_arch.?) {
		.spu_2 => {
			// Run the SPU II interpreter
		},
		...
	}
}

To add support for testing an architecture non-natively, or in freestanding mode (think x86 kernel code, running in QEMU), simply add a branch here which runs the binary and retrieves its output. For QEMU, you could for instance set the serial port to go to stdio and largely copy the native target. The path to the binary is available as exe_path. Once the test harness is set up, you can add as many tests as you want without any hassle.

To make sure the test harness works, I recommend adding tests which you know should fail alongside those that should pass, and testing what happens. To run the test suite, run ./zig build test-stage 2 from the Zig build folder. If everything was done correctly, you should see something like this:

All 1 tests passed.

The 1 test refers to the entire stage 2 test suite. If you see this, it means everything’s good, and you can submit your patches to the project!

At this point, you’ll have a backend which can utilize a single inline asm instruction, and nothing else. From here, you’re going to want to keep adding more test cases, and then implementing whatever TODOs you trigger. A good second test for instance is this:

fn endTest() noreturn {
	asm("undefined0"); // replace this with the asm to 1
	unreachable;
}

export fn _start() noreturn {
	endTest();
}

This should trigger a TODO when generating the function call, which should be relatively straightforward for a no-parameter case. Then you can work on arithmetic, parameter passing, returning values, and so on!

Summary

Backends are actually really straightforward to add to Zig’s stage 2 compiler. There’s a simple checklist for targets which support ELF.

First, there’s the “Notice me, com-pai-ler” phase:

  • Add basic info to std.Target and std.elf.EM
  • Provide basic information about registers to codegen
  • Set up a test harness. This can use QEMU, wasmtime, an interpreter in the stage 2 source, or something else entirely.

Once that’s done, there’s the “codegen is love. codegen is life.” phase:

  • Add tests with more functionality
  • Find TODOs / pieces of the backend that are not yet implemented
  • Implement them!
    • Fail to implement them.
    • Read the manual
    • If ARCH == x86
      • Cry
      • Read it again
      • Curse out Intel and AMD
      • Sorta understand it
      • Repeat a few times
    • Understand the manual
    • Implement them! Finally! You showed that stupid spec who’s boss!
  • Get the tests passing!
  • Send patches upstream
  • Repeat until there are no TODOs left!

Thanks for reading, and I look forward to seeing your patches! As I said earlier, don’t hesitate to reach out to me on IRC or over email if you need any help, found mistakes in this post, or want to rant angrily at someone :)

Tips and Tricks

  • Take advantage of incremental compilation wherever possible

Credit for this one, unsurprisingly, goes to Andrew; he mentioned something about this on IRC the other day.

Incremental compilation means that less work has to be done on each successive iteration. Let’s take a look at a simple example:

// Compatible API with other platforms, we ignore the code
fn exit(code: usize) noreturn {
	asm("undefined0");
	unreachable;
}

export fn _start() noreturn {
	exit(0);
}

Let’s say this is the first Update to a Case. Everything here will be compiled. Let’s now add a second case:

fn add(a: usize, b: usize) usize {
	return a + b;
}

// Compatible API with other platforms, we ignore the code
fn exit(code: usize) noreturn {
	asm("undefined0");
	unreachable;
}

export fn _start() noreturn {
	// Now, we're testing addition too
	exit(add(3, 4));
}

This second test would, normally, have every piece compiled, just as in the former one. If we add it as a second Update to the case used for the first test, however, exit will not be analyzed at all, resulting in a faster test suite — which benefits all of us working on the compiler! A better example would utilize e.g. the standard library, but we don’t support packages in stage 2 yet.

  • Don’t make assumptions. If you aren’t sure about something, ask.

Compiler code can be tricky, but it doesn’t have to be. If you need help with something, ask. It’s better to take the time to be certain in what you’re doing now than to have to deal with cleaning up broken code later. We don’t bite, and we’re happy to help if you need it!

  • Take a look at how other backends handle a problem

Since the compiler utilizes, for instance, incremental compilation, there are some behaviors that might seem a bit odd. Looking at how other backends handle any given issue can help immensely.

For instance, all function calls in debug mode go through a Global Offset Table, or GOT. This allows us to update functions without having to update every function that calls them — we just change the address in the table. Instead of calling the function’s address, you load the address of the function from (got + function_index) and call that.

This might not be obvious when you’re writing the codegen for a target. If you look at how e.g. the x64 backend handles it, you should find comments explaining what’s going on. You can then either figure out what to do yourself, or — going back to the previous tip — realize you aren’t sure what to do, and ask what’s going on.

Additional SPU II Backend Details

There’s a few more fun bits about the SPU II backend that I didn’t include in the “How To” section, so if you’re interested, here they are; if you’re not, there’s not really anything else left for you.

Quick Instruction Set Rundown

The SPU-II pipeline is broken down into a few distinct stages: condition, input, command, and output. Every instruction includes an execution condition (the architecture is fully predicated). If the condition is false, it’s effectively a nop. Every instruction has two inputs, which can be either zero, a 16-bit immediate, or a value from the stack (optionally popped). Every instruction’s command produces an output, and the output is handled according to the Instruction’s output behavior. The output can be discarded, pushed to the stack, or used as a jump target (either absolute or relative). Instructions can also choose whether or not to modify the flags register based on the produced output.

Function Calls

As I mentioned in the last tip, function calls go through a table. It is basically just, in Zig terms, the following:

const got: [:0]usize = ...;

I use usize here because the type of the function isn’t included in the table, only its address. It’s also worth noting that it’s perfectly valid for the offset table to contain a zero address (a function can get placed at address zero on some platforms), so the type doesn’t quite match up.

Regardless, all function calls load their target address from that table. For instance, to call the endTest function in the tests, we don’t call endTest directly, we load its address through the table.

To call a function directly, we’d push the return address, then use an instruction with an immediate input (the jump address), the copy command (copy’s output is the first input), and the jump output behavior (and the condition set to always for an unconditional call).

That’s two instructions: the push of the return address, and the absolute jump.

Calling a function via the offset table also requires two instructions. Instead of jumping to an immediate, we use a load16 with an immediate, and jump to that. load16 takes in an address as input, and its output is the value at that address. So, if the offset table is at 0x4000, and the function we’re calling is at the sixth index, we load16 the address 0x400C (pointers are 16-bit, so six times two = twelve bytes deep in the table), and jump to it. This is exactly as fast as the call would have been directly; the only cost is two bytes of RAM per function.

std.spu.Instruction

In addition to providing an interpreter, with the new SPU II backend, std.spu exposes an extremely useful define which was stolen copied with full permission from Felix’s spu-mark-ii repository: Instruction. Unlike x64, which currently has raw hex encodings for all instructions, SPU II codegen is downright beautiful:

    var instr = Instruction{ .condition = .always, .input0 = .immediate, .input1 = .zero, .modify_flags = false, .command = load16, .output = jump };
    const jump = @bitCast(u16, instr);
    try self.code.resize(self.code.items.len + 4);
    mem.writeIntLittle(u16, self.code.items[self.code.items.len - 4 ..][0..2], @bitCast(u16, jump));
    mem.writeIntLittle(u16, self.code.items[self.code.items.len - 2 ..][0..2], got_addr);

Note: the var is needed because compile-time bitcasting of packed structure to u16 is a TODO in stage 1.

We know exactly what the instruction will do, without needing comments, because it’s spelled out explicitly. In contrast, the x64 backend is full of comments like these:

     // The encoding for `xor r32, r32` is `0x31 /r`.
     // Section 3.1.1.1 of the Intel x64 Manual states that "/r indicates that the
     // ModR/M byte of the instruction contains a register operand and an r/m operand."
     //
     // R/M bytes are composed of two bits for the mode, then three bits for the register,
     // then three bits for the operand. Since we're zeroing a register, the two three-bit
     // values will be identical, and the mode is three (the raw register value).

I spent a good chunk of time adding in those comments, but the fact is that with definitions like those used for the SPU II backend, they aren’t needed. Heck, it might be worth going through the x64 backend and doing something similar to std.spu.Instruction!

ELF16

ELF16 isn’t a real format. Instead, it’s ELF32 with 16-bit pointers. That’s it.


Articles from blogs I follow around the net

Fix tests with block sizes [2..5]

Ensures that the JIT and interpreter correctly run in tandem. Block sizes 2 through 5 now work correctly for the first test without any issues.

via ~pixelherodev/spu-ii log July 27, 2020

The falsehoods of anti-AGPL propaganda

Google is well-known for forbidding the use of software using the GNU Affero General Public License, commonly known as “AGPL”. Google is also well-known for being the subject of cargo-culting by fad startups. Unfortunately, this means that they are susceptib…

via Drew DeVault's Blog July 27, 2020

Status update, July 2020

Hi all! It’s time for another monthly status update. Yesterday I’ve released wlroots 0.11.0 and Sway 1.5! This is a pretty big release, with lots of new features and bug fixes. New features include headless outputs that can be created on-the-fly (one use-cas…

via emersion July 21, 2020

Generated by openring