I haven’t written much about my VHDL compiler recently and there has been little public GitHub activity. This is because I’ve spent the last six months completely rewriting the way the back-end code generation works and I finally merged the result back into the master branch last week. This rather lengthy post attempts to explain what I’ve done.

There were numerous problems with the old code generator. Before the rewrite, cgen.c was a 7000-line behemoth that took VHDL syntax trees in at one end and produced a complete LLVM module at the other. It dealt with everything from the minutiae of memory layout and branch-prediction hints, to how to implement displays for nested sub-programs, to high-level VHDL rules for things like bounds checking.

It was also bordering on unmaintainable: it had started life as a small toy project for learning LLVM and had slowly accreted features and optimisations as I implemented more and more of VHDL. Several bugs in it were just too painful to fix.

It was also very difficult to test: most of the other modules are tested using the Check C unit-testing framework, but I couldn’t find any way to do this with the LLVM output. Instead the code generator was tested indirectly by the regression tests that run through the full compiler and simulator, which makes it hard to test small, specific features such as optimisations.

I also had plans for several other VHDL-specific optimisations, and for unifying the elaboration-time evaluator with the code generator, that just weren’t tractable with the old system. For example, the old code generator threw away most VHDL type information early on and reduced variables and signals to LLVM’s 8-, 16-, 32-, or 64-bit integers, which misses many opportunities to eliminate bounds checks.

So I finally decided to rewrite it using a scheme I’d been toying with for a while. The back end is now based around a new intermediate form called “V-code” after the P-code of Pascal. A new “lowering” phase in lower.c translates VHDL syntax trees into V-code and the code generator in cgen.c translates V-code into LLVM.

The intention of the lowering phase is to translate VHDL into something executable without getting bogged down with implementation concerns. The code generation phase focusses on generating efficient LLVM bitcode without worrying about high-level features.

I wanted V-code to be higher-level than LLVM – so that I could express VHDL features like signals, processes suspending, bounds checking etc. – but lower level than VHDL so I could hide complexities such as array directions, aggregates, non-zero-based indices, and so on.

I think the V-code language I ended up with fits this pretty well. The model is infinite-register static single assignment, just like LLVM, but simplified in many ways – for example there are no phi nodes. V-code is strongly typed, with both “normal” types such as integers, pointers, records, and arrays, and VHDL-specific types such as signals, accesses, and files.

Unlike many similar intermediate languages, where an integer type is simply a bit width, in V-code integer registers have two ranges associated with them: a static “type” range and a dynamic “bound” range. For example, consider the following snippet:

subtype small_int is integer range 1 to 10;
 
function get_bit(x : bit_vector(1 to 20);
                 n : small_int) return bit is
    variable i : integer := n;
begin
    return x(i + 1);
end function;

If we compile this with the --dump-vcode argument the compiler will output the V-code:

Name       WORK.PACK.GET_BIT$JQuWORK.PACK.SMALL_INT;
Kind       function
Context    WORK.PACK-body
Blocks     1
Registers  12
Types      9
Variables  1
  I                                     // -2^31..2^31-1 
Result     0..1
Parameters 2
  r0    X                               // @<0..1> => 0..1 
  r1    N                               // -2^31..2^31-1 => 1..10 
Begin
   0: // Elided bounds check for r1 
      I := store r1
      r3 := const 1                     // -2^31..2^31-1 => 1 
      r4 := add r1 + r3                 // -2^31..2^31-1 => 2..11 
      // Elided bounds check for r4 
      r8 := sub r4 - r3                 // -2^31..2^31-1 => 1..10 
      r9 := cast r8                     // # => 1..10 
      r10 := add r0 + r9                // @<0..1> => 0..1 
      r11 := load indirect r10          // 0..1 
      // Elided bounds check for r11 
      return r11

The comments after each register definition give the type range and, after a “=>”, the bound range if it differs. The parameter N, represented by register r1, has a type range of -2^31..2^31-1, as it’s ultimately a subtype of INTEGER, but a bound range of 1..10, which means the compiler can assume its value is always within this range.

Here the old code generator would have inserted three bounds checks: one on the array access and two trivial subtype checks, on I’s initial value and on the returned value. The new code generator can eliminate all three, as indicated by the “Elided bounds check for …” comments.

The interesting case is the array access: the expression I + 1 ends up in register r4. This again has type range -2^31..2^31-1, so it’ll be represented by a 32-bit integer, but its bound range is 2..11, calculated from the bound range of the LHS (1..10) and the RHS (1..1). When we compare this against the array range 1..20 we know the check can never fail, and so skip it.

If we change the return statement to this:

return x(i + 11);

Then r4 has the bound range 12..21 and it’s possible for this array access to be out of bounds, so the lowering phase inserts a runtime check:

   0: // Elided bounds check for r1 
      I := store r1
      r3 := const 11                    // -2^31..2^31-1 => 11 
      r4 := add r1 + r3                 // -2^31..2^31-1 => 12..21 
      r5 := const 1                     // -2^31..2^31-1 => 1 
      bounds r4 in 1 .. 20
      r9 := sub r4 - r5                 // -2^31..2^31-1 => 11..20 
      r10 := cast r9                    // # => 11..20 
      r11 := add r0 + r10               // @<0..1> => 0..1 
      r12 := load indirect r11          // 0..1 
      // Elided bounds check for r12 
      return r12

As an aside, I spent a lot of time making the debug output from V-code human-readable. The output is colour-coded and all register definitions are annotated with their type and bounds. I’ve wasted too much time trying to read the textual LLVM IR format!

At the moment bounds propagation only occurs for certain arithmetic operations and only across a single basic block. However this seems to be effective for a large number of practical examples. Another neat side-effect of this scheme is that V-code has constant folding almost for free: a “constant” is any register with a bound range containing a single value. The emit_* functions which generate op-codes typically short-circuit and return a constant register when all their inputs are constants.

The following example shows how V-code represents VHDL-specific features like signals and process control flow.

architecture rtl of example is
    signal x : bit;
begin
    process is
    begin
        x <= '1';
        wait for 1 ps;
        x <= '0';
        wait for 2 ps;
    end process;
end architecture;

The wait-for statement here causes the process to suspend for the given amount of simulation time and then resume at the following statement. In NVC this is implemented as a jump table at the start of a function generated for a process. A wait statement stores the index of the next block in the process’s global state object and returns from the function. This works well but trying to debug the control flow of the generated LLVM can be very difficult. Instead V-code has “waiting” as a first-class control flow op-code: it looks just like a normal jump and the implementation details are hidden in the code generator.

Name       :example:line_7
Kind       process
Context    WORK.EXAMPLE.elab
Blocks     4
Registers  14
Types      8
Variables  0
Begin
   0: r0 := nets :example:x             // $<0..1> => 0..1 
      r1 := const 1                     // # => 1 
      alloc driver nets r0+r1 driven r0+r1
      return 
   1: r2 := const 0                     // -2^63..2^63-1 => 0 
      r3 := nets :example:x             // $<0..1> => 0..1 
      r4 := const 1                     // 0..1 => 1 
      // Elided bounds check for r4 
      r5 := const 0                     // -2^63..2^63-1 => 0 
      r6 := const 1                     // # => 1 
      sched waveform r3 count r6 values r4 reject r2 after r5
      r7 := const 1000                  // -2^63..2^63-1 => 1000 
      wait 2 for r7
   2: r8 := const 0                     // -2^63..2^63-1 => 0 
      r9 := nets :example:x             // $<0..1> => 0..1 
      r10 := const 0                    // 0..1 => 0 
      // Elided bounds check for r10 
      r11 := const 0                    // -2^63..2^63-1 => 0 
      r12 := const 1                    // # => 1 
      sched waveform r9 count r12 values r10 reject r8 after r11
      r13 := const 2000                 // -2^63..2^63-1 => 2000 
      wait 3 for r13
   3: jump 1

Operations on signals are also first-class: the type annotation $<0..1> means a signal of values in the range 0..1 (i.e. the bit type). The op-codes “alloc driver” and “sched waveform” handle common operations on signals without confusing the logic with implementation details (both of these are implemented as function calls into the runtime kernel).

The new structure allows me to write much more targeted unit tests. For example, the new test_lower.c test suite checks the generated op-codes for a large number of small examples. This means regression tests can be written for micro-optimisations and similar tweaks that were previously impossible to test.

Overall, performance is slightly better than with the previous code generator: the bigram.vhd micro-benchmark runs about 10% faster now. This test originally spent most of its time in the runtime kernel, but after I optimised that it is now dominated by the to_unsigned library function, so there’s still room for code-generation improvements there. However, I want to spend the next month or so fixing the large number of GitHub issues that have accumulated.