The Design of the Go Assembler

Gophercon

12 July 2016

Rob Pike

Google

Presentation on youtube.com

Video is here.

Motivation

Why Learn Assembler Language?

The most important single thing to realize about assembler language is that it enables the programmer to use all System/360 machine functions as if he were coding in System/360 machine language.

— A Programmer's Introduction to IBM System/360 Assembler Language, 1970, page 4

We still need assembly language

Once it was all you needed, then high-level languages like FORTRAN and COBOL came along.

But still needed today:

environment bootstrap (operating system and program startup, runtime)
low-level library code such as stack management and context switching
performance (math/big)
access to features not exposed in language such as crypto instructions

Also, perhaps most important: It's how we talk about the machine.

Knowing assembly, even a little, means understanding computers better.

What does it look like?

Some examples...

IBM System/360

1        PRINT NOGEN
2 STOCK1 START 0
3 BEGIN  BALR  11,0
4        USING *,11
5        MVC   NEWOH,OLDOH
6        AP    NEWOH,RECPT
7        AP    NEWOH,ISSUE
8        EOJ
11 OLDOH DC    PL4'9'
12 RECPT DC    PL4'4'
13 ISSUE DC    PL4'6'
14 NEWOH DS    PL4
15       END   BEGIN

Apollo 11 Guidance Computer

# TO ENTER A JOB REQUEST REQUIRING NO VAC AREA:

          COUNT     02/EXEC
                
NOVAC     INHINT
          AD        FAKEPRET     # LOC(MPAC +6) - LOC(QPRET)
          TS        NEWPRIO      # PRIORITY OF NEW JOB + NOVAC C(FIXLOC)

          EXTEND
          INDEX     Q            # Q WILL BE UNDISTURBED THROUGHOUT.
          DCA       0            # 2CADR OF JOB ENTERED.
          DXCH      NEWLOC
          CAF       EXECBANK
          XCH       FBANK
          TS        EXECTEM1
          TCF       NOVAC2       # ENTER EXECUTIVE BANK.

PDP-10

TITLE   COUNT
 
A=1                             ;Define a name for an accumulator.

START:  MOVSI A,-100            ;initialize loop counter.
                                ;A contains -100,,0
LOOP:   HRRZM A,TABLE(A)        ;Use right half of A to index.
        AOBJN A,LOOP            ;Add 1 to both halves (-77,,1 -76,,2 etc.)
                                ;Jump if still negative.
        .VALUE                  ;Halt program.

TABLE:  BLOCK 100               ;Assemble space to fill up.

END START                       ;End the assembly.

(From the MIT PDP-10 Info file)

PDP-11

/ a3 -- pdp-11 assembler pass 1

assem:
        jsr     pc,readop
        jsr     pc,checkeos
        br      ealoop
        tst     ifflg
        beq     3f
        cmp     r4,$200
        blos    assem
        cmpb    (r4),$21   /if
        bne     2f
        inc     ifflg
2:
        cmpb    (r4),$22   /endif
        bne     assem
        dec     ifflg
        br      assem

(From Unix v6 as/as13.s)

Motorola 68000

strtolower      public
                link    a6,#0           ;Set up stack frame
                movea   8(a6),a0        ;A0 = src, from stack
                movea   12(a6),a1       ;A1 = dst, from stack
loop            move.b  (a0)+,d0        ;Load D0 from (src)
                cmpi    #'A',d0         ;If D0 < 'A',
                blo     copy            ;skip
                cmpi    #'Z',d0         ;If D0 > 'Z',
                bhi     copy            ;skip
                addi    #'a'-'A',d0     ;D0 = lowercase(D0)
copy            move.b  d0,(a1)+        ;Store D0 to (dst)
                bne     loop            ;Repeat while D0 <> NUL
                unlk    a6              ;Restore stack frame
                rts                     ;Return
                end

(From Wikipedia)

CRAY-1

ident slice
         V6        0               ; initialize S
         A4        S0              ; initialize *x
         A5        S1              ; initialize *y
         A3        S2              ; initialize i
loop     S0        A3
         JSZ       exit            ; if S0 == 0 goto exit
         VL        A3              ; set vector length
         V11       ,A4,1           ; load slice of x[i], stride 1
         V12       ,A5,1           ; load slice of y[i], stride 1
         V13       V11 *F V12      ; slice of x[i] * y[i]
         V6        V6 +F V13       ; partial sum
         A14       VL              ; get vector length of this iteration
         A4        A4 + A14        ; *x = *x + VL
         A5        A5 + A14        ; *y = *y + VL
         A3        A3 - A14        ; i = i - VL
         J        loop
 exit

(From Robert Griesemer's PhD thesis)

Common structure

Columnar layout with function and variable declarations, labels, instructions.

Instructions:

subroutine header
label:
    instruction operand...    ; comment
    ...

Operands:

register
literal constant
address
register indirection (register as address)
...

There are exceptions such as Cray (A5 A5+A14) but they aren't conceptually different.

CPUs are all pretty much the same.

Use that commonality

We can use the common structure of all assemblers (CPUs, really) to construct a common grammar for all architectures.

This realization took some time.

The seeds were planted long ago.

Plan 9 assembly

Around 1986, Ken Thompson wrote a C compiler for the National 32000 (Sequent SMP).
Compiler generated pseudo-code, linker did instruction assignment.

The "assembler" was just a way to write that pseudo-code textually.

MOVW    $0, var

might become (hypothetical example)

XORW    R1, R1
STORE   R1, var

Note assembler emits the MOVW; the linker generates XORW and STORE.
We call this instruction selection.

Or consider RET, which becomes RET or JMP LR or JMP (R31) or ...

The assembler is just a way to hand-write the output the compiler produces.
(Compiler does not feed assembler, unlike in many other systems.)

The Plan 9 assemblers

Assembler for each architecture was a separate C program with a Yacc grammar,
adapted and partially rewritten for every architecture.

8a, 6a, va etc. corresponding to 8c, 6c vc, etc.
(One-letter codes: 8 for 386, 6 for AMD64, v for MIPS, etc.)

All very similar up front but different in detail.

The earliest Go implementations used this design, adding Go compilers 8g, 6g but using the Plan 9 assemblers unchanged.

The separation of (compiler/assembler)⇒linker allowed the Go linker to do more, including helping boot the runtime.

Go 1.3: Rearrange the pieces

Goal: Move to a pure Go implementation.
Preparation started in Go 1.3

New library that (in part) does instruction selection: "liblink" (as of 1.5, "obj").
Call it from the compiler.

Thus the first part of the old linker is now in the compiler.
The compiler now emits (mostly) real instructions, not pseudo-instructions.

Result: Slower compiler, but faster build.
Instruction selection for library code done once, not every time you link a program.

Assemblers also use obj.

For both compiler and assembler, the input is unchanged.
In fact the whole process is the same, just arranged differently.

Go 1.5: C must Go

More prep in Go 1.4, then in Go 1.5, all tooling moved to Go.

Compiler and linker machine-translated from C to Go.
The old liblink became a new suite of libraries, obj/...:

cmd/internal/obj (portable part)
cmd/internal/obj/x86 (architecture-specific part)
cmd/internal/obj/arm (architecture-specific part)
...

Previous presentations about this work:

Russ Cox at Gophercon 2014 (out of date): youtube.com/watch?v=QIE5nV5fDwA
Rob Pike at Gopherfest 2015: youtube.com/watch?v=cF1zJYkBW4A

Go 1.5: Compiler and linker as single programs

The many compilers (6g, 8g etc.) were replaced with a single tool: compile.
GOOS and GOARCH (only!) specify the target operating system and architecture.

GOOS=darwin GOARCH=arm go tool compile prog.go

Same for the linker: 6l, 8l, etc. become go tool link.

How can a single binary handle all these architectures?

Only one input language, only one output generator (the obj library).
The target is configured when the tool starts.

Go 1.5 Assembler

Unlike the old compilers, which shared much code, the old assemblers were all different programs.
(Although they were very similar inside, they shared almost no code.)

Proposal: Write a single go tool asm from scratch in Go, replacing all the old assemblers.

GOOS and GOARCH tell you what the target is.

But assembly language isn't Go. Every machine has a different assembly language.

Well, not really! Not quite universal across machines, but ...

An example program

Look at the generated assembly for this simple program:

package add

func add(a, b int) int {
    return a + b
}

For each architecture, with some noise edited out:

32-bit x86 (386)

TEXT add(SB), $0-12
    MOVL    a+4(FP), BX
    ADDL    b+8(FP), BX
    MOVL    BX, 12(FP)
    RET

64-bit x86 (amd64)

TEXT add(SB), $0-24
    MOVQ    b+16(FP), AX
    MOVQ    a+8(FP), CX
    ADDQ    CX, AX
    MOVQ    AX, 24(FP)
    RET

32-bit arm

TEXT add(SB), $-4-12
    MOVW    a(FP), R0
    MOVW    b+4(FP), R1
    ADD     R1, R0
    MOVW    R0, 8(FP)
    RET

64-bit arm (arm64)

TEXT add(SB), $-8-24
    MOVD    a(FP), R0
    MOVD    b+8(FP), R1
    ADD     R1, R0
    MOVD    R0, 16(FP)
    RET

S390 (s390x)

TEXT add(SB), $0-24
    MOVD    a(FP), R1
    MOVD    b+8(FP), R2
    ADD     R2, R1, R1
    MOVD    R1, 16(FP)
    RET

64-bit MIPS (mips64)

TEXT add(SB), $-8-24
    MOVV    a(FP), R1
    MOVV    b+8(FP), R2
    ADDVU   R2, R1
    MOVV    R1, 16(FP)
    RET

64-bit Power (ppc64le)

TEXT add(SB), $0-24
    MOVD    a(FP), R2
    MOVD    b+8(FP), R3
    ADD     R3, R2
    MOVD    R2, 16(FP)
    RET

Common grammar

They all look the same. (Partly by design, partly because they are the same.)

The only significant variation is the names of instructions and registers.
Many details hidden, such as what RET is. (It's a pseudo-instruction.)

(Offsets are determined by size of int, among other things.)

The fortuitous syntax originated in Ken's National 32000 assembler.

With common syntax and the obj library, can build a single assembler for all CPUs.

Aside: Downside

Not the same assembly notation as the manufacturers'.
Can be offputting to outsiders.

On the other hand, this approach uses the same notation on all machines.
New architectures can arrive without creating or learning new notation.

A tradeoff worth making.

Design of the Go 1.5 assembler

The apotheosis of assemblers.

New program, entirely in Go.

Common lexer and parser across all architectures.
Each instruction parsed into an instruction description.
That becomes a data structure passed to the new obj library.

The core of the assembler has very little per-machine information.
Instead, tables are constructed at run time, flavored by $GOARCH.

An internal package, cmd/asm/internal/arch, creates these tables on the fly.
Machine details are loaded from obj.

An example: initializing the 386

import (
    "cmd/internal/obj"
    "cmd/internal/obj/x86"
)

func archX86(linkArch *obj.LinkArch) *Arch {
    register := make(map[string]int16)
    // Create maps for easy lookup of instruction names etc.
    for i, s := range x86.Register {
        register[s] = int16(i + x86.REG_AL)
    }
    instructions := make(map[string]obj.As)
    for i, s := range obj.Anames {
        instructions[s] = x86.As(i)
    }
    return &Arch{
        Instructions:   instructions,
        Register:       register,
        ...
    }
}

Parser just does string matching to find the instruction.

An example: ADDW on 386

Given an assembly run with GOOS=386, the instruction

ADDW AX, BX

is parsed into in a data structure schematically like:

&obj.Prog{
    As: arch.Instructions["ADDW"],
    From: obj.Addr{Reg: arch.Register["AX"]},
    To: obj.Addr{Reg: arch.Register["BX"]},
    ...
}

That gets passed to the obj library for encoding as a 386 instruction.

This is a purely mechanical process devoid of semantics.

Validation

Assembler does some validation:

lexical and syntactic correctness
operand syntax
(with some variation. e.g.: [R2,R5,R8,g] only legal on ARM)

But all semantic checking is done by the obj library.

If it can be turned into real instructions, it's legal!

Testing

New assembler was tested against the old (C-written) ones.

A/B testing at the bit level: Same input must give same output.
Also reworked some parts of obj packages for better diagnostics and debugging.

Did 386 first, then amd64, arm, and ppc. Each was easier than the last.

No hardware manuals were opened during this process.

Result

One Go program replaces many C/Yacc programs, so it's easier to maintain.
As a Go program it can have proper tests.

Dependent on obj, so correctness and completeness are relatively simple to guarantee.

New assembler almost 100% compatible with previous ones.
Incompatibilities were mostly inconsistencies.

Portability is easy now.

A new instruction set just needs connecting it up with the obj library,
plus a minor amount of architecture-specific tuning and validation.

Several architectures have been added since the assembler was created,
most by the open source community.

Tables

To a large extent, the assembler is now table-driven.
Can we generate those tables?

The disassemblers (used by go tool pprof) are created by machine processing of PDFs.
The architecture definition is machine-readable, so use it!

Plan to go the other way:

Read in a PDF, write out obj library definitions and bind to assembler.
Why write by hand when you can automate?

Hope to have this working soon; basics are already in place.

Result: a largely machine-generated assembler.

Conclusion

Assembly language is essentially the same everywhere.

Use that to build a true common assembly language.

Customize it on the fly using dynamically loaded tables.

And one day: create those tables automatically.

A portable solution to a especially non-portable problem.

Thank you