Lexical Scanning in Go
GTUG Sydney
30 August 2011
Rob Pike
A video of this talk was recorded at the Go Sydney Meetup.
Many programming problems realign one data structure to fit another structure.
The pieces on either side have independent state, lookahead, buffers, ...
Can be messy to do well.
Coroutines were invented to solve this problem!
They enable us to write the two pieces independently.
Let's look at this topic in the context of a lexer.
Wanted to replace the old Go template package.
It had many problems.
The key change was re-engineering it with a true lexer, parser, and evaluator.
Has arbitrary text plus actions in {{ }}.
Evaluation: {{.Title}}
Constants and functions: {{printf "%g: %#3X" 1.2+2i 123}}
Control structures: {{range $s.Text}} {{.}} {{end}}
We must tokenize all of this.
Two things identify each lexed item:
// item represents a token returned from the scanner.
type item struct {
    typ itemType // Type, such as itemNumber.
    val string   // Value, such as "23.2".
}
The type is just an integer constant.
We use iota to define the values.
// itemType identifies the type of lex items.
type itemType int

const (
    itemError itemType = iota // error occurred;
                              // value is text of error
    itemDot                   // the cursor, spelled '.'
    itemEOF
    itemElse       // else keyword
    itemEnd        // end keyword
    itemField      // identifier, starting with '.'
    itemIdentifier // identifier
    itemIf         // if keyword
    itemLeftMeta   // left meta-string
    itemNumber     // number
    itemPipe       // pipe symbol
    itemRange      // range keyword
    itemRawString  // raw quoted string (includes quotes)
    itemRightMeta  // right meta-string
    itemString     // quoted string (includes quotes)
    itemText       // plain text
)
Printf has a convention making it easy to print any type: just define a String() method:
func (i item) String() string {
    switch i.typ {
    case itemEOF:
        return "EOF"
    case itemError:
        return i.val
    }
    if len(i.val) > 10 {
        return fmt.Sprintf("%.10q...", i.val)
    }
    return fmt.Sprintf("%q", i.val)
}
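For instance (a made-up illustration, not from the talk), the fmt package will now use this method automatically:

// Hypothetical demonstration of the String method.
fmt.Println(item{itemNumber, "23.2"})              // prints "23.2"
fmt.Println(item{itemText, "some long text here"}) // prints "some long "...
fmt.Println(item{itemEOF, ""})                     // prints EOF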
Many approaches are available: use a lexing tool, or write the lexer by hand.
There's nothing wrong with using a tool, but writing one by hand is easy! (I blogged about this last week.)
Plus, most programming languages lex pretty much the same tokens, so once we learn how, it's trivial to adapt the lexer for the next purpose.
Many people will tell you to write a switch statement, something like this:
// One iteration:
switch state {
case state1:
    state = action1()
case state2:
    state = action2()
case state3:
    state = action3()
}
Boring and repetitive and error-prone, but anyway:
Why switch?
After each action, you know where you want to be;
the new state is the result of the action.
But we throw that information away and recompute it from the state.
(A consequence of returning to the caller.)
A tool can compile that out, but so can we.
State represents where we are and what we expect.
Action represents what we do.
Actions result in a new state.
Let's put them together: a state function.
Executes an action, returns the next state—as a state function.
Recursive definition but simple and clear.
// stateFn represents the state of the scanner
// as a function that returns the next state.
type stateFn func(*lexer) stateFn
Our state machine is trivial: just run until the state goes to nil, representing "done".
// run lexes the input by executing state functions
// until the state is nil.
func run() {
    for state := startState; state != nil; {
        state = state(lexer)
    }
}
How do we make tokens available to the client?
Tokens can emerge at times when it is inconvenient to stop and return to the caller.
Use concurrency: run the state machine as a goroutine, and emit values on a channel.
Here is the lexer type. Notice the channel of items; ignore the rest for now.
// lexer holds the state of the scanner.
type lexer struct {
    name  string    // used only for error reports.
    input string    // the string being scanned.
    start int       // start position of this item.
    pos   int       // current position in the input.
    width int       // width of last rune read from input.
    items chan item // channel of scanned items.
}
A lexer initializes itself to lex a string and launches the state machine as a goroutine, returning the lexer itself and a channel of items. The API will change; don't worry about it now.
func lex(name, input string) (*lexer, chan item) {
    l := &lexer{
        name:  name,
        input: input,
        items: make(chan item),
    }
    go l.run() // Concurrently run state machine.
    return l, l.items
}
Here's the real state machine run function, which runs as a goroutine.
// run lexes the input by executing state functions until
// the state is nil.
func (l *lexer) run() {
    for state := lexText; state != nil; {
        state = state(l)
    }
    close(l.items) // No more tokens will be delivered.
}
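As a hypothetical usage sketch (not in the talk): the caller, typically the parser, just receives items from the channel at its own pace. The range loop below ends when run closes the channel.

// Hypothetical driver: lex a string and print every item.
_, items := lex("demo", `some text {{.Title}} more text`)
for it := range items {
    fmt.Println(it)
}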
A token is a type and a value, but (yay Go) the value can just be sliced from the input string.
The lexer remembers where it is in the input, and the emit routine just lobs that substring to the caller as the token's value.
input string // the string being scanned.
start int    // start position of this item.
pos   int    // current position in the input.
// emit passes an item back to the client.
func (l *lexer) emit(t itemType) {
    l.items <- item{t, l.input[l.start:l.pos]}
    l.start = l.pos
}
As the lexer begins, it's looking for plain text, so the initial state is the function lexText. It absorbs plain text until a "left meta" is encountered.
const leftMeta = "{{"
func lexText(l *lexer) stateFn {
    for {
        if strings.HasPrefix(l.input[l.pos:], leftMeta) {
            if l.pos > l.start {
                l.emit(itemText)
            }
            return lexLeftMeta // Next state.
        }
        if l.next() == eof {
            break
        }
    }
    // Correctly reached EOF.
    if l.pos > l.start {
        l.emit(itemText)
    }
    l.emit(itemEOF) // Useful to make EOF a token.
    return nil      // Stop the run loop.
}
A trivial state function.
When we get here, we know there's a leftMeta in the input.
func lexLeftMeta(l *lexer) stateFn {
    l.pos += len(leftMeta)
    l.emit(itemLeftMeta)
    return lexInsideAction // Now inside {{ }}.
}
This will give you the flavor.

func lexInsideAction(l *lexer) stateFn {
    // Either number, quoted string, or identifier.
    // Spaces separate and are ignored.
    // Pipe symbols separate and are emitted.
    for {
        if strings.HasPrefix(l.input[l.pos:], rightMeta) {
            return lexRightMeta
        }
        switch r := l.next(); {
        case r == eof || r == '\n':
            return l.errorf("unclosed action")
        case isSpace(r):
            l.ignore()
        case r == '|':
            l.emit(itemPipe)
        case r == '"':
            return lexQuote
        case r == '`':
            return lexRawQuote
        case r == '+' || r == '-' || '0' <= r && r <= '9':
            l.backup()
            return lexNumber
        case isAlphaNumeric(r):
            l.backup()
            return lexIdentifier
        }
    }
}
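The states this code hands off to (lexRightMeta, lexQuote, lexRawQuote, lexIdentifier) are not shown in the talk. As a rough sketch of how two of them might look (my assumptions, built from lexLeftMeta, the helper methods defined below, and the itemType list above):

// rightMeta is referenced above but not shown in the talk;
// its value must be the closing delimiter.
const rightMeta = "}}"

// lexRightMeta is assumed here by symmetry with lexLeftMeta:
// consume the delimiter and return to scanning plain text.
func lexRightMeta(l *lexer) stateFn {
    l.pos += len(rightMeta)
    l.emit(itemRightMeta)
    return lexText // Back outside {{ }}.
}

// lexIdentifier is a guess at the identifier/keyword state:
// absorb an alphanumeric word, classify it, keep lexing the action.
// (It ignores details such as fields starting with '.'.)
func lexIdentifier(l *lexer) stateFn {
    for isAlphaNumeric(l.next()) {
    }
    l.backup()
    switch word := l.input[l.start:l.pos]; word {
    case "if":
        l.emit(itemIf)
    case "else":
        l.emit(itemElse)
    case "end":
        l.emit(itemEnd)
    case "range":
        l.emit(itemRange)
    default:
        l.emit(itemIdentifier)
    }
    return lexInsideAction
}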
// next returns the next rune in the input.
func (l *lexer) next() (rune int) {
    if l.pos >= len(l.input) {
        l.width = 0
        return eof
    }
    rune, l.width = utf8.DecodeRuneInString(l.input[l.pos:])
    l.pos += l.width
    return rune
}
// ignore skips over the pending input before this point.
func (l *lexer) ignore() {
    l.start = l.pos
}
// backup steps back one rune.
// Can be called only once per call of next.
func (l *lexer) backup() {
    l.pos -= l.width
}
// peek returns but does not consume
// the next rune in the input.
func (l *lexer) peek() int {
    rune := l.next()
    l.backup()
    return rune
}
// accept consumes the next rune
// if it's from the valid set.
func (l *lexer) accept(valid string) bool {
    if strings.IndexRune(valid, l.next()) >= 0 {
        return true
    }
    l.backup()
    return false
}
// acceptRun consumes a run of runes from the valid set.
func (l *lexer) acceptRun(valid string) {
    for strings.IndexRune(valid, l.next()) >= 0 {
    }
    l.backup()
}
This is more accepting than it should be, but not by much; the caller must call Atof to validate.

func lexNumber(l *lexer) stateFn {
    // Optional leading sign.
    l.accept("+-")
    // Is it hex?
    digits := "0123456789"
    if l.accept("0") && l.accept("xX") {
        digits = "0123456789abcdefABCDEF"
    }
    l.acceptRun(digits)
    if l.accept(".") {
        l.acceptRun(digits)
    }
    if l.accept("eE") {
        l.accept("+-")
        l.acceptRun("0123456789")
    }
    // Is it imaginary?
    l.accept("i")
    // Next thing mustn't be alphanumeric.
    if isAlphaNumeric(l.peek()) {
        l.next()
        return l.errorf("bad number syntax: %q",
            l.input[l.start:l.pos])
    }
    l.emit(itemNumber)
    return lexInsideAction
}
Easy to handle: emit the bad token and shut down the machine.
// errorf returns an error token and terminates the scan
// by passing back a nil pointer that will be the next
// state, terminating l.run.
func (l *lexer) errorf(format string, args ...interface{}) stateFn {
    l.items <- item{
        itemError,
        fmt.Sprintf(format, args...),
    }
    return nil
}
Concurrency makes the lexer easy to design.
Goroutines allow the lexer and the caller (the parser) each to run at its own rate, as clean sequential code.
Channels give us a clean way to emit tokens.
Can't run a goroutine to completion during initialization.
Forbidden by the language specification.
(Raises awful issues about order of init, best avoided.)
That means we can't lex & parse a template during init.
The goroutine is a problem…
(Note: This restriction was lifted in Go version 1 but the discussion is still interesting.)
…but it's not necessary anyway.
The work is done by the design; now we just adjust the API.
We can change the API to hide the channel, provide a function to get the next token, and rewrite the run function.
It's easy.
Hide the channel and buffer it slightly, turning it into a ring buffer.
// lex creates a new scanner for the input string.
func lex(name, input string) *lexer {
    l := &lexer{
        name:  name,
        input: input,
        state: lexText,
        items: make(chan item, 2), // Two items sufficient.
    }
    return l
}
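Note the new state field: in this version the lexer struct must also record the current state function. A plausible updated struct (assumed; the talk doesn't show it) is:

// lexer holds the state of the scanner.
type lexer struct {
    name  string    // used only for error reports.
    input string    // the string being scanned.
    start int       // start position of this item.
    pos   int       // current position in the input.
    width int       // width of last rune read from input.
    state stateFn   // the next lexing function to enter. (Assumed field.)
    items chan item // channel of scanned items.
}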
Traditional lexer API: return next item.
Includes the modified state machine runner.
// nextItem returns the next item from the input.
func (l *lexer) nextItem() item {
    for {
        select {
        case item := <-l.items:
            return item
        default:
            l.state = l.state(l)
        }
    }
    panic("not reached")
}
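A hypothetical caller of this synchronous API (again, not from the talk):

// Hypothetical driver using nextItem: lexing advances only
// when the caller asks for another token.
l := lex("demo", `some text {{.Title}} more text`)
for {
    it := l.nextItem()
    fmt.Println(it)
    if it.typ == itemEOF || it.typ == itemError {
        break
    }
}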
We now have a traditional API for a lexer with a simple, concurrent implementation under the covers.
Even though the implementation is no longer truly concurrent, it still has all the advantages of concurrent design.
We wouldn't have such a clean, efficient design if we hadn't thought about the problem in a concurrent way, free of worries about "restart".
The model completely removes concerns about "structural mismatch".
Concurrency is not about parallelism.
(Although it can enable parallelism.)
Concurrency is a way to design a program by decomposing it into independently executing pieces.
The result can be clean, efficient, and very adaptable.
Lexers are fun.
Concurrency is fun.
Go is fun.
Go: go.dev
New templates: https://golang.org/pkg/exp/template/
(Next release will move them out of experimental.)
Rob Pike