goyacc Tutorial

When given a SQL query, the first job of query processing is to apply lexical and syntax analysis. Lexical analysis (or tokenization) converts the query statement into a sequence of tokens. Syntax analysis (or parsing) analyzes symbols conforming certain rules. goyacc is the Go version of yacc, which helps to do syntax analysis easily.

Environment

  • macOS Catalina, Version 10.15.7.

Install goyacc

The yacc tool is not installed by default anymore with Go since version 1.8. The repository is still there so we can install it manually.

  1. Download the goyacc directory via SVN (according to this post).

    $ sudo apt install subversion
    $ svn export https://github.com/golang/tools/trunk/cmd/goyacc
    
  2. Build.

    $ go build yacc.go 
    
  3. Copy the binary to Go’s tool directory.

    $ sudo cp yacc /usr/local/go/pkg/tool/linux_amd64/
    
  4. Run.

    $ go tool yacc
    usage: yacc [-o output] [-v parsetable] input
    

A simple example

Let’s start with a simple example that has just one file expr.y. It implements the basic addition.

/******************************************************************************
 * declarations
 *****************************************************************************/

%{
package main
import (
    "fmt"
    "os"
)
%}

%union {
    num int
}
%type <num> expr expr1
%token '+'
%token <num> NUM

%%
/******************************************************************************
 * rules
 *****************************************************************************/

top: expr { fmt.Println($1) }
expr: expr1 | expr '+' expr1 { $$ = $1 + $3 }
expr1: NUM

%%
/******************************************************************************
 * programs (in Go)
 *****************************************************************************/

const eof = 0

type exprLex struct {
    line string
    index int
}

func (x *exprLex) Lex(yylval *yySymType) int {
    if x.index >= len(x.line) {
        return eof
    }
    if x.line[x.index] == '+' {
        x.index++
        return int('+')
    }
    num := 0
    for ; x.index < len(x.line); x.index++ {
        if x.line[x.index] < '0' || x.line[x.index] > '9' {
            break
        }
        num = num * 10 + int(x.line[x.index] - '0')
    }
    yylval.num = num
    return NUM
}

func (x *exprLex) Error(s string) {
    fmt.Println("parse error: ", s)
}

func main() {
    yyParse(&exprLex{line: os.Args[1]})
}

To build:

$ go tool yacc expr.y

This generates two new files:

  • y.go — the parser output.
  • y.output — the grammar report.

The Go code generated can be used to calculate, for example, 1+2:

$ go run y.go "1+2"
3

Let’s go through the code step by step.

Structure

A yacc file consists of three parts. They are separated by the symbol %%. The structure can be summarized as:

{% /* Embedded Go codes */ %}
/* Declarations — including %union, %type, %token, %left, %right, %start, etc. */
%%
/* Rules — including terminal symbols, non-terminal symbols, and their rules. */
%%
/* Embedded Go codes. */

It’s also a common practice to move the third part to separate Go files. Take the above expr.y for instance, two additinal Go files can be used to hold the content of the thrid part:

  • lex.go to include the constant eof, structure exprLex, and functions Lex and Error.
  • main.go to include the function main.

Interfaces

Looking into the generated Go code of y.go, there are two interfaces.

type yyLexer interface {
	Lex(lval *yySymType) int
	Error(s string)
}

type yyParser interface {
	Parse(yyLexer) int
	Lookahead() int
}
  • yyLexer implements the lexical analysis. It requires users to provide the implementation (which are the functions Lex and Error in expr.y).
  • yyParser implements the the syntax analysis. It’s generated by goyacc (which are the functions Parse and Lookahead in y.go).

Note that yyLexer and yyParser are the default interface names generated by goyacc. Both can be custommized by telling the prefix preferred. For example, the command below will generate interfaces named exprLexer and exprParser.

$ go tool yacc -p "expr" expr.y

Terminal and non-terminal symbols

From the part of declarations,

  • %union declares types and maps each of them to a type of Go data structure.
  • type declares non-terminal symbols.
  • token declars terminal symbols. A terminal symbol can not be split further, and must be one of the types defined in %union. A token is fetched by the lexer (the function Lex in expr.y) and then passed to the parser.

The part of rules defines each non-terminal symbol and their actions. In general, a non-terminal symbol can be another non-terminal symbol or terminal symbol (listed with the delimiter |).

The action of a non-terminal symbol definition is optional. If any, the action is defined inside of braces {} in Go codes. For a non-terminal symbol and its definition, the action uses $$ to represent the symbol itself while using $1, $2, … to represent each symbol that constitutes that symbol. For example, in expr.y we have:

expr: expr1 | expr '+' expr1 { $$ = $1 + $3 }
expr1: NUM
  • expr1 is a non-terminal symbol. It can be a token of NUM (which is a terminal symbol).
  • expr is a non-terminal symbol. It can be either expr1 or expr '+' expr1. For the latter definition (expr: expr '+' expr1), inside the action, $$ is expr, $1 is expr, $2 is +, and $3 is expr1.

Contents