When given a SQL query, the first job of query processing is to apply lexical
and syntax analysis.
Lexical analysis (or tokenization) converts the query statement into a sequence
of tokens.
Syntax analysis (or parsing) analyzes symbols conforming certain rules.
goyacc
is the Go version of yacc
, which helps to do syntax analysis easily.
Environment
- macOS Catalina, Version 10.15.7.
Install goyacc
The yacc tool is not installed by default anymore with Go since version 1.8. The repository is still there so we can install it manually.
-
Download the
goyacc
directory via SVN (according to this post).$ sudo apt install subversion $ svn export https://github.com/golang/tools/trunk/cmd/goyacc
-
Build.
$ go build yacc.go
-
Copy the binary to Go’s tool directory.
$ sudo cp yacc /usr/local/go/pkg/tool/linux_amd64/
-
Run.
$ go tool yacc usage: yacc [-o output] [-v parsetable] input
A simple example
Let’s start with a simple example that has just one file expr.y
.
It implements the basic addition.
/******************************************************************************
* declarations
*****************************************************************************/
%{
package main
import (
"fmt"
"os"
)
%}
%union {
num int
}
%type <num> expr expr1
%token '+'
%token <num> NUM
%%
/******************************************************************************
* rules
*****************************************************************************/
top: expr { fmt.Println($1) }
expr: expr1 | expr '+' expr1 { $$ = $1 + $3 }
expr1: NUM
%%
/******************************************************************************
* programs (in Go)
*****************************************************************************/
const eof = 0
type exprLex struct {
line string
index int
}
func (x *exprLex) Lex(yylval *yySymType) int {
if x.index >= len(x.line) {
return eof
}
if x.line[x.index] == '+' {
x.index++
return int('+')
}
num := 0
for ; x.index < len(x.line); x.index++ {
if x.line[x.index] < '0' || x.line[x.index] > '9' {
break
}
num = num * 10 + int(x.line[x.index] - '0')
}
yylval.num = num
return NUM
}
func (x *exprLex) Error(s string) {
fmt.Println("parse error: ", s)
}
func main() {
yyParse(&exprLex{line: os.Args[1]})
}
To build:
$ go tool yacc expr.y
This generates two new files:
y.go
— the parser output.y.output
— the grammar report.
The Go code generated can be used to calculate, for example, 1+2
:
$ go run y.go "1+2"
3
Let’s go through the code step by step.
Structure
A yacc
file consists of three parts.
They are separated by the symbol %%
.
The structure can be summarized as:
{% /* Embedded Go codes */ %}
/* Declarations — including %union, %type, %token, %left, %right, %start, etc. */
%%
/* Rules — including terminal symbols, non-terminal symbols, and their rules. */
%%
/* Embedded Go codes. */
It’s also a common practice to move the third part to separate Go files.
Take the above expr.y
for instance, two additinal Go files can be used to hold
the content of the thrid part:
lex.go
to include the constanteof
, structureexprLex
, and functionsLex
andError
.main.go
to include the functionmain
.
Interfaces
Looking into the generated Go code of y.go
, there are two interfaces.
type yyLexer interface {
Lex(lval *yySymType) int
Error(s string)
}
type yyParser interface {
Parse(yyLexer) int
Lookahead() int
}
yyLexer
implements the lexical analysis. It requires users to provide the implementation (which are the functionsLex
andError
inexpr.y
).yyParser
implements the the syntax analysis. It’s generated bygoyacc
(which are the functionsParse
andLookahead
iny.go
).
Note that yyLexer
and yyParser
are the default interface names generated by
goyacc
.
Both can be custommized by telling the prefix preferred.
For example, the command below will generate interfaces named exprLexer
and exprParser
.
$ go tool yacc -p "expr" expr.y
Terminal and non-terminal symbols
From the part of declarations,
%union
declares types and maps each of them to a type of Go data structure.type
declares non-terminal symbols.token
declars terminal symbols. A terminal symbol can not be split further, and must be one of the types defined in%union
. A token is fetched by the lexer (the functionLex
inexpr.y
) and then passed to the parser.
The part of rules defines each non-terminal symbol and their actions.
In general, a non-terminal symbol can be another non-terminal symbol or terminal
symbol (listed with the delimiter |
).
The action of a non-terminal symbol definition is optional.
If any, the action is defined inside of braces {}
in Go codes.
For a non-terminal symbol and its definition, the action uses $$
to represent
the symbol itself while using $1
, $2
, … to represent each symbol that
constitutes that symbol.
For example, in expr.y
we have:
expr: expr1 | expr '+' expr1 { $$ = $1 + $3 }
expr1: NUM
expr1
is a non-terminal symbol. It can be a token ofNUM
(which is a terminal symbol).expr
is a non-terminal symbol. It can be eitherexpr1
orexpr '+' expr1
. For the latter definition (expr: expr '+' expr1
), inside the action,$$
isexpr
,$1
isexpr
,$2
is+
, and$3
isexpr1
.