Complete Compilation

Author

Ken Pu

1 Syntax

For simplicity, we will focus on a very simple language as illustrated the following sample code. We will call this language Calc.

"compute_area"
pi = 31415
a = 3
b = 4
area = pi * (a + b) * (a + b)

print area

The grammar can be quite easily specified in Antlr.

For brevity, we will omit the lexer rules. The grammar defines the following:

  • ID: variable identifier
  • NUMBER: integer literals
  • STRING: double-quoted strings
  • NL: newline
  • Whitespaces are discarded by the tokenizer.

Calc.g4

grammar Calc;

program
  : STRING NL* block EOF
  ;

We include the program name as a string in the source code.

It is then followed by a block of code.

block
  : (stmt NL)* stmt NL?
  ;

A block is one or more newline separated statements stmt. Note the last newline is optional.

stmt
  : assign
  | print
  ;

A statement can either be a variable assignment assign, or a print statement print.

assign
  : ID '=' expr
  ;

Variable assignments assign any expression expr to a variable specified by its ID.

print
  : 'print' expr
  ;

We can print the expression value using the print statement.

expr
  : expr ('*' | '/' | '+' | '/') expr
  | '(' expr ')'
  | NUMBER
  | ID
  ;

An expression expr is any arithmetic expression. It allows number literals and variable names.

2 Compiler Backend

The compiler backend are the Kotlin classes that implement either Stmt or Expr abstract classes as described in Compiler Basics

We have already seen many programming constructs as implementations of Stmt or Expr.

Here is an example. It is the implementation for integer literals.

class IntExpr(
    val value: Int,
): Expr() {
    override fun gen(scope: Scope): List<String> 
    = listOf(
        "ldc $value"
    )
    companion object {
        @JvmStatic
        fun make(text:String): IntExpr
        = IntExpr(Integer.parseInt(text))
    }
}

Note:

Each implementation has a companion object with the make method that can construct an object from the parsed results.

In the case of IntExpr, it can be constructed by IntExpr.make(lexeme) where lexeme is the text string of the NUMBER token in the parse tree.

Another example is the arithmetic expression as implemented by ArithExpr.

class ArithExpr(
    val op: ArithExpr.Op,
    val left: Expr,
    val right: Expr,
): Expr() {
    enum class Op {Add, Sub, Mul, Div}
    override fun gen(scope: Scope): List<String> 
    = left.gen(scope) +
      right.gen(scope) +
      when(op) {
          Op.Add -> "iadd"
          Op.Sub -> "isub"
          Op.Mul -> "imul"
          Op.Div -> "idiv"
      }
    companion object {
      @JvmStatic
      fun make(
        optext:String,
        x:Expr,
        y:Expr
      ):ArithExpr {
        val op = when(optext) {
            "+" -> ArithExpr.Op.Add
            "-" -> ArithExpr.Op.Sub
            "*" -> ArithExpr.Op.Mul
            "/" -> ArithExpr.Op.Div
            else -> null
        } ?: throw Exception("Unknown operator")
        return ArithExpr(op, x, y)
      }
    }

}

This is an expression (limited to integer arithmetics). It has an ArithExpr.make(...) method that constructs the expression object based on operator token lexeme, and sub-expression objects.

2.1 A list of the compiler backend objects

class ClassStmt: Stmt()
class MethodStmt: Stmt()
class Print: Stmt()
class AssignStmt: Stmt()
class BlockStmt: Stmt()
class IntExpr: Expr()
class DerefExpr: Expr()
class ArithExpr: Expr()

3 Syntax Directed Definitions

With the support for data attributes and embedded actions, we can implement the SDD compiler object construction as part of the ANTLR grammar file.

Lets go through selected modification.

grammar Calc;

@header {
  import java.util.List;
  import java.util.ArrayList;
}

We need to build lists of statement objects using ArrayList later.

program returns [ClassStmt code]
  : STRING NL* block EOF
    { $code = ClassStmt.make(
        $STRING.text,
        $block.code); 
    }
  ;


Note:

  1. program rules now have a data attribute ClassStmt code.
  2. The $code data attribute is constructed by the action that uses ClassStmt.make function to build the class object.
block returns [BlockStmt code]
  @init { List<Stmt> stmtList = new ArrayList<>(); }
  : (x=stmt { stmtList.add($x.code); } NL) *
    y=stmt { stmtList.add($y.code); } NL?
    { $code = BlockStmt.make(stmtList); }
  ;








Note:

  1. block rule has a data attribute BlockStmt code.
  2. It also has a local variable List<Stmt> stmtList that is initialized in the @init action.
  3. In the iteration over (stmt NL)* stmt, we have embedded actions which updates the stmtList with the newly created Stmt objects.

Also note that parse tree labels used in the rule: x=stmt ... y=stmt.

  1. An action is added at the end to construct the BlockStmt using the BlockStmt.make function.

4 A Complete Compiler

Now, we can integrate everything together into a single compiler executable.

The Compiler toolchain was developered in earlier lectures. See Generating Java class and Java class loader sections in Compiler Basics

import org.antlr.v4.runtime.*
import java.io.*

fun main(args:List<String>) {
    val sourceFile = args[0]
    val input = CharStreams.fromFile(sourceFile)
    val lexer = CalcLexer(input)
    val tokens = CommonTokenStream(lexer)
    val parser = CalcParser(tokens)
    
    parser.program().code.let {
        Compiler.compile(it)
        Compiler.run(it.className)
    }
}

Note

  • Antlr generates CalcLexer and CalcParser from the grammar file Calc.g4.
  • From the parser, we can obtain the ClassStmt code data attribute.
  • The Compiler object is the compiler toolchain developed earlier.

4.1 Putting things together:

Let’s go through the build steps of our compiler:

  1. Compile the backend source code (.kt) to Java classes:
$ kotinc -d target/backend \
     -cp jasmin.jar \
     src/backend/*.kt
  1. Generate from the grammar file (*.g4) the lexer and parser source code files (.java)
$ java -jar antlr.jar \
      -o target/grammar \
      src/antlr/*.g4
  1. Compile the parser class
$ javac -d target/parser \
    -cp antlr.jar:target/backend \
    target/grammar/src/antlr/*.java 
  1. Compile the actual compiler as a kotlin application
kotlinc -d target \
    -cp jasmin.jar:antlr.jar:target/backend:target/parser \
    Main.kt

This allows us to parse, compile and run our own scripts.

Here is a example:

sample.txt

"Sample"
x = 3
y = 4
z = x * y
z = z + 10
print z

We can run:

$ java MainKt sample1.txt
22

5 Stack and Register Allocation

Let’s take a closer look at the ClassStmt’s make method:

class ClassStmt( ... ) {
    ...
    companion object {
        @JvmStatic
        fun make(name: String, body: Stmt):ClassStmt
        = ClassStmt(
            name.replace("\"", ""),
            MethodStmt(
                "main",
                "([Ljava/lang/String;)V",
                stackLimit = 10,  // 👈
                localsLimit = 10, // 👈
                body
            )
        )
    }
}

Note

We limited the stack size and register count to 10.

5.1 Register Count Estimation

Registers are used for local variables. So, a conservative estimate is just the number of local variables defined in the scope.

Conservative estimate of localsLimit

fun estimateLocals(code:Code):Int {
    val scope = Scope().apply {
        code.gen(this)
    }
    return scope.size
}

This is not an optimal estimation. Consider the code:

x = 1
print x
y = 2
print y
  • Both x and y can use the same register, so the optimal estimation is 1.
  • But estimateLocals will be 2.

5.2 Estimation of stack size

Consider the code for 1 + 2 + 3 + 4.

The generated code is:

ldc 1
ldc 2
iadd
ldc 3
iadd
ldc 4

This only requires a stack size of 2.

A very similar expression: 1 + (2 + (3 + 4))) generates the following bytecode:

ldc 1
ldc 2
ldc 3
ldc 4
iadd
iadd
iadd

This requires stack size of 4.

A conservative estimate of the stack size is the maximum height of all expressions in the code. The height of an Expr node can be computed by the parser as part of the SDD.

expr returns [Expr code, int height]
  : NUMBER { $code = IntExpr.make($NUMBER.text);
             $height = 1; }
  | x=expr op='+' y=expr
    { $code = ArithExpr.make($op, $x, $y);
      $height = max($x.height, $y.height) + 1; }
  | ...
  ;

Conservative estimate of stack size

Given a code:Code, we need to locate all expressions Expr in the code using tree walk, and return the maximum expression height.

6 Conclusion

6.1 We covered

6.2 Future Compiler Research