Advanced Lexical Analysis

Author

Ken Pu

1 Introduction to ANTLR

1.1 What is ANTLR?

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It’s widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build and walk parse trees.

Terence Parr, the author of ANTLR

1.2 Lexical analysis using ANTLR

ANTLR can generate a lexer class for us. It takes a lexer grammar file (extension .g4) as input and translates it into the source code of a Java class.

2 Example

2.1 The grammar file

   // SampleLexer.g4
   lexer grammar SampleLexer;

   WHITESPACE    : [ \t]+;
   NEWLINE       : [\r\n]+;
   NUMBER        : [0-9]+;
   WORD          : [a-zA-Z]+;
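
To make the four rules concrete, the following is a minimal sketch of the same token classes written with java.util.regex. It is not the code ANTLR generates; the class name RegexLexerSketch and the method tokenize are hypothetical names chosen for illustration. It also throws on a character that no rule matches, which is a simplification of how a generated lexer reports errors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLexerSketch {
    // One named group per lexer rule, listed in the same order as the grammar.
    private static final String[] RULES = {"WHITESPACE", "NEWLINE", "NUMBER", "WORD"};
    private static final Pattern TOKEN = Pattern.compile(
        "(?<WHITESPACE>[ \\t]+)|(?<NEWLINE>[\\r\\n]+)|(?<NUMBER>[0-9]+)|(?<WORD>[a-zA-Z]+)");

    // Returns entries such as "WORD:hello"; throws if no rule matches at some offset.
    public static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        int pos = 0;
        while (pos < text.length()) {
            m.region(pos, text.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            // Exactly one group participated in the match; find out which one.
            for (String rule : RULES) {
                if (m.group(rule) != null) {
                    out.add(rule + ":" + m.group(rule));
                    break;
                }
            }
            pos = m.end();
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [WORD:hello, WHITESPACE: , NUMBER:123]
        System.out.println(tokenize("hello 123"));
    }
}
```

Because the four character classes do not overlap, each position in the input can be matched by at most one rule, so the order of the alternatives does not affect the result here.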

2.2 ANTLR Toolchain

$ java -jar /antlr-4.11.1-complete.jar 
ANTLR Parser Generator  Version 4.11.1
 -o ___              specify output directory where all output is generated
 -lib ___            specify location of grammars, tokens files
 -atn                generate rule augmented transition network diagrams
 -encoding ___       specify grammar file encoding; e.g., euc-jp
 -message-format ___ specify output style for messages in antlr, gnu, vs2005
 -long-messages      show exception details when available for errors and warnings
 -listener           generate parse tree listener (default)
 -no-listener        don't generate parse tree listener
 -visitor            generate parse tree visitor
 -no-visitor         don't generate parse tree visitor (default)
 -package ___        specify a package/namespace for the generated code
 -depend             generate file dependencies
 -D<option>=value    set/override a grammar-level option
 -Werror             treat warnings as errors
 -XdbgST             launch StringTemplate visualizer on generated code
 -XdbgSTWait         wait for STViz to close before continuing
 -Xforce-atn         use the ATN simulator for all predictions
 -Xlog               dump lots of logging info to antlr-timestamp.log
 -Xexact-output-dir  all output goes into -o dir regardless of paths/package

Let’s generate the lexer Java class.

$ java -jar /antlr-4.11.1-complete.jar ./SampleLexer.g4
$ tree .
.
├── SampleLexer.g4
├── SampleLexer.interp    <-- new
├── SampleLexer.java      <-- new
└── SampleLexer.tokens    <-- new

Next, compile the generated Java source into a class file.

$ javac -cp /antlr-4.11.1-complete.jar:. ./SampleLexer.java 
$ tree .
.
├── SampleLexer.class    <-- new
├── SampleLexer.g4
├── SampleLexer.interp
├── SampleLexer.java
└── SampleLexer.tokens

2.3 Using the lexer in Kotlin

@file:DependsOn("/antlr-4.11.1-complete.jar")
@file:DependsOn(".")
import org.antlr.v4.runtime.*
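// Wrap the input text in a character stream for the lexer.
val input: CharStream = CharStreams.fromString("hello 123")
val lexer = SampleLexer(input)
// fill() runs the lexer to the end of the input so that getTokens() returns every token.
val stream: CommonTokenStream = CommonTokenStream(lexer)
val tokens: List<Token> = stream.apply {
    this.fill()
}.getTokens()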
tokens.joinToString("\n")

Output, one token per line in the form [@index,start:stop='text',<type>,line:column]:

[@0,0:4='hello',<4>,1:0]
[@1,5:5=' ',<1>,1:5]
[@2,6:8='123',<3>,1:6]
[@3,9:8='<EOF>',<-1>,1:9]
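
The numbers in angle brackets are token types. For this grammar they follow the order of the rules, starting at 1, and -1 marks the end of input. The sketch below decodes one printed token line back into a rule name. It is only an illustration: the class name TokenLineSketch and the method describe are hypothetical, the name table is written out by hand from the grammar above, and the pattern assumes tokens on the default channel, as in the output shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenLineSketch {
    // Rule names indexed by token type; index 0 is unused because types start at 1.
    private static final String[] NAMES = {null, "WHITESPACE", "NEWLINE", "NUMBER", "WORD"};

    // Groups: 1 index, 2 start, 3 stop, 4 text, 5 type, 6 line, 7 column.
    private static final Pattern LINE = Pattern.compile(
        "\\[@(\\d+),(\\d+):(\\d+)='(.*)',<(-?\\d+)>,(\\d+):(\\d+)\\]");

    // Turns a printed token line into a short human-readable description.
    public static String describe(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("unrecognized token line: " + line);
        }
        int type = Integer.parseInt(m.group(5));
        String name = (type == -1) ? "EOF" : NAMES[type];
        return name + " '" + m.group(4) + "' at line " + m.group(6) + ", column " + m.group(7);
    }

    public static void main(String[] args) {
        // prints WORD 'hello' at line 1, column 0
        System.out.println(describe("[@0,0:4='hello',<4>,1:0]"));
    }
}
```

In a real program the same information is available directly from each Token object, so string parsing like this is useful only for reading printed output.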
To be completed