@file:DependsOn("/antlr-4.11.1-complete.jar")
@file:DependsOn(".")
Structured Lexical Analysis With ANTLR
1 Introduction to ANTLR
1.1 What is ANTLR?
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It’s widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build and walk parse trees.
Terence Parr, the author of ANTLR
1.2 Lexical analysis using ANTLR
ANTLR can help us with the construction of a lexer class. It requires a lexer grammar file .g4
, which is converted to a Java class.
2 Example
2.1 The grammar file
// SampleLexer.g4 lexer grammar SampleLexer; WHITESPACE : [ \t]+; NEWLINE : [\r\n]+; NUMBER : [0-9]+; WORD : [a-zA-Z]+;
2.2 A slightly different syntax
WHITESPACE : ' ' | '\t';
NEWLINE : '\r' | '\n';
NUMBER: ' ('0' .. '9')+;
WORD: (('a' .. 'z') | ('A' .. 'Z'))+;
2.3 ANTLR Toolchain
$ java -jar /antlr-4.11.1-complete.jar
ANTLR Parser Generator Version 4.11.1
-o ___ specify output directory where all output is generated
-lib ___ specify location of grammars, tokens files
-atn generate rule augmented transition network diagrams
-encoding ___ specify grammar file encoding; e.g., euc-jp
-message-format ___ specify output style for messages in antlr, gnu, vs2005
-long-messages show exception details when available for errors and warnings
-listener generate parse tree listener (default)
-no-listener don't generate parse tree listener
-visitor generate parse tree visitor
-no-visitor don't generate parse tree visitor (default)
-package ___ specify a package/namespace for the generated code
-depend generate file dependencies
-D<option>=value set/override a grammar-level option
-Werror treat warnings as errors
-XdbgST launch StringTemplate visualizer on generated code
-XdbgSTWait wait for STViz to close before continuing
-Xforce-atn use the ATN simulator for all predictions
-Xlog dump lots of logging info to antlr-timestamp.log
-Xexact-output-dir all output goes into -o dir regardless of paths/package
Let’s generate the lexer Java class.
$ java -jar /antlr-4.11.1-complete.jar ./SampleLexer.g4
$ tree .
.
├── SampleLexer.g4
├── SampleLexer.interp <-- new
├── SampleLexer.java <-- new
└── SampleLexer.tokens <-- new
Compiling the code to Java class
$ javac -cp /antlr-4.11.1-complete.jar:. ./SampleLexer.java
$ tree .
.
├── SampleLexer.class <-- new
├── SampleLexer.g4
├── SampleLexer.interp
├── SampleLexer.java
└── SampleLexer.tokens
2.4 Using the lexer in Kotlin
import org.antlr.v4.runtime.*
val input:CharStream = CharStreams.fromString("hello 123")
val lexer = SampleLexer(input)
val stream: CommonTokenStream = CommonTokenStream(lexer)
val tokens: List<Token> = stream.apply {
this.fill()
}.getTokens()
.joinToString("\n") tokens
[@0,0:4='hello',<4>,1:0]
[@1,5:5=' ',<1>,1:5]
[@2,6:8='123',<3>,1:6]
[@3,9:8='<EOF>',<-1>,1:9]
3 Kotlin + Java + ANTLR
3.1 Directory structure
.
├── Main.kt
├── Makefile
└── my
└── SampleLexer.g4
3.2 The Makefile
ANTLR = /antlr-4.11.1-complete.jar
CLASSPATH = $(ANTLR):.
lexer:
java -jar $(ANTLR) my/*.g4
app:
javac -cp $(CLASSPATH) my/*.java
kotlinc -cp $(CLASSPATH) ./Main.kt
run:
kotlin -cp $(CLASSPATH) MainKt
clean:
rm -f my/*.class my/*.interp my/*.tokens my/*.java
rm -rf ./META-INF ./*.class
3.3 The lexer file
lexer grammar SampleLexer;
@lexer::header {
package my;
}
WHITESPACE : [ \t]+;
NEWLINE : [\r\n]+;
NUMBER : [0-9]+;
WORD : [a-zA-Z]+;
The @lexer::header
allows us to inject code before the generated lexer class.
3.4 Generating Java source code
$ make lexer
This generates a Java source code that implements a lexer:
// Generated from java-escape by ANTLR 4.11.1
package my;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.TokenStream;
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.atn.*;
import org.antlr.v4.runtime.dfa.DFA;
import org.antlr.v4.runtime.misc.*;
@SuppressWarnings({"all", "warnings", "unchecked", "unused", "cast", "CheckReturnValue"})
public class SampleLexer extends Lexer {
static { RuntimeMetaData.checkVersion("4.11.1", RuntimeMetaData.VERSION); }
protected static final DFA[] _decisionToDFA;
protected static final PredictionContextCache _sharedContextCache =
new PredictionContextCache();
public static final int
=1, NEWLINE=2, NUMBER=3, WORD=4;
WHITESPACEpublic static String[] channelNames = {
"DEFAULT_TOKEN_CHANNEL", "HIDDEN"
};
...
}
3.5 The Kotlin application
import org.antlr.v4.runtime.*
import my.*
main() {
fun = CharStreams.fromString("hello 123")
val input = SampleLexer(input)
val lexer = CommonTokenStream(lexer)
val stream = stream.apply {
val tokens this.fill()
}.getTokens()
.joinToString("\n")
println(tokens)
}
3.6 ANTLR Testrig
- Rig: a platform.
- Test rig is a platform for testing.
org.antlr.v4.gui.TestRig
is an executable class that tests ANTLR grammars and lexers.
java -cp ... org.antlr.v4.gui.TestRig
java org.antlr.v4.gui.TestRig GrammarName startRuleName
[-tokens] [-tree] [-gui] [-ps file.ps] [-encoding encodingname]
[-trace] [-diagnostics] [-SLL]
[input-filename(s)]
Use startRuleName='tokens' if GrammarName is a lexer grammar.
Omitting input-filename makes rig read from stdin
We can use the testrig to try out a lexer on a given file.
$ java -cp $(CLASSPATH) \
org.antlr.v4.gui.TestRig \
my.SampleLexer \
tokens -tokens \
files/sample.txt
Here is a sample output:
[@0,0:4='hello',<WORD>,1:0]
[@1,5:5=' ',<WHITESPACE>,1:5]
[@2,6:8='123',<NUMBER>,1:6]
[@3,9:9='\n',<NEWLINE>,1:9]
[@4,10:14='world',<WORD>,2:0]
[@5,15:15='\n',<NEWLINE>,2:5]
[@6,16:18='456',<NUMBER>,3:0]
[@7,19:19='\n',<NEWLINE>,3:3]
[@8,20:19='<EOF>',<EOF>,4:0]