Structured Lexical Analysis With ANTLR

Author

Ken Pu

1 Introduction to ANTLR

1.1 What is ANTLR?

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It’s widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build and walk parse trees.

Terence Parr, the author of ANTLR

1.2 Lexical analysis using ANTLR

ANTLR can help us with the construction of a lexer class. It requires a lexer grammar file .g4, which is converted to a Java class.

2 Example

2.1 The grammar file

   // SampleLexer.g4
   lexer grammar SampleLexer;

   WHITESPACE    : [ \t]+;
   NEWLINE       : [\r\n]+;
   NUMBER        : [0-9]+;
   WORD          : [a-zA-Z]+;

2.2 A slightly different syntax

WHITESPACE : ' ' | '\t';
NEWLINE : '\r' | '\n';
NUMBER: ' ('0' .. '9')+;
WORD: (('a' .. 'z') | ('A' .. 'Z'))+;

2.3 ANTLR Toolchain

$ java -jar /antlr-4.11.1-complete.jar 
ANTLR Parser Generator  Version 4.11.1
 -o ___              specify output directory where all output is generated
 -lib ___            specify location of grammars, tokens files
 -atn                generate rule augmented transition network diagrams
 -encoding ___       specify grammar file encoding; e.g., euc-jp
 -message-format ___ specify output style for messages in antlr, gnu, vs2005
 -long-messages      show exception details when available for errors and warnings
 -listener           generate parse tree listener (default)
 -no-listener        don't generate parse tree listener
 -visitor            generate parse tree visitor
 -no-visitor         don't generate parse tree visitor (default)
 -package ___        specify a package/namespace for the generated code
 -depend             generate file dependencies
 -D<option>=value    set/override a grammar-level option
 -Werror             treat warnings as errors
 -XdbgST             launch StringTemplate visualizer on generated code
 -XdbgSTWait         wait for STViz to close before continuing
 -Xforce-atn         use the ATN simulator for all predictions
 -Xlog               dump lots of logging info to antlr-timestamp.log
 -Xexact-output-dir  all output goes into -o dir regardless of paths/package

Let’s generate the lexer Java class.

$ java -jar /antlr-4.11.1-complete.jar ./SampleLexer.g4
$ tree .
.
├── SampleLexer.g4
├── SampleLexer.interp    <-- new
├── SampleLexer.java      <-- new
└── SampleLexer.tokens    <-- new

Compiling the code to Java class

$ javac -cp /antlr-4.11.1-complete.jar:. ./SampleLexer.java 
$ tree .
.
├── SampleLexer.class    <-- new
├── SampleLexer.g4
├── SampleLexer.interp
├── SampleLexer.java
└── SampleLexer.tokens

2.4 Using the lexer in Kotlin

@file:DependsOn("/antlr-4.11.1-complete.jar")
@file:DependsOn(".")

import org.antlr.v4.runtime.*

val input:CharStream = CharStreams.fromString("hello 123")

val lexer = SampleLexer(input)

val stream: CommonTokenStream = CommonTokenStream(lexer)

val tokens: List<Token> = stream.apply {
    this.fill()
}.getTokens()

tokens.joinToString("\n")

[@0,0:4='hello',<4>,1:0]
[@1,5:5=' ',<1>,1:5]
[@2,6:8='123',<3>,1:6]
[@3,9:8='<EOF>',<-1>,1:9]

3 Kotlin + Java + ANTLR

3.1 Directory structure

.
├── Main.kt
├── Makefile
└── my
    └── SampleLexer.g4

3.2 The Makefile

ANTLR = /antlr-4.11.1-complete.jar
CLASSPATH = $(ANTLR):.

lexer:
        java -jar $(ANTLR) my/*.g4

app:
        javac -cp $(CLASSPATH) my/*.java
        kotlinc -cp $(CLASSPATH) ./Main.kt

run:
        kotlin -cp $(CLASSPATH) MainKt

clean:
        rm -f my/*.class my/*.interp my/*.tokens my/*.java
        rm -rf ./META-INF ./*.class

3.3 The lexer file

lexer grammar SampleLexer;

@lexer::header {
  package my;
}

WHITESPACE    : [ \t]+;
NEWLINE       : [\r\n]+;
NUMBER        : [0-9]+;
WORD          : [a-zA-Z]+;

The @lexer::header allows us to inject code before the generated lexer class.

3.4 Generating Java source code

$ make lexer

This generates a Java source code that implements a lexer:

// Generated from java-escape by ANTLR 4.11.1

  package my;

import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.TokenStream;
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.atn.*;
import org.antlr.v4.runtime.dfa.DFA;
import org.antlr.v4.runtime.misc.*;

@SuppressWarnings({"all", "warnings", "unchecked", "unused", "cast", "CheckReturnValue"})
public class SampleLexer extends Lexer {
    static { RuntimeMetaData.checkVersion("4.11.1", RuntimeMetaData.VERSION); }

    protected static final DFA[] _decisionToDFA;
    protected static final PredictionContextCache _sharedContextCache =
        new PredictionContextCache();
    public static final int
        WHITESPACE=1, NEWLINE=2, NUMBER=3, WORD=4;
    public static String[] channelNames = {
        "DEFAULT_TOKEN_CHANNEL", "HIDDEN"
    };
    ...
}

3.5 The Kotlin application

import org.antlr.v4.runtime.*
import my.*

fun main() {
  val input = CharStreams.fromString("hello 123")
  val lexer = SampleLexer(input)
  val stream = CommonTokenStream(lexer)
  val tokens = stream.apply {
    this.fill()
  }.getTokens()
  .joinToString("\n")
  println(tokens)
}

3.6 ANTLR Testrig

Rig: a platform.
Test rig is a platform for testing.
org.antlr.v4.gui.TestRig is an executable class that tests ANTLR grammars and lexers.

java -cp ... org.antlr.v4.gui.TestRig

java org.antlr.v4.gui.TestRig GrammarName startRuleName
    [-tokens] [-tree] [-gui] [-ps file.ps] [-encoding encodingname]
    [-trace] [-diagnostics] [-SLL]
    [input-filename(s)]
    Use startRuleName='tokens' if GrammarName is a lexer grammar.
    Omitting input-filename makes rig read from stdin

We can use the testrig to try out a lexer on a given file.

$ java -cp $(CLASSPATH) \
    org.antlr.v4.gui.TestRig \
    my.SampleLexer \
    tokens -tokens \
    files/sample.txt

Here is a sample output:

[@0,0:4='hello',<WORD>,1:0]
[@1,5:5=' ',<WHITESPACE>,1:5]
[@2,6:8='123',<NUMBER>,1:6]
[@3,9:9='\n',<NEWLINE>,1:9]
[@4,10:14='world',<WORD>,2:0]
[@5,15:15='\n',<NEWLINE>,2:5]
[@6,16:18='456',<NUMBER>,3:0]
[@7,19:19='\n',<NEWLINE>,3:3]
[@8,20:19='<EOF>',<EOF>,4:0]