Advanced lexical analysis
1 About
We will demonstrate several advanced features of ANTLR lexer rules.
More structured way to build patterns
We will make use of fragments to have reusable patterns.
Channels
Tokens can be streamed through different channels. So far, we have only worked with the default channel when reading from CommonTokenStream. We will see another channel: the skip channel.
Modes
We will examine the need to switch context during lexical analysis. This is usually due to a phenomenon known as island language.
2 A Case Study
We are going to build a calculator language that supports:
- different ways to specify numbers.
- comments
2.1 An example
// income is 45.50 per hour.
45.5
// expense is $900, but it's negative.
-900
// count of students in my class
42
// HTML color for red
xff0000
3 The Lexical Rules
3.1 Number specification
3.1.1 Digits
fragment DIGIT : '0' .. '9';
fragment HEX_DIGIT : ('a' .. 'f') | ('A' .. 'F') | DIGIT;
Note: we can reuse DIGIT in the definition of HEX_DIGIT.
3.1.2 Numeric values
INTEGER : '-'? ('1' .. '9') DIGIT*;
HEX : 'x' HEX_DIGIT+;
DECIMAL : (INTEGER '.') | (INTEGER '.' DIGIT+);
DIGIT is reused in several places.
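To check which sample inputs each rule accepts, the rules above can be approximated with regular expressions. The following is a minimal sketch in plain Java, not ANTLR-generated code. The class name NumberPatterns and the method classify are hypothetical, and requiring each pattern to match the whole string is only a stand-in for the lexer choosing the longest match.

```java
import java.util.regex.Pattern;

// Regex approximations of the lexer rules; the names mirror the grammar.
// This is an illustration only, not code produced by ANTLR.
final class NumberPatterns {
    // The two fragments become reusable string pieces.
    static final String DIGIT = "[0-9]";
    static final String HEX_DIGIT = "(?:[a-f]|[A-F]|" + DIGIT + ")";

    // INTEGER : '-'? ('1' .. '9') DIGIT*;
    static final String INTEGER = "-?[1-9]" + DIGIT + "*";

    static final Pattern INTEGER_P = Pattern.compile(INTEGER);
    // HEX : 'x' HEX_DIGIT+;
    static final Pattern HEX_P = Pattern.compile("x" + HEX_DIGIT + "+");
    // DECIMAL : (INTEGER '.') | (INTEGER '.' DIGIT+);
    static final Pattern DECIMAL_P =
            Pattern.compile("(?:" + INTEGER + ")\\.(?:" + DIGIT + "+)?");

    // Returns the name of the rule that matches the whole text, or "NONE".
    static String classify(String text) {
        if (DECIMAL_P.matcher(text).matches()) return "DECIMAL";
        if (INTEGER_P.matcher(text).matches()) return "INTEGER";
        if (HEX_P.matcher(text).matches()) return "HEX";
        return "NONE";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"45.5", "-900", "42", "xff0000", "0"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Under these rules, 45.5 is classified as DECIMAL, -900 and 42 as INTEGER, and xff0000 as HEX. The input 0 is not matched at all, because INTEGER must start with a digit from 1 to 9.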
3.2 More formatting
We need to incorporate whitespace and comments.
3.2.1 Whitespace
WHITESPACE : (' ' | '\t' | '\r' | '\n') ;
But note that we actually don’t care about whitespace. This means that we can ignore all tokens of the type WHITESPACE. ANTLR allows us to do this by directing tokens to the skip channel: the -> skip command discards the token, so it never reaches the token stream.
WHITESPACE : (' ' | '\t' | '\r' | '\n') -> skip;
3.2.3 Entering comment mode
COMMENT : '//' -> pushMode(MODE_COMMENT);
3.2.4 Comment mode
mode MODE_COMMENT;
WHITESPACE_COMMENT : (' ' | '\t') -> skip;
WORD_COMMENT : ~('\n' | ' ' | '\t')+;
END_COMMENT : '\n' -> popMode;
4 Testing it
4.1 Prepare the lexer class
$ java -jar /antlr-4.11.1-complete.jar *.g4
$ javac -cp /antlr-4.11.1-complete.jar:. *.java
4.2 Input file
// income is 45.50 per hour.
45.5
// expense is $900, but it's negative.
-900
// count of students in my class
42
// HTML color for red
xff0000
4.3 Test the lexer with TestRig
$ java -cp /antlr-4.11.1-complete.jar:$PWD \
org.antlr.v4.gui.TestRig \
MyLexer tokens -tokens sample.txt
The output is:
[@0,0:1='//',<'//'>,1:0] enters MODE_COMMENT
[@1,3:8='income',<COMMENT_WORD>,1:3]
[@2,10:11='is',<COMMENT_WORD>,1:10]
[@3,13:17='45.50',<COMMENT_WORD>,1:13]
[@4,19:21='per',<COMMENT_WORD>,1:19]
[@5,23:27='hour.',<COMMENT_WORD>,1:23]
[@6,28:28='\n',<'\n'>,1:28] exits MODE_COMMENT
[@7,29:32='45.5',<DECIMAL>,2:0] enters DEFAULT_MODE
[@8,35:36='//',<'//'>,4:0]
[@9,38:44='expense',<COMMENT_WORD>,4:3]
[@10,46:47='is',<COMMENT_WORD>,4:11]
[@11,49:53='$900,',<COMMENT_WORD>,4:14]
[@12,55:57='but',<COMMENT_WORD>,4:20]
[@13,59:62='it's',<COMMENT_WORD>,4:24]
[@14,64:72='negative.',<COMMENT_WORD>,4:29]
[@15,73:73='\n',<'\n'>,4:38]
[@16,74:77='-900',<INTEGER>,5:0]
[@17,80:81='//',<'//'>,7:0]
[@18,83:87='count',<COMMENT_WORD>,7:3]
[@19,89:90='of',<COMMENT_WORD>,7:9]
[@20,92:99='students',<COMMENT_WORD>,7:12]
[@21,101:102='in',<COMMENT_WORD>,7:21]
[@22,104:105='my',<COMMENT_WORD>,7:24]
[@23,107:111='class',<COMMENT_WORD>,7:27]
[@24,112:112='\n',<'\n'>,7:32]
[@25,113:114='42',<INTEGER>,8:0]
[@26,117:118='//',<'//'>,10:0]
[@27,120:123='HTML',<COMMENT_WORD>,10:3]
[@28,125:129='color',<COMMENT_WORD>,10:8]
[@29,131:133='for',<COMMENT_WORD>,10:14]
[@30,135:137='red',<COMMENT_WORD>,10:18]
[@31,138:138='\n',<'\n'>,10:21]
[@32,139:145='xff0000',<HEX>,11:0]
[@33,146:145='<EOF>',<EOF>,11:7]
3.2.2 Comments and mode-based analysis
Comments are challenging because the comment section of the code is an island. By definition, the inside of a comment is disconnected from the rest of the code, so no token of the program should come from a comment section.
This can be achieved by mode-based lexical analysis, using the mode declaration and two lexer commands.
-> pushMode(NEW_MODE) will switch the mode to NEW_MODE.
-> popMode restores the mode to the previous mode.
The lexer rules are organized as described in sections 3.2.3 and 3.2.4: one rule in the default mode enters the comment mode, and the rules declared under the comment mode handle its contents.
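The effect of pushMode and popMode can be pictured as a stack of modes. The sketch below is a hand-written Java illustration of that idea under simplifying assumptions, not what ANTLR generates. The class ModeStackDemo, the method tokenize, and the catch-all CODE token (which lumps together any run of non-whitespace text in the default mode) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hand-written illustration of a lexer mode stack; not ANTLR-generated code.
final class ModeStackDemo {
    enum Mode { DEFAULT_MODE, MODE_COMMENT }

    // True if c is one of the characters in set.
    static boolean isOneOf(char c, String set) {
        return set.indexOf(c) >= 0;
    }

    // Index of the first character at or after start that is in stopSet,
    // or the input length if there is none.
    static int scanUntil(String input, int start, String stopSet) {
        int i = start;
        while (i < input.length() && !isOneOf(input.charAt(i), stopSet)) i++;
        return i;
    }

    static List<String> tokenize(String input) {
        Deque<Mode> stack = new ArrayDeque<>();   // previously active modes
        Mode mode = Mode.DEFAULT_MODE;            // currently active mode
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (mode == Mode.DEFAULT_MODE) {
                if (input.startsWith("//", i)) {
                    tokens.add("COMMENT");
                    stack.push(mode);             // like -> pushMode(MODE_COMMENT)
                    mode = Mode.MODE_COMMENT;
                    i += 2;
                } else if (isOneOf(c, " \t\r\n")) {
                    i++;                          // like -> skip
                } else {
                    int end = scanUntil(input, i, " \t\r\n");
                    tokens.add("CODE(" + input.substring(i, end) + ")");
                    i = end;
                }
            } else {                              // MODE_COMMENT
                if (c == '\n') {
                    tokens.add("END_COMMENT");
                    mode = stack.pop();           // like -> popMode
                    i++;
                } else if (isOneOf(c, " \t")) {
                    i++;                          // like -> skip
                } else {
                    int end = scanUntil(input, i, "\n \t");
                    tokens.add("WORD_COMMENT(" + input.substring(i, end) + ")");
                    i = end;
                }
            }
        }
        return tokens;
    }
}
```

For the input "// income is 45.50\n45.5\n", tokenize yields COMMENT, WORD_COMMENT(income), WORD_COMMENT(is), WORD_COMMENT(45.50), END_COMMENT, CODE(45.5). The number inside the comment is treated as a plain word, while the one after the comment is lexed in the default mode again.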