Advanced lexical analysis

Author

Ken Pu

1 About

We will demonstrate several advanced features of ANTLR lexer rules.

  • A more structured way to build patterns

    We will make use of fragments to have reusable patterns.

  • Channels

    Tokens can be streamed through different channels. So far, we have only worked with the default channel when reading from CommonTokenStream. We will see another destination for tokens: skip, which discards a token so that it never reaches the token stream.

  • Modes

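    We will examine the need to switch context during lexical analysis. This is usually due to a phenomenon known as an island language: a region of the input, such as a comment, that follows different lexical rules from the text around it.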

2 A Case Study

We are going to build a calculator language that supports:

  • different ways to specify numbers
  • comments

2.1 An example

// income is 45.50 per hour.
45.5

// expense is $900, but it's negative.
-900

// count of students in my class
42

// HTML color for red
xff0000

3 The Lexical Rules

3.1 Number specification

3.1.1 Digits

fragment DIGIT     : '0' .. '9';
fragment HEX_DIGIT : ('a' .. 'f') | ('A' .. 'F') | DIGIT;

Note: we can reuse DIGIT in the definition of HEX_DIGIT.

3.1.2 Numeric values

INTEGER : '-'? ('1' .. '9') DIGIT*;

HEX     : 'x' HEX_DIGIT+;

DECIMAL : (INTEGER '.') 
        | (INTEGER '.' DIGIT+)
        ;

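DIGIT is reused in several places, and DECIMAL in turn reuses INTEGER. Note that INTEGER, as written, requires a nonzero leading digit, so inputs such as 0 or 0.5 are not matched by these rules.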
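
To see what the fragments buy us, here is a sketch of the same INTEGER and HEX rules with DIGIT and HEX_DIGIT expanded by hand. It is only an illustration and is not meant to replace the rules above. A fragment never produces a token of its own; it just names a sub-pattern that other lexer rules can include.

```antlr
// Same patterns as INTEGER and HEX above, with the fragments written out inline.
INTEGER : '-'? ('1' .. '9') ('0' .. '9')*;
HEX     : 'x' (('a' .. 'f') | ('A' .. 'F') | ('0' .. '9'))+;
```

The digit range now appears three times, so a change to what counts as a digit would have to be made in every rule.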

3.2 More formatting

We need to incorporate whitespace and comments.

3.2.1 Whitespace

WHITESPACE : (' ' | '\t' | '\r' | '\n') ;

But note that we actually don’t care about whitespace. This means that we can ignore all tokens of the type WHITESPACE. ANTLR allows us to do this with the skip command, which discards the matched token instead of emitting it.

WHITESPACE : (' ' | '\t' | '\r' | '\n') -> skip;
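
With skip, the whitespace token is thrown away and nothing downstream can see it. If a later tool still needs the whitespace (a pretty-printer, say), a variant is to keep the token but route it off the default channel. The following is a minimal sketch of that alternative, not the rule this case study uses. It would replace the WHITESPACE rule above rather than sit next to it.

```antlr
// Alternative to skip: the token is still created, but on the hidden channel.
WHITESPACE : (' ' | '\t' | '\r' | '\n') -> channel(HIDDEN);
```

A parser reading from CommonTokenStream only consumes tokens on the default channel, so it ignores these tokens in the same way, while they remain available in the token stream.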

3.2.2 Comments and mode-based analysis

Comments are challenging because the comment section of the code is an island. By definition, the inside of a comment is disconnected from the rest of the code, so no token of the program proper should come from inside a comment.

This can be achieved by mode-based lexical analysis.

  • The lexer will maintain a state, called mode.
  • There is a default mode that the lexer starts with.
  • -> pushMode(NEW_MODE) remembers the current mode on a stack and switches the mode to NEW_MODE.
  • -> popMode pops that stack, restoring the mode to the previous mode.

The lexer rules are organized as:

// rules for default mode
RULE : pattern;
RULE : pattern;
...

mode MODE1;      \
                  |
RULE: pattern;    | these rules are only used
RULE: pattern;    | in MODE1;
...              /

mode MODE2;

RULE: pattern;
RULE: pattern;
...

3.2.3 Entering comment mode

COMMENT: '//' -> pushMode(MODE_COMMENT);

3.2.4 Comment mode

mode MODE_COMMENT;

WHITESPACE_COMMENT : (' ' | '\t') -> skip;
COMMENT_WORD       : ~('\n' | ' ' | '\t')+;
END_COMMENT        : '\n' -> popMode;
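
Putting the pieces of section 3 together, the grammar file might look like the sketch below. Two things here are assumptions rather than something stated above: the grammar name MyLexer (and hence the file name MyLexer.g4) is inferred from the TestRig command in section 4.3, and the comment-word rule is spelled COMMENT_WORD, the name that appears in the TestRig output there. The header matters because ANTLR only allows mode declarations in a lexer grammar, not in a combined grammar.

```antlr
// MyLexer.g4 (file and grammar name assumed from the TestRig command)
lexer grammar MyLexer;

// ---- default mode: numbers, whitespace, and the start of a comment ----
INTEGER : '-'? ('1' .. '9') DIGIT*;
HEX     : 'x' HEX_DIGIT+;
DECIMAL : (INTEGER '.')
        | (INTEGER '.' DIGIT+)
        ;

WHITESPACE : (' ' | '\t' | '\r' | '\n') -> skip;

// '//' switches the lexer into comment mode.
COMMENT : '//' -> pushMode(MODE_COMMENT);

// Fragments only name sub-patterns; they never become tokens.
fragment DIGIT     : '0' .. '9';
fragment HEX_DIGIT : ('a' .. 'f') | ('A' .. 'F') | DIGIT;

// ---- comment mode: active until the end of the line ----
mode MODE_COMMENT;

WHITESPACE_COMMENT : (' ' | '\t') -> skip;
COMMENT_WORD       : ~('\n' | ' ' | '\t')+;
END_COMMENT        : '\n' -> popMode;
```

Every rule after the mode MODE_COMMENT; line belongs to that mode, which is why the default-mode rules, including the fragments they use, are placed before it.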

4 Testing it

4.1 Prepare the lexer class

$ java -jar /antlr-4.11.1-complete.jar *.g4
$ javac -cp /antlr-4.11.1-complete.jar:. *.java

4.2 Input file

sample.txt
// income is 45.50 per hour.
45.5

// expense is $900, but it's negative.
-900

// count of students in my class
42

// HTML color for red
xff0000

4.3 Test the lexer with TestRig

$ java -cp /antlr-4.11.1-complete.jar:$PWD \
       org.antlr.v4.gui.TestRig \
       MyLexer \
       tokens -tokens sample.txt

The output is:

[@0,0:1='//',<'//'>,1:0]                      enters MODE_COMMENT
[@1,3:8='income',<COMMENT_WORD>,1:3]
[@2,10:11='is',<COMMENT_WORD>,1:10]
[@3,13:17='45.50',<COMMENT_WORD>,1:13]
[@4,19:21='per',<COMMENT_WORD>,1:19]
[@5,23:27='hour.',<COMMENT_WORD>,1:23]
[@6,28:28='\n',<'\n'>,1:28]                   exits MODE_COMMENT
[@7,29:32='45.5',<DECIMAL>,2:0]               enters DEFAULT_MODE
[@8,35:36='//',<'//'>,4:0]
[@9,38:44='expense',<COMMENT_WORD>,4:3]
[@10,46:47='is',<COMMENT_WORD>,4:11]
[@11,49:53='$900,',<COMMENT_WORD>,4:14]
[@12,55:57='but',<COMMENT_WORD>,4:20]
[@13,59:62='it's',<COMMENT_WORD>,4:24]
[@14,64:72='negative.',<COMMENT_WORD>,4:29]
[@15,73:73='\n',<'\n'>,4:38]
[@16,74:77='-900',<INTEGER>,5:0]
[@17,80:81='//',<'//'>,7:0]
[@18,83:87='count',<COMMENT_WORD>,7:3]
[@19,89:90='of',<COMMENT_WORD>,7:9]
[@20,92:99='students',<COMMENT_WORD>,7:12]
[@21,101:102='in',<COMMENT_WORD>,7:21]
[@22,104:105='my',<COMMENT_WORD>,7:24]
[@23,107:111='class',<COMMENT_WORD>,7:27]
[@24,112:112='\n',<'\n'>,7:32]
[@25,113:114='42',<INTEGER>,8:0]
[@26,117:118='//',<'//'>,10:0]
[@27,120:123='HTML',<COMMENT_WORD>,10:3]
[@28,125:129='color',<COMMENT_WORD>,10:8]
[@29,131:133='for',<COMMENT_WORD>,10:14]
[@30,135:137='red',<COMMENT_WORD>,10:18]
[@31,138:138='\n',<'\n'>,10:21]
[@32,139:145='xff0000',<HEX>,11:0]
[@33,146:145='<EOF>',<EOF>,11:7]