Advanced lexical analysis
1 About
We will demonstrate several advanced features of ANTLR lexer rules.
- A more structured way to build patterns - We will use fragments to define reusable patterns.
- Channels - Tokens can be streamed through different channels. So far, we have only worked with the default channel when reading from CommonTokenStream. We will see another option: the skip command, which discards a token instead of sending it to any channel.
- Modes - We will examine the need to switch context during lexical analysis. This is usually caused by island languages: regions of the input that follow different lexical rules from the surrounding text.
2 A Case Study
We are going to build a calculator language that supports:
- different ways to specify numbers.
- comments
2.1 An example
// income is 45.50 per hour.
45.5
// expense is $900, but it's negative.
-900
// count of students in my class
42
// HTML color for red
xff0000
3 The Lexical Rules
3.1 Number specification
3.1.1 Digits
fragment DIGIT     : '0' .. '9';
fragment HEX_DIGIT : ('a' .. 'f') | ('A' .. 'F') | DIGIT;
Note: we can reuse DIGIT in the definition of HEX_DIGIT.
3.1.2 Numeric values
INTEGER : '-'? ('1' .. '9') DIGIT*;
HEX     : 'x' HEX_DIGIT+;
DECIMAL : (INTEGER '.') 
        | (INTEGER '.' DIGIT+)
        ;
DIGIT is reused in several places.
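To see how the fragments compose, the rules above can be mirrored with ordinary regular expressions. The following is a minimal Python sketch, not ANTLR itself: the classify helper and the TOKEN_RULES table are hypothetical names introduced only for illustration, and the sketch checks whole strings rather than scanning a character stream.

```python
import re

# Fragments: reusable pieces that never become tokens on their own.
DIGIT = r"[0-9]"
HEX_DIGIT = rf"(?:[a-f]|[A-F]|{DIGIT})"  # reuses DIGIT

# Token rules built from the fragments, in the same order as the grammar.
INTEGER = rf"-?[1-9]{DIGIT}*"
HEX = rf"x{HEX_DIGIT}+"
DECIMAL = rf"(?:{INTEGER}\.|{INTEGER}\.{DIGIT}+)"

# Hypothetical table pairing each rule name with its pattern.
TOKEN_RULES = [("INTEGER", INTEGER), ("HEX", HEX), ("DECIMAL", DECIMAL)]


def classify(text):
    """Return the name of the first rule that matches all of `text`, else None."""
    for name, pattern in TOKEN_RULES:
        if re.fullmatch(pattern, text):
            return name
    return None
```

With these definitions, classify("45.5") gives "DECIMAL", classify("-900") gives "INTEGER", and classify("xff0000") gives "HEX". Note that classify("0") returns None: as written, INTEGER must start with a digit from 1 to 9, so values such as 0 or 0.5 are not covered by these rules.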
3.2 More formatting
We also need to handle whitespace and comments.
3.2.1 Whitespace
WHITESPACE : (' ' | '\t' | '\r' | '\n') ;
Note, however, that we don't actually care about whitespace, so we can ignore all tokens of type WHITESPACE. ANTLR lets us do this with the skip lexer command, which discards the matched text instead of emitting a token.
WHITESPACE : (' ' | '\t' | '\r' | '\n') -> skip;
3.2.3 Entering comment mode
COMMENT: '//' -> pushMode(MODE_COMMENT);
3.2.4 Comment mode
mode MODE_COMMENT;
WHITESPACE_COMMENT : (' ' | '\t') -> skip;
COMMENT_WORD       : ~('\n' | ' ' | '\t')+;
END_COMMENT        : '\n' -> popMode;
4 Testing it
4.1 Prepare the lexer class
$ java -jar /antlr-4.11.1-complete.jar *.g4
$ javac -cp /antlr-4.11.1-complete.jar:. *.java
4.2 Input file
// income is 45.50 per hour.
45.5
// expense is $900, but it's negative.
-900
// count of students in my class
42
// HTML color for red
xff0000
4.3 Test the lexer with TestRig
$ java -cp /antlr-4.11.1-complete.jar:$PWD \
       org.antlr.v4.gui.TestRig \
       MyLexer \
       tokens -tokens sample.txt
The output is:
[@0,0:1='//',<'//'>,1:0]                      enters MODE_COMMENT
[@1,3:8='income',<COMMENT_WORD>,1:3]
[@2,10:11='is',<COMMENT_WORD>,1:10]
[@3,13:17='45.50',<COMMENT_WORD>,1:13]
[@4,19:21='per',<COMMENT_WORD>,1:19]
[@5,23:27='hour.',<COMMENT_WORD>,1:23]
[@6,28:28='\n',<'\n'>,1:28]                   exits MODE_COMMENT
[@7,29:32='45.5',<DECIMAL>,2:0]               enters DEFAULT_MODE
[@8,35:36='//',<'//'>,4:0]
[@9,38:44='expense',<COMMENT_WORD>,4:3]
[@10,46:47='is',<COMMENT_WORD>,4:11]
[@11,49:53='$900,',<COMMENT_WORD>,4:14]
[@12,55:57='but',<COMMENT_WORD>,4:20]
[@13,59:62='it's',<COMMENT_WORD>,4:24]
[@14,64:72='negative.',<COMMENT_WORD>,4:29]
[@15,73:73='\n',<'\n'>,4:38]
[@16,74:77='-900',<INTEGER>,5:0]
[@17,80:81='//',<'//'>,7:0]
[@18,83:87='count',<COMMENT_WORD>,7:3]
[@19,89:90='of',<COMMENT_WORD>,7:9]
[@20,92:99='students',<COMMENT_WORD>,7:12]
[@21,101:102='in',<COMMENT_WORD>,7:21]
[@22,104:105='my',<COMMENT_WORD>,7:24]
[@23,107:111='class',<COMMENT_WORD>,7:27]
[@24,112:112='\n',<'\n'>,7:32]
[@25,113:114='42',<INTEGER>,8:0]
[@26,117:118='//',<'//'>,10:0]
[@27,120:123='HTML',<COMMENT_WORD>,10:3]
[@28,125:129='color',<COMMENT_WORD>,10:8]
[@29,131:133='for',<COMMENT_WORD>,10:14]
[@30,135:137='red',<COMMENT_WORD>,10:18]
[@31,138:138='\n',<'\n'>,10:21]
[@32,139:145='xff0000',<HEX>,11:0]
[@33,146:145='<EOF>',<EOF>,11:7]
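Each line of this dump follows the layout produced by the toString() method of ANTLR's CommonToken: [@index,start:stop='text',<type>,line:column], where start and stop are character offsets, lines are counted from 1, and columns from 0. As a small aid for reading or post-processing such output, here is a Python sketch that splits one line into its fields. TokenLine and parse_token_line are hypothetical names, and trailing annotations such as "enters MODE_COMMENT" are simply ignored.

```python
import re
from typing import NamedTuple, Optional


class TokenLine(NamedTuple):
    """One line of the token dump, split into its fields."""
    index: int
    start: int
    stop: int
    text: str
    type_name: str
    line: int
    column: int


# The optional "channel=N," part only appears for non-default channels.
TOKEN_LINE = re.compile(
    r"\[@(\d+),(\d+):(\d+)='(.*)',<(.*?)>,(?:channel=\d+,)?(\d+):(\d+)\]"
)


def parse_token_line(raw: str) -> Optional[TokenLine]:
    """Parse one dump line; return None if it does not look like a token."""
    m = TOKEN_LINE.match(raw)
    if m is None:
        return None
    index, start, stop, text, type_name, line, column = m.groups()
    return TokenLine(int(index), int(start), int(stop), text,
                     type_name, int(line), int(column))
```

For example, parse_token_line("[@16,74:77='-900',<INTEGER>,5:0]") yields a TokenLine whose text is "-900" and whose type_name is "INTEGER".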
3.2.2 Comments and mode-based analysis
Comments are challenging because the comment section of the code is an island. By definition, the inside of a comment is disconnected from the rest of the code, and thus no token of the program proper should come from a comment section.
This can be achieved by mode-based lexical analysis.
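The lexer is always in exactly one mode.
- -> pushMode(NEW_MODE) will switch the mode to NEW_MODE.
- -> popMode restores the mode to the previous mode.
The lexer rules are organized as follows: the rule that enters comment mode is given in section 3.2.3, and the rules of the comment mode itself in section 3.2.4.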
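The pushMode and popMode behaviour can be imitated outside ANTLR with an explicit stack of modes. The following is a minimal Python sketch, not ANTLR-generated code: the RULES table and the tokenize function are hypothetical, the regular expressions are hand-translated from the rules in this document, and the comment word rule is named COMMENT_WORD as in the TestRig output. Like ANTLR, the sketch prefers the longest match and breaks ties in favour of the earlier rule.

```python
import re

# Rules per mode: (token name, regex, action). Actions mirror the lexer
# commands used in this document: "skip", ("push", mode), "pop", or None.
RULES = {
    "DEFAULT_MODE": [
        ("INTEGER", r"-?[1-9][0-9]*", None),
        ("HEX", r"x[0-9a-fA-F]+", None),
        ("DECIMAL", r"-?[1-9][0-9]*\.[0-9]*", None),
        ("WHITESPACE", r"[ \t\r\n]", "skip"),
        ("COMMENT", r"//", ("push", "MODE_COMMENT")),
    ],
    "MODE_COMMENT": [
        ("WHITESPACE_COMMENT", r"[ \t]", "skip"),
        ("COMMENT_WORD", r"[^\n \t]+", None),
        ("END_COMMENT", r"\n", "pop"),
    ],
}


def tokenize(text):
    """Return (token name, text) pairs, switching rule sets via a mode stack."""
    stack = ["DEFAULT_MODE"]
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern, action in RULES[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[0].end()):
                best = (m, name, action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        m, name, action = best
        pos = m.end()
        if action == "skip":
            continue                 # matched, but no token is emitted
        tokens.append((name, m.group()))
        if action == "pop":
            stack.pop()              # back to the previous mode
        elif action is not None:
            stack.append(action[1])  # ("push", mode): enter the new mode
    return tokens
```

Calling tokenize("// count of students\n42\n") produces a COMMENT token, three COMMENT_WORD tokens, an END_COMMENT token, and finally an INTEGER token for 42. The number-like text 45.50 inside a comment would come out as a COMMENT_WORD, because the number rules are not active while the stack top is MODE_COMMENT.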