Regular Expressions
1 Regular expressions
We will consider the basic version of regular expressions.
Regular expressions are expressions in a language over alphabet \(\Sigma\).
- Base case: - If a symbol \(x\in\Sigma\), then \(x\) is a RE with the language \(L(x) = \{x\}\) 
- Concatenation: - If \(e_1\) and \(e_2\) are RE, then \(e_1 e_2\) is a RE with the language \(L(e_1e_2) = \{st: s\in L(e_1), t\in L(e_2)\}\). 
- Alternation: - If \(e_1\) and \(e_2\) are RE, then \(e_1 | e_2\) is a RE with the language \(L(e_1|e_2) = L(e_1)\cup L(e_2)\). 
- Optional: - If \(e\) is RE, then \(e?\) is a RE with \(L(e?) = L(e)\cup\{\epsilon\}\). 
- Repeatition: - If \(e\) is RE, then \(e*\) is a RE with \(L(e*) = \bigcup_{k=0}^{\infty}\{s_1s_2\dots s_k: s_i\in L(e)\}\) 
A language \(L\) is regular if there exists some RE \(e\) such that \(L = L(e)\).
All regular languages are denoted as \(\mathcal{L}(\mathbf{RE})\).
So far, we have presented three ways of defining languages:
- \(\mathcal{L}(\mathbf{NFA})\)
- \(\mathcal{L}(\mathbf{DFA})\)
- \(\mathcal{L}(\mathbf{RE})\)
It turns out that they are all equivalent.
2 Thompson’s Construction
Thompson’s construction is an algorithm that converts a RE to a NFA with the equivalent language.
\(N(a)\)

\(N(s|t)\)

\(N(st)\)

\(N(s*)\)

\[ \mathcal{L}(\mathbf{RE}) = \mathcal{L}(\mathbf{NFA}) = \mathcal{L}(\mathbf{DFA}) \]
3 Limitations of Regular Languages
Consider a family of languages:
- \(L_0 = \{\epsilon\}\) 
- \(L_1 = \{01\} \cup L_0\) 
- \(L_2 = \{0011\} \cup L_1\) 
- \(L_3 = \{000111\} \cup L_2\) 
- \(\vdots\) 
- \(L_n = \{\underbrace{0\dots 0}_{n}\ \underbrace{1\dots 1}_{n}\} \cup L_{n-1}\) 
Claim:
For any specific \(n\), \(L_n\) is regular. Here is the NFA for \(L_3\).
Observe however, as \(n\) increases, \(\mathbf{NFA}(L_n)\) requires more states.
Now, consider the language \(L_\infty\). This is the language containing any string with 0’s followed by 1’s with the same number of 0’s and 1’s.
\[ L_\infty\not\in\mathcal{L}(\mathbf{NFA}) \]
By the theorem, \(L_\infty\) is not RE.
4 RE, NFA, DFA and Anltr
- We define lexical rules in Antlr using RE. 
- Antlr converts all lexical rules into a single combined NFA for efficiency. 
- Antlr converts the combined NFA into a DFA for memory efficiency. 
- Lexical analysis in Antlr is done by running the DFA that corresponds to all the user-defined token patterns as regular expressions.