Grammar Tranformations
1 Some properties of a grammar
1.1 Left recursion
Consider a grammar G. We will define a number of properties of G. These properties will be important when we build a parser.
1.2 Some examples of left recursion
1.2.1 Immediate left recursion
Consider the grammar for arithmetic expressions:
expr : expr + expr
| expr * expr
| Number
;
This is immedidate left recursive because there are two left-recursive productions:
expr : expr + expr
expr : expr * expr
1.2.2 (Non-immediate) left recursion
Consider the following:
\[ \begin{eqnarray*} A &:& B \\ B &:& Aa \end{eqnarray*} \] where \(\Sigma_0 = \{a\}\) and \(\Sigma_1=\{A, B\}\).
This grammar does not have any left-recursive productions, and yet is left recursive. We can show this through derivations of either \(A\) or \(B\).
- \(A\rightarrow B\rightarrow Aa\)
- \(B\rightarrow Aa\rightarrow Ba\)
1.3 Left factored grammars
Let’s first define some basic terminologies:
Note:
By definition, the empty string \(\epsilon\in\mathbf{pre}(\beta)\) for all possible strings \(\beta\in\Sigma^*\).
Example:
Given a string \(\beta = abc\), it has prefixes:
\[\mathbf{pre} = \{\epsilon, a, ab, abc\}\]
There are three non-empty prefixes:
\[\{a, ab, abc\}\]
Example
Consider the ANTLR grammar for arithmetics:
expr: expr + expr | expr * expr | Number ;
We see that this grammar is not left-factored because the two alternations for
expr
share a non-empty common prefix:expr: expr + expr expr: expr * expr
We have:
pre(expr + expr, expr * expr) = { expr }
1.4 Significance
We have defined two properties:
- Left recursive grammars
- Left factored grammars
We will show that:
- Grammars that are not left recursive are easier to parse.
- Grammars that are left factored are easier to parse.
- Transform any left-recursive grammars using grammar rewrite.
- Transform any grammars into a left factored form using grammar rewrite.
2 Transformation Algorithms
2.1 Left factoring grammars
Let’s first define a few things.
2.1.1 Longest prefix and string substraction
Let \(\alpha\) and \(\beta\) be two arbitrary strings in \((\Sigma_0\cup\Sigma_1)^*\).
Example:
Suppose:
- \(\alpha = abcab\)
- \(\beta = abdab\)
Then, we have:
- \(\mathbf{LC}(\alpha, \beta) = ab\)
- \(\alpha - a = bcab\)
- \(\alpha - ab = cab\)
- \(\beta - abd = ab\)
- \(\alpha - \beta\) is undefined because \(\beta\not\in\mathbf{Pre}(\alpha)\).
2.1.2 Rewriting of non-left factored grammars
Suppose a grammar is not left factored. Then, there two productions of the form:
- \(\mathrm{rule}_1 = A : \alpha\beta_1\)
- \(\mathrm{rule}_2 = A : \alpha\beta_2\)
where \(\alpha = \mathbf{LC}(\mathrm{body}(\mathrm{rule}_1),\mathrm{body}(\mathrm{rule}_2))\)
The language \(L(A)\) is preserved with the grammar:
Remove the production \(\mathrm{rule}_1\) and \(\mathrm{rule}_2\) from the grammar.
Define a new syntactic variable \(A'\).
Add the following new productions to the grammar: \[ \begin{eqnarray*} A' &=& \beta_1 | \beta_2 \tag{1} \\ A &=& \alpha A' \tag{2} \end{eqnarray*} \]
This can be extended to the more general case of multiple production rules having a common prefix.
2.1.3 Example
Consider the grammar:
\[ \begin{eqnarray} E &:& E + E \\ &|& E * E \\ &|& \mathrm{number} \\ \end{eqnarray} \]
The grammar is not left-factored. We can apply the transformation:
Introduce a new variable \(E'\) with the productions: \[ E' = + E | * E \]
The new grammar is:
\[ \begin{eqnarray} E &:& E E' \\ &|& \mathrm{number} \\ E' &:& + E \\ &|& * E \end{eqnarray} \]
2.2 Dealing with immediate left recursion
Now, consider an immediately left recursive production in \(G\) (but we assume \(G\) is left-factored):
\[ A : A\alpha | \beta_1 | \beta_2 | \dots | \beta_n \]
\[ L(A) = \left( \bigcup_{1\leq i\leq n} L(\beta_i) \right) L(\alpha)^* \]
Note, we can rewrite the expression as: \[ \begin{eqnarray*} L(A) &=& \left(\bigcup_{1\leq i\leq n} L(\beta_i)\right) L(\alpha)^* \\ &=& \bigcup_{1\leq i\leq n} L(\beta_i) L(\alpha)^* \end{eqnarray*} \]
We can define a new syntactic variable \(A':\epsilon | \alpha A'\).
Note:
- \(L(A') = L(\alpha)^*\)
- \(L(A) = \bigcup_i\left(L(\beta_i)L(A')\right)\)
It is easy to see that:
\[ \begin{eqnarray} A &:& \beta_1 A' | \beta_2 A' | \dots | \beta_n A' \\ A' &:& \epsilon | \alpha A' \end{eqnarray} \]
We can use this rewriting technique to remove all immediate left recursions in \(G\).
2.3 Example:
Consider the grammar:
\[ \begin{eqnarray} E &:& E + E \\ &|& (\ E\ ) \\ &|& \mathrm{number} \\ \end{eqnarray} \]
This grammar is left-factored, and it is immediately left recursive. We can remove the left recursion as follows:
Identify the following in the context of: \(A = A\alpha | \beta_1 | \beta_2 | \dots\).
- \(A = E\)
- \(\alpha = + E\)
- \(\beta_1 = \mathrm{number}\)
- \(\beta_2 = (\ E\ )\)
Let’s apply the grammar transformation:
Introduce a new syntactic variable: \[ \begin{eqnarray} E' &:& \epsilon \\ &|& + E E' \end{eqnarray} \]
Rewrite \(E\) as: \[ \begin{eqnarray} E &:& (\ E\ )\ E' \\ &|& \mathrm{number}\ E' \end{eqnarray} \]
2.4 Dealing with general left recursion
We want to extend the grammar rewriting to handle general left recursion. This is done by the following approach:
Order all syntactic variables in \(\Sigma_1\). The order does not matter, but it must be fixed throughout the algorithm.
Assign to each syntactic variable \(A\) a rank \(\mathrm{rank}(A)\) which is the position of \(A\) in the order. The first position has \(\mathrm{rank}(A) = 1\), etc.
A grammar \(G\) is free of left recursion if and only if it is free of rank violations
Therefore, we will now focus on eliminating rank violations using grammar rewriting.
The loop invariance is that up to \(A_i\), we ensure that there is no rank violations after the \(i\)-th iteration.
2.5 Challenge:
2.5.1
Can you remove the left recursion of:
\[ \begin{eqnarray} E &:& E + E \\ &|& E * E \\ &|& (\ E\ ) \\ &|& \mathrm{number} \end{eqnarray} \]
Make sure you first perform left factoring.
2.5.2
Try to remove left recursion with two different orderings:
- \(S,\ A\), and then
- \(A,\ S\).
\[ \begin{eqnarray} S &:& Aa\ |\ b \\ A &:& Ac\ |\ Sd\ |\ \epsilon\\ \end{eqnarray} \]