Grammar Tranformations

Author

Ken Pu

1 Some properties of a grammar

1.1 Left recursion

Consider a grammar G. We will define a number of properties of G. These properties will be important when we build a parser.

Left recursive grammars

A grammar $G$ is left recursive if there exists a syntactic variable $X\in\Sigma_1(G)$ such that it can derive a sential form that starts with itself. Namely, \[ X \Rightarrow X\alpha \] for some string $\alpha\in(\Sigma_0\cup\Sigma_1)^*$.

Immediate left recursive grammars

A grammar is immediate left recursive if $X$ as a production rule starting with $X$ in its body. Namely: \[ X:X\alpha \in G \]

1.2 Some examples of left recursion

1.2.1 Immediate left recursion

Consider the grammar for arithmetic expressions:

expr : expr + expr
     | expr * expr
     | Number
     ;

This is immedidate left recursive because there are two left-recursive productions:

expr : expr + expr
expr : expr * expr

1.2.2 (Non-immediate) left recursion

Consider the following:

\[ \begin{eqnarray*} A &:& B \\ B &:& Aa \end{eqnarray*} \] where $\Sigma_0 = \{a\}$ and $\Sigma_1=\{A, B\}$.

This grammar does not have any left-recursive productions, and yet is left recursive. We can show this through derivations of either $A$ or $B$.

$A\rightarrow B\rightarrow Aa$
$B\rightarrow Aa\rightarrow Ba$

1.3 Left factored grammars

Let’s first define some basic terminologies:

Prefix of strings

Given two strings $\alpha$ and $\beta$, we say that $\alpha$ is a prefix of $\beta$ if $\beta$ starts with $\alpha$:

\[ \exists \gamma.\ \beta = \alpha\gamma \]

The set of all prefixes of $\beta$ is denoted as: $\mathbf{pre}(\beta)$.

Note:

By definition, the empty string $\epsilon\in\mathbf{pre}(\beta)$ for all possible strings $\beta\in\Sigma^*$.

Example:

Given a string $\beta = abc$, it has prefixes:

\[\mathbf{pre} = \{\epsilon, a, ab, abc\}\]

There are three non-empty prefixes:

\[\{a, ab, abc\}\]

Common Prefixes

The common prefixes of two strings $\alpha$ and $\beta$ are given by $\mathbf{pre}(\alpha)\cap\mathbf{pre}(\beta)$.

It is denoted by: \[\mathbf{pre}(\alpha, \beta)\]

Left factored grammar

A grammar $G$ is left factored if for any syntactic variable $X$, no two alternations of $X$ share a non-empty common prefixes. Namely,

\[ \forall X\in\Sigma_1. \ (X:\alpha)\in G\mathrm{\ and\ } (X:\beta)\in G \quad\mathrm{implies}\quad \mathbf{pre}(\alpha,\beta) = \{\epsilon\} \]

Example

Consider the ANTLR grammar for arithmetics:
expr: expr + expr
    | expr * expr
    | Number
    ;
We see that this grammar is not left-factored because the two alternations for expr share a non-empty common prefix:
expr: expr + expr
expr: expr * expr
We have:
pre(expr + expr, expr * expr) = { expr }

1.4 Significance

We have defined two properties:

Left recursive grammars
Left factored grammars

We will show that:

Grammars that are not left recursive are easier to parse.
Grammars that are left factored are easier to parse.

Objective:

Transform any left-recursive grammars using grammar rewrite.
Transform any grammars into a left factored form using grammar rewrite.

2 Transformation Algorithms

2.1 Left factoring grammars

Let’s first define a few things.

2.1.1 Longest prefix and string substraction

Let $\alpha$ and $\beta$ be two arbitrary strings in $(\Sigma_0\cup\Sigma_1)^*$.

Longest common prefix:

\[ \mathbf{LC(\alpha, \beta)} = \mathrm{argmax}\{\mathbf{length}(s): s\in\mathbf{pre}(\alpha,\beta)\} \]

It’s the longest element in the common prefix of $\alpha$ and $\beta$.

Suffix:

If $s\in\mathbf{Pre}(\beta)$, the suffix is the string that comes after $s$, denoted by $\beta - s$. Namely: \[ \beta = s (\beta - s) \]

Example:

Suppose:

$\alpha = abcab$
$\beta = abdab$

Then, we have:

$\mathbf{LC}(\alpha, \beta) = ab$
$\alpha - a = bcab$
$\alpha - ab = cab$
$\beta - abd = ab$
$\alpha - \beta$ is undefined because $\beta\not\in\mathbf{Pre}(\alpha)$.

2.1.2 Rewriting of non-left factored grammars

Suppose a grammar is not left factored. Then, there two productions of the form:

$\mathrm{rule}_1 = A : \alpha\beta_1$
$\mathrm{rule}_2 = A : \alpha\beta_2$

where $\alpha = \mathbf{LC}(\mathrm{body}(\mathrm{rule}_1),\mathrm{body}(\mathrm{rule}_2))$

Claim

The language $L(A)$ is preserved with the grammar:

Remove the production $\mathrm{rule}_1$ and $\mathrm{rule}_2$ from the grammar.
Define a new syntactic variable $A'$.
Add the following new productions to the grammar: \[ \begin{eqnarray*} A' &=& \beta_1 | \beta_2 \tag{1} \\ A &=& \alpha A' \tag{2} \end{eqnarray*} \]

This can be extended to the more general case of multiple production rules having a common prefix.

\begin{algorithm} \caption{Left factor a grammar} \begin{algorithmic} \Procedure{LeftFactor}{$G$} \ForAll{syntactic variable $A\in G$} \ForAll{pairs of rules $r_1 = A:\beta_1$ and $r_2 = A:\beta_2$} \If{$\mathbf{pre}(\beta_1,\beta_2)\not=\{\epsilon\}$} \State Rewrite grammar $G$ to resolve $r_1$ and $r_2$. \EndIf \EndFor \EndFor \Return{isLeftFactored} \EndProcedure \end{algorithmic} \end{algorithm}

2.1.3 Example

Consider the grammar:

\[ \begin{eqnarray} E &:& E + E \\ &|& E * E \\ &|& \mathrm{number} \\ \end{eqnarray} \]

The grammar is not left-factored. We can apply the transformation:

Introduce a new variable $E'$ with the productions: \[ E' = + E | * E \]
The new grammar is:

\[ \begin{eqnarray} E &:& E E' \\ &|& \mathrm{number} \\ E' &:& + E \\ &|& * E \end{eqnarray} \]

2.2 Dealing with immediate left recursion

Reminder on notation

Given a grammar $G$ and some string $\alpha\in\Sigma^*$, the language $L(\alpha)$ is all the sentences that can be derived from $\alpha$.

\[ L(\alpha) = \{s\in\Sigma_0^*: \alpha\Rightarrow s\} \]

Now, consider an immediately left recursive production in $G$ (but we assume $G$ is left-factored):

\[ A : A\alpha | \beta_1 | \beta_2 | \dots | \beta_n \]

Claim:

\[ L(A) = \left( \bigcup_{1\leq i\leq n} L(\beta_i) \right) L(\alpha)^* \]

Note, we can rewrite the expression as: \[ \begin{eqnarray*} L(A) &=& \left(\bigcup_{1\leq i\leq n} L(\beta_i)\right) L(\alpha)^* \\ &=& \bigcup_{1\leq i\leq n} L(\beta_i) L(\alpha)^* \end{eqnarray*} \]

We can define a new syntactic variable $A':\epsilon | \alpha A'$.

Note:

$L(A') = L(\alpha)^*$
$L(A) = \bigcup_i\left(L(\beta_i)L(A')\right)$

It is easy to see that:

\[ \begin{eqnarray} A &:& \beta_1 A' | \beta_2 A' | \dots | \beta_n A' \\ A' &:& \epsilon | \alpha A' \end{eqnarray} \]

We can use this rewriting technique to remove all immediate left recursions in $G$.

2.3 Example:

Consider the grammar:

\[ \begin{eqnarray} E &:& E + E \\ &|& (\ E\ ) \\ &|& \mathrm{number} \\ \end{eqnarray} \]

This grammar is left-factored, and it is immediately left recursive. We can remove the left recursion as follows:

Identify the following in the context of: $A = A\alpha | \beta_1 | \beta_2 | \dots$.

$A = E$
$\alpha = + E$
$\beta_1 = \mathrm{number}$
$\beta_2 = (\ E\ )$

Let’s apply the grammar transformation:

Introduce a new syntactic variable: \[ \begin{eqnarray} E' &:& \epsilon \\ &|& + E E' \end{eqnarray} \]
Rewrite $E$ as: \[ \begin{eqnarray} E &:& (\ E\ )\ E' \\ &|& \mathrm{number}\ E' \end{eqnarray} \]

2.4 Dealing with general left recursion

We want to extend the grammar rewriting to handle general left recursion. This is done by the following approach:

Order all syntactic variables in $\Sigma_1$. The order does not matter, but it must be fixed throughout the algorithm.
Assign to each syntactic variable $A$ a rank $\mathrm{rank}(A)$ which is the position of $A$ in the order. The first position has $\mathrm{rank}(A) = 1$, etc.

Definition: rank violation

If we have a production of the form:

\[ X : Y\alpha \] where $X, Y\in\Sigma_1$, we require $\mathrm{rank}(X) < \mathrm{rank}(Y)$. Otherwise, we call the production $(X:Y\alpha)$ a rank violation.

Theorem:

A grammar $G$ is free of left recursion if and only if it is free of rank violations

Therefore, we will now focus on eliminating rank violations using grammar rewriting.

\begin{algorithm} \caption{Eliminate left recursion} \begin{algorithmic} \Procedure{EliminateGeneralLeftRecursion}{$G$} \State \Call{EliminateImmediateLeftRecursion}{$G$} \For{$i\in[1, |\Sigma_1|]$} \For{$j\in[1, i-1]$} \State $P\gets\{A_i:A_j\alpha\}$ \COMMENT{$P$ are rank violation since $j < i$} \IF{$P\not=\emptyset$} \STATE Let $\{\beta_k\}$ be the alternations of $A_j$. \FORALL{$(A_i:A_j\alpha)\in P$} \STATE Replace $A_i:A_j\alpha$ with $A_i:\beta_1\alpha\ |\ \beta_2\alpha\ |\ \dots\ | \beta_k\alpha$ \ENDFOR \ENDIF \EndFor \STATE\CALL{EliminateImmediateLeftRecursion}{$G$} \EndFor \EndProcedure \end{algorithmic} \end{algorithm}

Proof of correctness (outline):

The loop invariance is that up to $A_i$, we ensure that there is no rank violations after the $i$-th iteration.

2.5 Challenge:

2.5.1

Can you remove the left recursion of:

\[ \begin{eqnarray} E &:& E + E \\ &|& E * E \\ &|& (\ E\ ) \\ &|& \mathrm{number} \end{eqnarray} \]

Make sure you first perform left factoring.

2.5.2

Try to remove left recursion with two different orderings:

$S,\ A$, and then
$A,\ S$.

\[ \begin{eqnarray} S &:& Aa\ |\ b \\ A &:& Ac\ |\ Sd\ |\ \epsilon\\ \end{eqnarray} \]