Grammar Transformations

Author

Ken Pu

1 Some properties of a grammar

1.1 Left recursion

Consider a grammar G. We will define a number of properties of G. These properties will be important when we build a parser.

Left recursive grammars

A grammar \(G\) is left recursive if there exists a syntactic variable \(X\in\Sigma_1(G)\) that can derive a sentential form starting with \(X\) itself. Namely, \[ X \Rightarrow X\alpha \] for some string \(\alpha\in(\Sigma_0\cup\Sigma_1)^*\).

Immediate left recursive grammars

A grammar is immediately left recursive if some syntactic variable \(X\) has a production rule whose body starts with \(X\). Namely: \[ (X : X\alpha) \in G \] for some string \(\alpha\in(\Sigma_0\cup\Sigma_1)^*\).

1.2 Some examples of left recursion

1.2.1 Immediate left recursion

Consider the grammar for arithmetic expressions:

expr : expr + expr
     | expr * expr
     | Number
     ;

This grammar is immediately left recursive because there are two left-recursive productions:

  • expr : expr + expr
  • expr : expr * expr

1.2.2 (Non-immediate) left recursion

Consider the following:

\[ \begin{eqnarray*} A &:& B \\ B &:& Aa \end{eqnarray*} \] where \(\Sigma_0 = \{a\}\) and \(\Sigma_1=\{A, B\}\).

This grammar does not have any immediately left-recursive productions, and yet it is left recursive. We can show this through derivations from either \(A\) or \(B\):

  • \(A\Rightarrow B\Rightarrow Aa\)
  • \(B\Rightarrow Aa\Rightarrow Ba\)
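
Such reachability arguments can be checked mechanically. Below is a minimal Python sketch; the dict-of-tuples encoding of a grammar and the helper names are illustrative assumptions, and the sketch only applies when no production body is empty and no leading symbol can derive \(\epsilon\), in which case left recursion reduces to a cycle in the leading-symbol relation.

# A grammar is a dict mapping each syntactic variable to a list of
# production bodies, where each body is a tuple of symbols.

def is_left_recursive(grammar):
    """X is left recursive iff X can reach itself by repeatedly stepping to
    the leading symbol of one of its production bodies."""
    def leading_variables(x):
        return {b[0] for b in grammar[x] if b and b[0] in grammar}

    for start in grammar:
        seen, frontier = set(), leading_variables(start)
        while frontier:
            y = frontier.pop()
            if y == start:
                return True
            if y not in seen:
                seen.add(y)
                frontier |= leading_variables(y)
    return False

# The grammar A : B, B : A a from above is left recursive:
print(is_left_recursive({"A": [("B",)], "B": [("A", "a")]}))   # True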

1.3 Left factored grammars

Let’s first define some basic terminology:

Prefix of strings

Given two strings \(\alpha\) and \(\beta\), we say that \(\alpha\) is a prefix of \(\beta\) if \(\beta\) starts with \(\alpha\):

\[ \exists \gamma.\ \beta = \alpha\gamma \]

The set of all prefixes of \(\beta\) is denoted as: \(\mathbf{pre}(\beta)\).

Note:

By definition, the empty string \(\epsilon\in\mathbf{pre}(\beta)\) for all possible strings \(\beta\in\Sigma^*\).

Example:

Given a string \(\beta = abc\), it has prefixes:

\[\mathbf{pre}(\beta) = \{\epsilon, a, ab, abc\}\]

There are three non-empty prefixes:

\[\{a, ab, abc\}\]

Common Prefixes

The common prefixes of two strings \(\alpha\) and \(\beta\) are given by \(\mathbf{pre}(\alpha)\cap\mathbf{pre}(\beta)\).

It is denoted by: \[\mathbf{pre}(\alpha, \beta)\]
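
As a concrete illustration, here is a minimal Python sketch of \(\mathbf{pre}(\beta)\) and \(\mathbf{pre}(\alpha,\beta)\). As in the earlier sketch, a string over \(\Sigma_0\cup\Sigma_1\) is represented as a tuple of symbols; this representation is an illustrative choice reused in the later sketches.

def prefixes(beta):
    """All prefixes of beta, including the empty string ()."""
    return {beta[:i] for i in range(len(beta) + 1)}

def common_prefixes(alpha, beta):
    """pre(alpha, beta) = pre(alpha) ∩ pre(beta)."""
    return prefixes(alpha) & prefixes(beta)

# pre(abc) = {ε, a, ab, abc}, with ε represented by the empty tuple ().
print(prefixes(("a", "b", "c")))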

Left factored grammar

A grammar \(G\) is left factored if for any syntactic variable \(X\), no two alternations of \(X\) share a non-empty common prefix. Namely,

\[ \forall X\in\Sigma_1. \ (X:\alpha)\in G\mathrm{\ and\ } (X:\beta)\in G \quad\mathrm{implies}\quad \mathbf{pre}(\alpha,\beta) = \{\epsilon\} \]

Example

Consider the ANTLR grammar for arithmetics:

expr: expr + expr
    | expr * expr
    | Number
    ;

We see that this grammar is not left-factored because the two alternations for expr share a non-empty common prefix:

expr: expr + expr
expr: expr * expr

We have:

pre(expr + expr, expr * expr) = { ε, expr } ≠ { ε }
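
Using the prefix helpers sketched earlier, the left-factored property can be checked directly. The dictionary representation of the grammar below is again only an illustrative encoding.

from itertools import combinations

def is_left_factored(grammar):
    """True iff no two alternations of any variable share a non-empty prefix."""
    for bodies in grammar.values():
        for b1, b2 in combinations(bodies, 2):
            if common_prefixes(b1, b2) != {()}:   # {()} is the set {ε}
                return False
    return True

expr_grammar = {
    "expr": [("expr", "+", "expr"), ("expr", "*", "expr"), ("Number",)],
}
print(is_left_factored(expr_grammar))   # False: the first two bodies share expr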

1.4 Significance

We have defined two properties:

  1. Left recursive grammars
  2. Left factored grammars

We will show that:

  1. Grammars that are not left recursive are easier to parse.
  2. Grammars that are left factored are easier to parse.

Objective:

  • Transform any left-recursive grammar into an equivalent grammar without left recursion using grammar rewriting.
  • Transform any grammar into an equivalent left-factored form using grammar rewriting.

2 Transformation Algorithms

2.1 Left factoring grammars

Let’s first define a few things.

2.1.1 Longest common prefix and string subtraction

Let \(\alpha\) and \(\beta\) be two arbitrary strings in \((\Sigma_0\cup\Sigma_1)^*\).

Longest common prefix:

\[ \mathbf{LC}(\alpha, \beta) = \mathop{\mathrm{argmax}}_{s\in\mathbf{pre}(\alpha,\beta)} \mathbf{length}(s) \]

It is the longest element of \(\mathbf{pre}(\alpha,\beta)\), the set of common prefixes of \(\alpha\) and \(\beta\).

Suffix:

If \(s\in\mathbf{pre}(\beta)\), the corresponding suffix is the remainder of \(\beta\) after \(s\), denoted by \(\beta - s\). Namely: \[ \beta = s (\beta - s) \]

Example:

Suppose:

  • \(\alpha = abcab\)
  • \(\beta = abdab\)

Then, we have:

  • \(\mathbf{LC}(\alpha, \beta) = ab\)
  • \(\alpha - a = bcab\)
  • \(\alpha - ab = cab\)
  • \(\beta - abd = ab\)
  • \(\alpha - \beta\) is undefined because \(\beta\not\in\mathbf{pre}(\alpha)\).
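
These two operations are easy to express in code. The sketch below reuses common_prefixes from the earlier sketch and again represents strings as tuples of symbols.

def longest_common_prefix(alpha, beta):
    """LC(alpha, beta): the longest element of pre(alpha, beta)."""
    return max(common_prefixes(alpha, beta), key=len)

def subtract(beta, s):
    """beta - s: the suffix of beta that follows its prefix s."""
    if beta[:len(s)] != tuple(s):
        raise ValueError("s is not a prefix of beta")
    return beta[len(s):]

alpha, beta = tuple("abcab"), tuple("abdab")
print(longest_common_prefix(alpha, beta))   # ('a', 'b'), i.e. ab
print(subtract(alpha, ("a", "b")))          # ('c', 'a', 'b'), i.e. cab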

2.1.2 Rewriting of non-left factored grammars

Suppose a grammar is not left factored. Then there are two productions of the form:

  • \(\mathrm{rule}_1 = A : \alpha\beta_1\)
  • \(\mathrm{rule}_2 = A : \alpha\beta_2\)

where \(\alpha = \mathbf{LC}(\mathrm{body}(\mathrm{rule}_1),\mathrm{body}(\mathrm{rule}_2))\)

Claim

The language \(L(A)\) is preserved by the following rewrite of the grammar:

  1. Remove the production \(\mathrm{rule}_1\) and \(\mathrm{rule}_2\) from the grammar.

  2. Define a new syntactic variable \(A'\).

  3. Add the following new productions to the grammar: \[ \begin{eqnarray*} A' &:& \beta_1\ |\ \beta_2 \\ A &:& \alpha A' \end{eqnarray*} \]

This can be extended to the more general case of multiple production rules having a common prefix.

\begin{algorithm} \caption{Left factor a grammar} \begin{algorithmic} \Procedure{LeftFactor}{$G$} \Repeat \ForAll{syntactic variables $A\in G$} \ForAll{pairs of rules $r_1 = A:\beta_1$ and $r_2 = A:\beta_2$} \If{$\mathbf{pre}(\beta_1,\beta_2)\not=\{\epsilon\}$} \State Rewrite grammar $G$ to resolve $r_1$ and $r_2$. \EndIf \EndFor \EndFor \Until{no rewrite was applied} \State \Return $G$ \EndProcedure \end{algorithmic} \end{algorithm}
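
The rewrite above, applied repeatedly, yields a left-factoring procedure. The following Python sketch is one way to realize it, building on the earlier helpers; the primed naming scheme for fresh variables is an assumption made for the illustration.

from itertools import combinations

def left_factor(grammar):
    """Repeatedly apply the rewrite of Section 2.1.2 until the grammar is left factored."""
    grammar = {a: list(bodies) for a, bodies in grammar.items()}
    while True:
        # Find a variable with two bodies sharing a non-empty prefix.
        target = None
        for a, bodies in grammar.items():
            for b1, b2 in combinations(bodies, 2):
                lc = longest_common_prefix(b1, b2)
                if lc != ():
                    target = (a, lc)
                    break
            if target:
                break
        if target is None:
            return grammar
        a, lc = target
        shared = [b for b in grammar[a] if b[:len(lc)] == lc]
        rest = [b for b in grammar[a] if b[:len(lc)] != lc]
        a_new = a + "'"                       # fresh variable name (assumed scheme)
        while a_new in grammar:
            a_new += "'"
        grammar[a] = rest + [lc + (a_new,)]               # A  : ... | alpha A'
        grammar[a_new] = [b[len(lc):] for b in shared]    # A' : beta_1 | beta_2 | ...

Applied to the arithmetic grammar of the next example, this sketch should produce the same factored grammar that is derived there by hand.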

2.1.3 Example

Consider the grammar:

\[ \begin{eqnarray} E &:& E + E \\ &|& E * E \\ &|& \mathrm{number} \\ \end{eqnarray} \]

The grammar is not left-factored. We can apply the transformation:

  • Introduce a new variable \(E'\) with the productions: \[ E' : + E\ |\ * E \]

  • The new grammar is:

    \[ \begin{eqnarray} E &:& E E' \\ &|& \mathrm{number} \\ E' &:& + E \\ &|& * E \end{eqnarray} \]

2.2 Dealing with immediate left recursion

Reminder on notation

Given a grammar \(G\) and a string \(\alpha\in(\Sigma_0\cup\Sigma_1)^*\), the language \(L(\alpha)\) is the set of all sentences that can be derived from \(\alpha\):

\[ L(\alpha) = \{s\in\Sigma_0^*: \alpha\Rightarrow s\} \]

Now, consider an immediately left recursive production in \(G\) (but we assume \(G\) is left-factored):

\[ A : A\alpha | \beta_1 | \beta_2 | \dots | \beta_n \]

Claim:

\[ L(A) = \left( \bigcup_{1\leq i\leq n} L(\beta_i) \right) L(\alpha)^* \]

Note, we can rewrite the expression as: \[ \begin{eqnarray*} L(A) &=& \left(\bigcup_{1\leq i\leq n} L(\beta_i)\right) L(\alpha)^* \\ &=& \bigcup_{1\leq i\leq n} L(\beta_i) L(\alpha)^* \end{eqnarray*} \]

We can define a new syntactic variable \(A':\epsilon | \alpha A'\).

Note:

  • \(L(A') = L(\alpha)^*\)
  • \(L(A) = \bigcup_i\left(L(\beta_i)L(A')\right)\)

It is easy to see that \(L(A)\) is exactly the language generated by the rewritten productions:

\[ \begin{eqnarray} A &:& \beta_1 A' | \beta_2 A' | \dots | \beta_n A' \\ A' &:& \epsilon | \alpha A' \end{eqnarray} \]

We can use this rewriting technique to remove all immediate left recursions in \(G\).
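
A direct transcription of this rewrite into Python is sketched below, using the same grammar representation as the earlier sketches; \(\epsilon\) is the empty tuple, and the primed naming of \(A'\) is again an assumed convention. The sketch also handles several left-recursive alternations \(A : A\alpha_1\ |\ A\alpha_2\ |\ \dots\) by collecting all of the \(\alpha_i\).

def eliminate_immediate_left_recursion(grammar):
    """Rewrite A : A a1 | ... | A am | b1 | ... | bn  into
       A  : b1 A' | ... | bn A'
       A' : eps | a1 A' | ... | am A'"""
    result = {}
    for a, bodies in grammar.items():
        alphas = [b[1:] for b in bodies if b[:1] == (a,)]    # bodies of the form A alpha_i
        betas = [b for b in bodies if b[:1] != (a,)]         # bodies beta_j
        if not alphas:
            result[a] = list(bodies)
            continue
        a_new = a + "'"
        while a_new in grammar or a_new in result:           # keep the new name fresh
            a_new += "'"
        result[a] = [b + (a_new,) for b in betas]            # beta_j A'
        result[a_new] = [()] + [alpha + (a_new,) for alpha in alphas]   # eps | alpha_i A'
    return result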

2.3 Example:

Consider the grammar:

\[ \begin{eqnarray} E &:& E + E \\ &|& (\ E\ ) \\ &|& \mathrm{number} \\ \end{eqnarray} \]

This grammar is left-factored, and it is immediately left recursive. We can remove the left recursion as follows:

Matching against the pattern \(A : A\alpha\ |\ \beta_1\ |\ \beta_2\ |\ \dots\), we identify:

  • \(A = E\)
  • \(\alpha = + E\)
  • \(\beta_1 = \mathrm{number}\)
  • \(\beta_2 = (\ E\ )\)

Let’s apply the grammar transformation:

  • Introduce a new syntactic variable: \[ \begin{eqnarray} E' &:& \epsilon \\ &|& + E E' \end{eqnarray} \]

  • Rewrite \(E\) as: \[ \begin{eqnarray} E &:& (\ E\ )\ E' \\ &|& \mathrm{number}\ E' \end{eqnarray} \]
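
As a sanity check, running the sketch from Section 2.2 on this grammar should reproduce the hand-derived result (the tuple encoding is the same illustrative one used throughout):

E = {"E": [("E", "+", "E"), ("(", "E", ")"), ("number",)]}
print(eliminate_immediate_left_recursion(E))
# Expected: E  : ( E ) E'  |  number E'
#           E' : ε  |  + E E'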

2.4 Dealing with general left recursion

We want to extend the grammar rewriting to handle general left recursion. This is done by the following approach:

  • Order all syntactic variables in \(\Sigma_1\). The order does not matter, but it must be fixed throughout the algorithm.

  • Assign to each syntactic variable \(A\) a rank \(\mathrm{rank}(A)\) which is the position of \(A\) in the order. The first position has \(\mathrm{rank}(A) = 1\), etc.

Definition: rank violation

If we have a production of the form:

\[ X : Y\alpha \] where \(X, Y\in\Sigma_1\), we require \(\mathrm{rank}(X) < \mathrm{rank}(Y)\). Otherwise, we call the production \((X:Y\alpha)\) a rank violation.

Theorem:

If a grammar \(G\) is free of rank violations, then it is free of left recursion.

Therefore, we will now focus on eliminating rank violations using grammar rewriting.

\begin{algorithm} \caption{Eliminate left recursion} \begin{algorithmic} \Procedure{EliminateGeneralLeftRecursion}{$G$} \State \Call{EliminateImmediateLeftRecursion}{$G$} \For{$i\in[1, |\Sigma_1|]$} \For{$j\in[1, i-1]$} \State $P\gets\{(A_i:A_j\alpha)\in G\}$ \Comment{each production in $P$ is a rank violation since $j < i$} \If{$P\not=\emptyset$} \State Let $\beta_1, \dots, \beta_k$ be the alternations of $A_j$. \ForAll{$(A_i:A_j\alpha)\in P$} \State Replace $A_i:A_j\alpha$ with $A_i:\beta_1\alpha\ |\ \beta_2\alpha\ |\ \dots\ |\ \beta_k\alpha$. \EndFor \EndIf \EndFor \State \Call{EliminateImmediateLeftRecursion}{$G$} \EndFor \EndProcedure \end{algorithmic} \end{algorithm}

Proof of correctness (outline):

The loop invariant is that after the \(i\)-th iteration of the outer loop, no production whose head is one of \(A_1, \dots, A_i\) is a rank violation.
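
Following the algorithm above, a Python sketch of the general elimination might look like the following. It reuses eliminate_immediate_left_recursion from the sketch in Section 2.2, and it takes the fixed variable ordering from the dictionary's key order, which is an arbitrary but fixed choice.

def eliminate_general_left_recursion(grammar):
    grammar = eliminate_immediate_left_recursion(grammar)
    order = list(grammar.keys())          # fixed ranking A_1, A_2, ... of the variables
    for i, a_i in enumerate(order):
        for a_j in order[:i]:
            # Resolve rank violations A_i : A_j alpha by substituting A_j's alternations.
            new_bodies = []
            for body in grammar[a_i]:
                if body[:1] == (a_j,):
                    new_bodies += [b + body[1:] for b in grammar[a_j]]
                else:
                    new_bodies.append(body)
            grammar[a_i] = new_bodies
        grammar = eliminate_immediate_left_recursion(grammar)
    return grammar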

2.5 Challenge:

2.5.1

Can you remove the left recursion of:

\[ \begin{eqnarray} E &:& E + E \\ &|& E * E \\ &|& (\ E\ ) \\ &|& \mathrm{number} \end{eqnarray} \]

Make sure you first perform left factoring.

2.5.2

Try to remove the left recursion of the following grammar using two different orderings of the syntactic variables:

  1. \(S,\ A\), and then
  2. \(A,\ S\).

\[ \begin{eqnarray} S &:& Aa\ |\ b \\ A &:& Ac\ |\ Sd\ |\ \epsilon\\ \end{eqnarray} \]