General overview of top-down parsing algorithm
We want to start with:
- a grammar \(G\) and its start symbol \(S\).
- an input string \(x = x_1 x_2 \dots x_n\) where \(x_i\in\Sigma_0\).
We want to generate a parse tree \(T\) such that its root is \(S\), and the leaf nodes are \(x_1, x_2, \dots, x_n\).
S
/|\
.....
/ \
| | | |
x1 x2 ... xn
At each intermediate node, we have:
X
/ | \
z1 ... zk
where \(X:z_1\dots z_k\) is a production in \(G\).
The top-down parsing grows the tree from the root and downwards to the leaf nodes.
S
/ | \
. . .
? ? ?
/ | \
x0 ...... xn
Parsing strategy
We need a way to select a production from \(G\):
\[ \mathrm{select} : (X, \mathrm{context}) \to (X:\alpha) \]
where:
- \(X\in\Sigma_1\) is the leaf to expand.
- \(\mathrm{context}\) is the context information (more on this later) that can help us to choose the production.
- \(X:\alpha\) is the production to be used to expand \(X\).
S
/ | \
. . .
/ | \
X <--------- how do we expand X
/|\
?
Left-most expansion
Let’s first define what a frontier is of a tree:
Given a tree \(T\), its frontier \(\mathrm{frontier}(F)\) is a list of its leaf nodes.
For example:
S
/ | \
A B A
/ | |
x y A
/ \
B y
The frontier would be:
\[
\mathrm{frontier}(T) = [x, y, B, y, A]
\]
We always the grow an incomplete parse tree by expanding the left most syntactic variable in the its frontier: \[ X = \mathrm{leftmost}(\mathrm{frontier}(T)) \]
Given \(X\) is to be expanded, which production \(X:\alpha\) should we pick?
- Must know what tokens \(X\) is responsible to produce.
- First and follow sets of \(X\).
First and Follow sets
The first and follow sets help us figure out what tokens a syntactic variable can derive, and this will help us to select productions.
First sets of strings
The \(\mathrm{first}\) is a function that maps strings in \((\Sigma_1\cup\Sigma_0)^*\) to sets of terminals.
\[
\mathrm{first}:\Sigma^*\to\mathbf{SET}(\Sigma_0)
\]
\[
\mathrm{first}(\alpha) \simeq \{x[0] : L(\alpha)\}
\]
The intuition is that \(\mathrm{first}(\alpha)\) is all possible terminals that can be the first symbol in \(L(x)\).
- What if \(\alpha\Rightarrow \epsilon\)?
- It needs some special treatment because \(\epsilon[0]\) is undefined.
We will revise the definition to handle this.
\[
\mathrm{first}:\Sigma^*\to\mathbf{SET}(\Sigma_0\cup\{\epsilon\})
\]
\[
\mathrm{first}(\alpha) =
\left\{\begin{array}{ll}
\{x[0] : L(\alpha)\}\cup\{\epsilon\} &\mathrm{if}\ \alpha\Rightarrow\epsilon \\
\{x[0] : L(\alpha)\} &\mathrm{else}
\end{array}\right.
\]
Follow-sets of syntactic variables
The \(\mathrm{follow}\) function maps syntactic variables to sets of terminals.
\[
\mathrm{follow} : \Sigma_1 \to \mathbf{SET}(\Sigma_0)
\]
Given a syntactic variable \(Z\in\Sigma_1\). If \(S\Rightarrow \alpha {\color{red} Z a}\beta\) for some \(\alpha, \beta\), and \(a\in\Sigma_0\), then \(a\in\mathrm{follow}(Z)\).
The intuition is that \(\mathrm{follow}(Z)\) is all possible terminals that can follow \(Z\) in a sentential form of the grammar \(G\).
- What if \(S\Rightarrow\alpha Z\)?
- It needs special attention because nothing follows \(Z\).
\[
\mathrm{follow} : \Sigma_1 \to \mathbf{SET}(\Sigma_0\cup\{\$\})
\]
Given a syntactic variable \(Z\in\Sigma_1\).
If \(S\Rightarrow \alpha {\color{red} Z a}\beta\) for some \(\alpha, \beta\), and \(a\in\Sigma_0\), then \(a\in\mathrm{follow}(Z)\).
If \(S\Rightarrow \alpha {\color{red} Z}\), then \(\$\in\mathrm{follow}(Z)\)
Algorithms to compute \(\mathrm{first}\) and \(\mathrm{follow}\)
Computing \(\mathrm{FirstSet}\) for a single symbol
\begin{algorithm} \caption{Computing the FIRST sets} \begin{algorithmic} \Require{$X\in\Sigma_1\cup\Sigma_0$} \Procedure{FirstSet}{$X$} \State $F\gets\emptyset$ \IF{$X\in\Sigma_0$} \STATE{$F\gets \{X\}$} \ELSE \FORALL{$X:Y_1 Y_2 \dots Y_k\in G$} \IF{$k = 0$} \STATE $F\gets F\cup\{\epsilon\}$ \ELSE \FORALL{$i\in[1, k]$} \STATE $F_y = \mathrm{FIRSTSET}(Y_i)$ \STATE $F\gets F\cup({\color{red} F_y-\{\epsilon\}})$ \IF{$\epsilon\not\in F_y$} \BREAK \ENDIF \ENDFOR \IF {$i = k$} \STATE $F\gets F\cup\{\epsilon\}$ \ENDIF \ENDIF \ENDFOR \ENDIF \EndProcedure \end{algorithmic} \end{algorithm}
- What is the termination condition?
- The algorithm terminates if and only if the grammar is not left-recursive.
\(\mathrm{FirstSet}\) for strings
\begin{algorithm} \caption{Computing the FIRST sets} \begin{algorithmic} \Require{$\alpha\in(\Sigma_1\cup\Sigma_0)^*$} \Procedure{FirstSet}{$\alpha$} \STATE let $Y_1 Y_2 \dots Y_n = \alpha$ \STATE $F\gets\emptyset$ \FOR{$i\in[1\to n]$} \STATE $F_y = \mathrm{FIRSTSET}(Y_i)$ \STATE $F\gets F\cup(F_y - \{\epsilon\})$ \IF{$\epsilon\not\in F_y$} \BREAK \ENDIF \ENDFOR \IF{$i = n$} \STATE $F\gets F\cup\{\epsilon\}$ \ENDIF \EndProcedure \end{algorithmic} \end{algorithm}
Computing \(\mathrm{FOLLOW}\) Sets
The follow-sets are computed by an iterative algorithm that uses the \(\mathrm{FIRSTSET}\) sets. The algorithm computes all FOLLOW-sets simultaneously.
\begin{algorithm} \caption{Computing the FOLLOW sets} \begin{algorithmic} \Require{$G$ is a set of productions} \Require{$S$ is the start symbol} \Procedure{FollowSets}{$G, S$} \STATE $F\gets\mathrm{empty\ HashMap}\left<\Sigma_1\to\mathrm{SET}(\Sigma_0)\right>$ \STATE $F[S]\gets\{\$\}$ \REPEAT \FORALL{$X:Y_1 Y_2\dots Y_k\in G$} \IF{$Y_i\in\Sigma_1$} \STATE \COMMENT{Update the follow-set of $Y_i$} \STATE $F'[Y_i]\gets F[Y_i]\cup\mathrm{FirstSet}(Y_{i+1}\dots Y_k)$ \IF{$i = k$ or $\epsilon\in\mathrm{FirstSet}(Y_{i+1}\dots Y_k)$} \STATE $F'[Y_i]\gets F[Y_i]\cup F[X]$ \ENDIF \ENDIF \ENDFOR \IF{$F' = F$} \BREAK \ELSE \STATE $F\gets F'$ \ENDIF \UNTIL{} \EndProcedure \end{algorithmic} \end{algorithm}
In next lecture, we will need to figure out how to use \(\mathrm{FirstSet}\) and \(\mathrm{FollowSets}\) to perform production selection.