First and Follow Sets

Author

Ken Pu

1 General overview of top-down parsing algorithm

We want to start with:

  • a grammar \(G\) and its start symbol \(S\).
  • an input string \(x = x_1 x_2 \dots x_n\) where \(x_i\in\Sigma_0\).

We want to generate a parse tree \(T\) such that its root is \(S\), and the leaf nodes are \(x_1, x_2, \dots, x_n\).

     S
    /|\
   .....
 /        \
|   |   |  |
x1 x2 ... xn

At each intermediate node, we have:

      X
    / | \
  z1 ... zk

where \(X:z_1\dots z_k\) is a production in \(G\).

The top-down parsing grows the tree from the root and downwards to the leaf nodes.

         S
       / | \
      .  .  .
      ?  ?  ?
     /   |   \
    x0 ...... xn

1.1 Parsing strategy

We need a way to select a production from \(G\):

\[ \mathrm{select} : (X, \mathrm{context}) \to (X:\alpha) \]

where:

  • \(X\in\Sigma_1\) is the leaf to expand.
  • \(\mathrm{context}\) is the context information (more on this later) that can help us to choose the production.
  • \(X:\alpha\) is the production to be used to expand \(X\).
        S
      / | \
     .  .  .
    /   |   \
        X  <--------- how do we expand X
       /|\
        ?

1.2 Left-most expansion

Let’s first define what a frontier is of a tree:

Definition: frontier

Given a tree \(T\), its frontier \(\mathrm{frontier}(F)\) is a list of its leaf nodes.

For example:

          S
        / | \
       A  B  A
     / |  |
    x  y  A
         / \
        B   y

The frontier would be:

\[ \mathrm{frontier}(T) = [x, y, B, y, A] \]

Heuristics in growing parse tree

We always the grow an incomplete parse tree by expanding the left most syntactic variable in the its frontier: \[ X = \mathrm{leftmost}(\mathrm{frontier}(T)) \]

2 First and Follow sets

The first and follow sets help us figure out what tokens a syntactic variable can derive, and this will help us to select productions.

2.1 First sets of strings

First sets of strings

The \(\mathrm{first}\) is a function that maps strings in \((\Sigma_1\cup\Sigma_0)^*\) to sets of terminals.

\[ \mathrm{first}:\Sigma^*\to\mathbf{SET}(\Sigma_0) \]

\[ \mathrm{first}(\alpha) \simeq \{x[0] : L(\alpha)\} \]

The intuition is that \(\mathrm{first}(\alpha)\) is all possible terminals that can be the first symbol in \(L(x)\).

  • What if \(\alpha\Rightarrow \epsilon\)?
  • It needs some special treatment because \(\epsilon[0]\) is undefined.

We will revise the definition to handle this.

First sets (extended)

\[ \mathrm{first}:\Sigma^*\to\mathbf{SET}(\Sigma_0\cup\{\epsilon\}) \]

\[ \mathrm{first}(\alpha) = \left\{\begin{array}{ll} \{x[0] : L(\alpha)\}\cup\{\epsilon\} &\mathrm{if}\ \alpha\Rightarrow\epsilon \\ \{x[0] : L(\alpha)\} &\mathrm{else} \end{array}\right. \]

2.2 Follow-sets of syntactic variables

Follow-sets of syntactic variables

The \(\mathrm{follow}\) function maps syntactic variables to sets of terminals.

\[ \mathrm{follow} : \Sigma_1 \to \mathbf{SET}(\Sigma_0) \]

Given a syntactic variable \(Z\in\Sigma_1\). If \(S\Rightarrow \alpha {\color{red} Z a}\beta\) for some \(\alpha, \beta\), and \(a\in\Sigma_0\), then \(a\in\mathrm{follow}(Z)\).

The intuition is that \(\mathrm{follow}(Z)\) is all possible terminals that can follow \(Z\) in a sentential form of the grammar \(G\).

  • What if \(S\Rightarrow\alpha Z\)?
  • It needs special attention because nothing follows \(Z\).
Follow-sets (extended)

\[ \mathrm{follow} : \Sigma_1 \to \mathbf{SET}(\Sigma_0\cup\{\$\}) \]

Given a syntactic variable \(Z\in\Sigma_1\).

  • If \(S\Rightarrow \alpha {\color{red} Z a}\beta\) for some \(\alpha, \beta\), and \(a\in\Sigma_0\), then \(a\in\mathrm{follow}(Z)\).

  • If \(S\Rightarrow \alpha {\color{red} Z}\), then \(\$\in\mathrm{follow}(Z)\)

3 Algorithms to compute \(\mathrm{first}\) and \(\mathrm{follow}\)

3.1 Computing \(\mathrm{FirstSet}\) for a single symbol

\begin{algorithm} \caption{Computing the FIRST sets} \begin{algorithmic} \Require{$X\in\Sigma_1\cup\Sigma_0$} \Procedure{FirstSet}{$X$} \State $F\gets\emptyset$ \IF{$X\in\Sigma_0$} \STATE{$F\gets \{X\}$} \ELSE \FORALL{$X:Y_1 Y_2 \dots Y_k\in G$} \IF{$k = 0$} \STATE $F\gets F\cup\{\epsilon\}$ \ELSE \FORALL{$i\in[1, k]$} \STATE $F_y = \mathrm{FIRSTSET}(Y_i)$ \STATE $F\gets F\cup({\color{red} F_y-\{\epsilon\}})$ \IF{$\epsilon\not\in F_y$} \BREAK \ENDIF \ENDFOR \IF {$i = k$} \STATE $F\gets F\cup\{\epsilon\}$ \ENDIF \ENDIF \ENDFOR \ENDIF \EndProcedure \end{algorithmic} \end{algorithm}
  • What is the termination condition?
  • The algorithm terminates if and only if the grammar is not left-recursive.

3.2 \(\mathrm{FirstSet}\) for strings

\begin{algorithm} \caption{Computing the FIRST sets} \begin{algorithmic} \Require{$\alpha\in(\Sigma_1\cup\Sigma_0)^*$} \Procedure{FirstSet}{$\alpha$} \STATE let $Y_1 Y_2 \dots Y_n = \alpha$ \STATE $F\gets\emptyset$ \FOR{$i\in[1\to n]$} \STATE $F_y = \mathrm{FIRSTSET}(Y_i)$ \STATE $F\gets F\cup(F_y - \{\epsilon\})$ \IF{$\epsilon\not\in F_y$} \BREAK \ENDIF \ENDFOR \IF{$i = n$} \STATE $F\gets F\cup\{\epsilon\}$ \ENDIF \EndProcedure \end{algorithmic} \end{algorithm}

3.3 Computing \(\mathrm{FOLLOW}\) Sets

The follow-sets are computed by an iterative algorithm that uses the \(\mathrm{FIRSTSET}\) sets. The algorithm computes all FOLLOW-sets simultaneously.

\begin{algorithm} \caption{Computing the FOLLOW sets} \begin{algorithmic} \Require{$G$ is a set of productions} \Require{$S$ is the start symbol} \Procedure{FollowSets}{$G, S$} \STATE $F\gets\mathrm{empty\ HashMap}\left<\Sigma_1\to\mathrm{SET}(\Sigma_0)\right>$ \STATE $F[S]\gets\{\$\}$ \REPEAT \FORALL{$X:Y_1 Y_2\dots Y_k\in G$} \IF{$Y_i\in\Sigma_1$} \STATE \COMMENT{Update the follow-set of $Y_i$} \STATE $F'[Y_i]\gets F[Y_i]\cup\mathrm{FirstSet}(Y_{i+1}\dots Y_k)$ \IF{$i = k$ or $\epsilon\in\mathrm{FirstSet}(Y_{i+1}\dots Y_k)$} \STATE $F'[Y_i]\gets F[Y_i]\cup F[X]$ \ENDIF \ENDIF \ENDFOR \IF{$F' = F$} \BREAK \ELSE \STATE $F\gets F'$ \ENDIF \UNTIL{} \EndProcedure \end{algorithmic} \end{algorithm}