Introduction to Context Free Grammar

Author

Ken Pu

1 Context Free Grammar

1.1 Symbols

We distinguish the alphabet into two types of symbols:

  • Terminals: \(\Sigma_0\) - they are the tokens.
  • Nonterminals: \(\Sigma_1\) - they are also called syntactic variables.

The alphabet is \(\Sigma = \Sigma_0\cup\Sigma_1\).

1.2 Start symbol

There is a special syntactic variable \(S\in\Sigma_1\) referred to ask the start symbol.

1.3 Production rules

A production is of the form:

\[ A:\alpha \]

where:

  • \(A\in\Sigma_1\) - this is called the head of the production.
  • \(\alpha\in(\Sigma_0\cup\Sigma_1)^*\) - this is called the body of the production.

1.4 Context free grammar (CFG)

A CFG is a collection of productions and a start symbol.

\[ G = \left<S, \{A_i:\alpha_i\}_{i\leq n}\right> \]

where

  • \(S\in\Sigma_1\) is the start symbol.
  • \(A_i\in\Sigma_1\) are the head of the productions.
  • \(\alpha_i\in\Sigma^*\) are the bodies of the productions.

Note: \(\alpha_i\) can be empty \(\epsilon\) for some \(i\).

\(G\) defines a language \(L(G)\subseteq\Sigma_0^*\).

Example

\[ \begin{eqnarray} E &:& ab \\ E &:& aEb \end{eqnarray} \]

Here, there is one syntactic variable \(E\), which is also the starting symbol. The two productions are:

  • \(E: ab\), where the head is \(E\) and the body \(ab\).
  • \(E: aEb\), where the head is \(E\), and the body \(aEb\).

Notation

When multiple productions share the same head, we call them alternations of the head symbol. We can group the alternations using the following notation:

\[ \begin{eqnarray} E &:& ab\ |\ aEb \end{eqnarray} \]

2 Derivation

2.1 Intuition

  • Productions are describing how the head can be replaced by its body.
  • Given a string over both syntactic variables and terminals, the objective is to replace syntactic variables using some productions.

2.2 Derivation

Given a string \(x=x_1 x_2 \dots x_n\) where \(x_i\in\Sigma_1\cup\Sigma_0\). We can perform derivation using a production of the form: \[ p_i = x_i:\alpha \] by replacing symbol \(x_i\) in \(x\) with the body \(\alpha\).

\[ x \underset{p,i}{\longrightarrow} x_1x_2\dots x_{i-1}{\color{red}\alpha} x_{i+1} x_{i+2}\dots x_n \]

2.3 An example

  • \(E \underset{E:aEb}{\longrightarrow} {\color{red}aEb}\)

  • \(aEb \underset{E:ab}{\longrightarrow} a{\color{red}ab}b\)

  • \(aEb \underset{E:aEb}{\longrightarrow} a{\color{red}aEb}b\)

2.4 More on derivation

We generalize the one step derivation \(\rightarrow\) to more steps.

  • \(x\Rightarrow x\)
  • If \(x\Rightarrow y\) and \(y\underset{p,i}{\longrightarrow}z\), then \(x\Rightarrow z\).

Therefore, in the example above, we have:

\[ E\Rightarrow aaEbb \]

3 Defining a language using CFG

Definition

Consider a string \(\alpha\in(\Sigma_0\cup\Sigma_1)^*\), and some CFG with start symbol \(S\).

  • \(\alpha\) is a sentential form of \(S\) if \(S\Rightarrow\alpha\).

  • \(\alpha\) is a sentence of \(S\) if it is a sentential form of \(S\) and only contains terminal symbols: \[S\Rightarrow\alpha\ \mathrm{and}\ \alpha\in\Sigma_0^*\]

  • \(L(S)\) is all the sentences of \(S\).

Example

\[L(E) = \{a^n b^n: n\geq 1\}\]

Thus, \(L(E)\) is not regular.

4 Parse Trees

  • Grammar is the syntax

  • Strings are the programs

  • Parse trees are the interpretations

4.1 Example

Consider the grammar \(G\) given as:

\[ \begin{eqnarray} S &:& aSSb \\ &|& 0 \\ &|& 1 \end{eqnarray} \]

It’s easy to see that \(a01b\in L(S)\). The derivation is:

\[ \begin{eqnarray} && S \\ &\Rightarrow& aSSb \\ &\Rightarrow& a0Sb \\ &\Rightarrow& a01b \end{eqnarray} \]

We can represent the productions applied as a tree.

                S
                |
        +----+--+--+----+
        |    |     |    |
        |    S     S    |
        |    |     |    |
       'a'  '0'   '1'  'b'
  • The root of the tree is the start symbol.
  • The intermediate nodes are the syntactic variables.
  • The leaf nodes are the terminals (aka tokens).
  • Each parent node is a head of the production, and its children are the body of the production.

4.2 Case study: arithmetics

Let’s consider a grammar that defines the language for arithmetic expressions over tokens of type Number with + and * operators.

Here are some examples:

  • 0
  • 1 + 4 * 8
  • 8 * 3 + 5
  • 8 * 3 + 5 + 8 * 3 + 5 + 8 * 3 + 5
  • The digits will converted into tokens Number by the lexical analyzer.
  • Thus, the grammar will only need to consider tokens of type Number instead of the lexemes.

We can define the grammar as follows:

Tokens

\(\Sigma_0 = \{{\color{red}\mathrm{+}}, {\color{red}\mathrm{*}}, {\color{red}\mathrm{Number}}\}\)

Productions

\[ \begin{eqnarray} E &:& \mathrm{Number} \\ &|& E + E \\ &|& E * E \end{eqnarray} \]

Derivation of 1 + 4 * 8:

\[ \begin{eqnarray} E &\to& E + E \\ &\to& 1:\mathrm{Number} + E \\ &\to& 1:\mathrm{Number} + E * E \\ &\to& 1:\mathrm{Number} + 4:\mathrm{Number} * E \\ &\to& 1:\mathrm{Number} + 4:\mathrm{Number} * 8:\mathrm{Number} \\ \end{eqnarray} \]

A Preview of ambiguity

Consider 1 + 4 * 8. Do we add then multiply, or multiply then add?

Let’s examine the parse tree:

\[ \begin{eqnarray} E &\to& E + E \\ &\to& 1 + E \\ &\to& 1 + E * E \\ &\to& 1 + 4 * E \\ &\to& 1 + 4 * 8 \\ \end{eqnarray} \]

         E
       / | \
      E '+' E
      |   / | \
     '1' E '*' E
         |     |
        '4'   '8'

But this is not the only possible derivation.

\[ \begin{eqnarray} E &\to& {\color{red}E * E} \\ &\to& {\color{red}E + E} * E \\ &\to& {\color{red}1} + E * E \\ &\to& 1 + {\color{red}4} * E \\ &\to& 1 + 4 * {\color{red}8} \\ \end{eqnarray} \]

         E
       / | \
      E '*' E
    / | \   |
   E '+' E '8'
   |     |
  '1'   '4'

5 Conclusion

5.1 Summary

  • Context free grammars define languages.

  • CFG is strictly more powerful than RE.

  • Derivations are proofs that a string is accepted by the grammar.

  • Derivations can be expressed as parse trees.

5.2 What’s to come?

  • Ambiguity in programming languages

  • Grammar rewrites

  • More grammar transformations