Introduction to Context Free Grammar
1 Context Free Grammar
1.1 Symbols
We distinguish the alphabet into two types of symbols:
- Terminals: \(\Sigma_0\) - they are the tokens.
- Nonterminals: \(\Sigma_1\) - they are also called syntactic variables.
The alphabet is \(\Sigma = \Sigma_0\cup\Sigma_1\).
1.2 Start symbol
There is a special syntactic variable \(S\in\Sigma_1\) referred to ask the start symbol.
1.3 Production rules
A production is of the form:
\[ A:\alpha \]
where:
- \(A\in\Sigma_1\) - this is called the head of the production.
- \(\alpha\in(\Sigma_0\cup\Sigma_1)^*\) - this is called the body of the production.
1.4 Context free grammar (CFG)
A CFG is a collection of productions and a start symbol.
\[ G = \left<S, \{A_i:\alpha_i\}_{i\leq n}\right> \]
where
- \(S\in\Sigma_1\) is the start symbol.
- \(A_i\in\Sigma_1\) are the head of the productions.
- \(\alpha_i\in\Sigma^*\) are the bodies of the productions.
Note: \(\alpha_i\) can be empty \(\epsilon\) for some \(i\).
Example
\[ \begin{eqnarray} E &:& ab \\ E &:& aEb \end{eqnarray} \]
Here, there is one syntactic variable \(E\), which is also the starting symbol. The two productions are:
- \(E: ab\), where the head is \(E\) and the body \(ab\).
- \(E: aEb\), where the head is \(E\), and the body \(aEb\).
Notation
When multiple productions share the same head, we call them alternations of the head symbol. We can group the alternations using the following notation:
\[ \begin{eqnarray} E &:& ab\ |\ aEb \end{eqnarray} \]
2 Derivation
2.1 Intuition
- Productions are describing how the head can be replaced by its body.
- Given a string over both syntactic variables and terminals, the objective is to replace syntactic variables using some productions.
2.2 Derivation
Given a string \(x=x_1 x_2 \dots x_n\) where \(x_i\in\Sigma_1\cup\Sigma_0\). We can perform derivation using a production of the form: \[ p_i = x_i:\alpha \] by replacing symbol \(x_i\) in \(x\) with the body \(\alpha\).
\[ x \underset{p,i}{\longrightarrow} x_1x_2\dots x_{i-1}{\color{red}\alpha} x_{i+1} x_{i+2}\dots x_n \]
2.3 An example
\(E \underset{E:aEb}{\longrightarrow} {\color{red}aEb}\)
\(aEb \underset{E:ab}{\longrightarrow} a{\color{red}ab}b\)
\(aEb \underset{E:aEb}{\longrightarrow} a{\color{red}aEb}b\)
2.4 More on derivation
We generalize the one step derivation \(\rightarrow\) to more steps.
- \(x\Rightarrow x\)
- If \(x\Rightarrow y\) and \(y\underset{p,i}{\longrightarrow}z\), then \(x\Rightarrow z\).
Therefore, in the example above, we have:
\[ E\Rightarrow aaEbb \]
3 Defining a language using CFG
Consider a string \(\alpha\in(\Sigma_0\cup\Sigma_1)^*\), and some CFG with start symbol \(S\).
\(\alpha\) is a sentential form of \(S\) if \(S\Rightarrow\alpha\).
\(\alpha\) is a sentence of \(S\) if it is a sentential form of \(S\) and only contains terminal symbols: \[S\Rightarrow\alpha\ \mathrm{and}\ \alpha\in\Sigma_0^*\]
\(L(S)\) is all the sentences of \(S\).
Example
\[L(E) = \{a^n b^n: n\geq 1\}\]
Thus, \(L(E)\) is not regular.
4 Parse Trees
Grammar is the syntax
Strings are the programs
Parse trees are the interpretations
4.1 Example
Consider the grammar \(G\) given as:
\[ \begin{eqnarray} S &:& aSSb \\ &|& 0 \\ &|& 1 \end{eqnarray} \]
It’s easy to see that \(a01b\in L(S)\). The derivation is:
\[ \begin{eqnarray} && S \\ &\Rightarrow& aSSb \\ &\Rightarrow& a0Sb \\ &\Rightarrow& a01b \end{eqnarray} \]
We can represent the productions applied as a tree.
S|
+----+--+--+----+
| | | |
| S S |
| | | |
'a' '0' '1' 'b'
- The root of the tree is the start symbol.
- The intermediate nodes are the syntactic variables.
- The leaf nodes are the terminals (aka tokens).
- Each parent node is a head of the production, and its children are the body of the production.
4.2 Case study: arithmetics
Let’s consider a grammar that defines the language for arithmetic expressions over tokens of type Number
with +
and *
operators.
Here are some examples:
0
1 + 4 * 8
8 * 3 + 5
8 * 3 + 5 + 8 * 3 + 5 + 8 * 3 + 5
- The digits will converted into tokens
Number
by the lexical analyzer. - Thus, the grammar will only need to consider tokens of type
Number
instead of the lexemes.
We can define the grammar as follows:
Tokens
\(\Sigma_0 = \{{\color{red}\mathrm{+}}, {\color{red}\mathrm{*}}, {\color{red}\mathrm{Number}}\}\)
Productions
\[ \begin{eqnarray} E &:& \mathrm{Number} \\ &|& E + E \\ &|& E * E \end{eqnarray} \]
Derivation of 1 + 4 * 8
:
\[ \begin{eqnarray} E &\to& E + E \\ &\to& 1:\mathrm{Number} + E \\ &\to& 1:\mathrm{Number} + E * E \\ &\to& 1:\mathrm{Number} + 4:\mathrm{Number} * E \\ &\to& 1:\mathrm{Number} + 4:\mathrm{Number} * 8:\mathrm{Number} \\ \end{eqnarray} \]
A Preview of ambiguity
Consider 1 + 4 * 8
. Do we add then multiply, or multiply then add?
Let’s examine the parse tree:
\[ \begin{eqnarray} E &\to& E + E \\ &\to& 1 + E \\ &\to& 1 + E * E \\ &\to& 1 + 4 * E \\ &\to& 1 + 4 * 8 \\ \end{eqnarray} \]
E/ | \
'+' E
E | / | \
'1' E '*' E
| |
'4' '8'
But this is not the only possible derivation.
\[ \begin{eqnarray} E &\to& {\color{red}E * E} \\ &\to& {\color{red}E + E} * E \\ &\to& {\color{red}1} + E * E \\ &\to& 1 + {\color{red}4} * E \\ &\to& 1 + 4 * {\color{red}8} \\ \end{eqnarray} \]
E/ | \
'*' E
E / | \ |
'+' E '8'
E | |
'1' '4'
5 Conclusion
5.1 Summary
Context free grammars define languages.
CFG is strictly more powerful than RE.
Derivations are proofs that a string is accepted by the grammar.
Derivations can be expressed as parse trees.
5.2 What’s to come?
Ambiguity in programming languages
Grammar rewrites
More grammar transformations