# Learning Compiler 5

## Note

### The Functionality of the Parser

**Input**: sequence of tokens from lexer**Output**: parse tree of the program

### Comparison with Lexical Analysis

Phase | Input | Output |
---|---|---|

Lexer | String of characters | String of tokens |

Parser | String of token | Parse tree |

### The Role of the Parser

- Not all strings of tokens are programs
- Parser must distinguish between valid and invalid strings of tokens

So we need:

- A language for describing valid strings of tokens
- A method for distinguishing valid from invalid strings of tokens

### Context-Free Grammars(CFG)

- Programming language constructs have recursive structure
- Context-free grammars are a natural notation for the recursive structure

A CFG consists of:

- A set of terminals $T$
- A set of non-terminals $N$
- A start symbol $S$ (a non-terminal)
- A set of productions $X \rightarrow Y_1Y_2\dots Y_n$, where $X \in N$ and $Y_i \in T \cup N \cup \{\epsilon\}$.

### Examples of CFGs

E \rightarrow E * E

| E + E

| (E)

| id

### Key Idea

- Begin with a string consisting of the start symbol “S”
- Replace any non-terminal $X$ in the string by a the right-hand side of some production $X \rightarrow Y_1Y_2\dots Y_n$
- Repeat (2) until there are no non-terminals in the string

### The Language of a CFG

Read productions as rules:

$X \rightarrow Y_1Y_2\dots Y_n$means $X$ can be replaced by $Y_1Y_2\dots Y_n$

More formally, write

$X_1\dots X_{i-1}X_i X_{i+1}\dots X_n \rightarrow X_1\dots X_{i-1} Y_1\dots Y_m X_{i+1}\dots X_n$if there is a production

$X_i \rightarrow Y_1Y_2\dots Y_m$Write

$X_1\dots X_n \rightarrow *Y_1\dots Y_m$if

$X_1\dots X_n \rightarrow \dots \rightarrow \dots \rightarrow Y_1\dots Y_m$in 0 or more steps

Let $G$ be a context-free grammer with start symbol $S$. Then the language of $G$ is

$\{a_1\dots a_n | S \rightarrow *a_1\dots a_n \text{and every} a_i \text{is a terminal}\}$### Terminals

- Terminals are so-called because there are no rules for replacing them
- Once generated, terminals are permanent
- Terminals ought to be tokens of the language

### Notes

The idea of a CFG is a big step. But:

- Membership in a language is “yes” or “no”
- We also need a parse tree of the input

- Must handle errors gracefully
- Need an implementation of CFG’s (e.g. bison)
- Form of the grammar is important
- Many grammars generate the same language
- Tools are sensitive to the grammar

(Tools for regular languages (e.g., flex) are sensitive to the form of the regular expression, but this is rarely a problem in practice)

### Derivations and Parse Trees

A derivation is a sequence of productions

$S \rightarrow \dots \rightarrow \dots \dots$A derivation can be drawn as a tree

- Start symbol is the tree’s root
- For a production $X \rightarrow Y_1 \dots Y_n$ add children $Y_1 \dots Y_n$ to node $X$

Letf-most and right-most derivations have the same parse tree, the difference is the order in which branches are added.

### Summary of Derivations

We are not just interested in whether s c L(G)

- We need a parse tree for $s \in L(G)$
- A derivation defines a parse tree

- But one parse tree may have many derivations
- Left-most and right-most derivations are important in parser implementation

### Ambiguity

- A grammar is ambiguous if it has more than one parse tree for some string
- Equivalently, there is more than one right-most or left-most derivation for some string

- Ambiguity is BAD
- Leaves meaning of some programs ill-defined

### Dealing with Ambiguity

- There are several ways to handle ambiguity
- Most direct method is to rewrite grammar unambiguously
- Enforces precedence of * over +

cases:

Ambiguity in Arithmetic Expressions: E -> E + E | E * E | ( E ) | int

The Dangling Else: E -> if E then E | if E then E else E | OTHER

- Fix: else matches the closest unmatched then

No general techniques for handling ambiguity

Impossible to convert automatically an ambiguous grammar to an unambiguous one

- Used with care, ambiguity can simplify the grammar
- Sometimes allows more natural definitions – We need disambiguation mechanisms

### Precedence and Associativity Declarations

- Instead of rewriting the grammar
- Use the more natural (ambiguous) grammar
- Along with disambiguating declarations

- Most tools allow precedence and associativity declarations to disambiguate grammars
- Examples:
- Left associativity declaration: %left +
- Precedence declarations: %left +, %left *

- Examples:

## Reference

Slides: Introduction to Parsing

Video: Compiler Stanford 2014