CS362 2004S : Class 20: Shift-Reduce Parsing (1)

Held: Friday, 5 March 2004

Summary: Today we consider an alternate form of parsing, shift-reduce parsing. Shift-reduce parsing can be applied to many more grammars than can predictive parsing.

Related Pages:

Notes:

Any thoughts on yesterday's lab?

Any questions on phase 2?

Overview:

Introduction to shift-reduce parsing

A shift-reduce parser for a non-predictive language

A shift-reduce parser for a simple expression language

LR(0) parsers: Basic shift-reduce parsers

Shift-Reduce Parsing

Note also that recursive descent parsers are, in effect, top-down (you start with the start symbol and attempt to derive the string).

We can gain some power by starting at the bottom and working our way up.

The most common bottom-up parsers are the so-called shift-reduce parsers. These parsers examine the input tokens and either shift them onto a stack or reduce the top of the stack, replacing a right-hand side by a left-hand side.

Think about the sample grammar for aⁿbⁿ|aⁿcⁿ, which we write as

s ::= b
   |  c
b ::= A b B
b ::= epsilon
c ::= A c C
c ::= epsilon

Here's a procedure for matching the language

If you see an A with nothing on the stack, shift A onto the stack.
If you see an A with A on the stack, shift A onto the stack.
If you see a B with an A on top of the stack, reduce the epsilon on top of the stack to b (which goes back on the stack)
If you see a B with [A b] on top of the stack, shift the B onto the stack and then replace the top three elements of the stack by b.
If you see a C with an A on top of the stack, reduce the epsilon on top of the stack to c (which goes back on the stack)
If you see a c with [A c] on top of the stack, shift the C onto the stack and then replace the top three elements of the stack by c.
If you hit the end of the string with b on top of the stack, reduce it to s.
If you hit the end of the string with c on top of the stack, reduce it to s.
If you hit the end of the string with s the only value on the stack, accept.
In any other case, crash and burn.

Here's a set of helpful rules for expressions

(input: NUM, stack: anything) -> shift
(input: ADDOP, stack: NUM) -> reduce NUM to factor
(input: ADDOP, stack: term MULOP factor) -> reduce term MULOP factor to term
(input: ADDOP, stack: factor) -> reduce factor to term
(input: ADDOP, stack: expression ADDOP term) -> reduce expression ADDOP term to term
(input: ADDOP, stack: term) -> reduce term to expression
(input: ADDOP, stack: expression) -> shift
(input: MULOP, stack: NUM) -> reduce NUM to factor
(input: MULOP, stack: term MULOP factor) -> reduce term mulop factor to term
(input: MULOP, stack: factor) -> reduce factor to term
(input: MULOP, stack: term) -> shift
There's some related stuff for end-of-input not shown.

Here's an attempt to parse 1 + 2 + 3 * 4 + 5 with that set of rules

Starting state
- Input: [1 + 2 + 3 * 4 + 5]
- Stack: []
We shift 1 onto the stack:
- Input: [+ 2 + 3 * 4 + 5]
- Stack: [NUM(1)]
We reduce the 1 on the stack to a factor
- Input: [+ 2 + 3 * 4 + 5]
- Stack: [factor(1)]
We reduce the factor on the stack to a term
- Input: [+ 2 + 3 * 4 + 5]
- Stack: [term(1)]
We reduce the term on the stack to an expression
- Input: [+ 2 + 3 * 4 + 5]
- Stack: [expression(1)]
We shift the + onto the stack
- Input: [2 + 3 * 4 + 5]
- Stack: [expression(1) ADDOP(+)]
We shift 2 onto the stack:
- Input: [+ 3 * 4 + 5]
- Stack: [expression(1) ADDOP(+) NUM(2)]
We reduce the 2 on the stack to a factor:
- Input: [+ 3 * 4 + 5]
- Stack: [expression(1) ADDOP(+) factor(2)]
We reduce the factor on the stack to a term:
- Input: [+ 3 * 4 + 5]
- Stack: [expression(1) ADDOP(+) term(2)]
We reduce the expression ADDOP term on the stack to an expression:
- Input: [+ 3 * 4 + 5]
- Stack: [expression(1+2)]
We shift the plus onto the stack
- Input: [3 * 4 + 5]
- Stack: [expression(1+2) ADDOP(+)]
We shift the 3 onto the stack:
- Input: [* 4 + 5]
- Stack: [expression(1+2) ADDOP(+) NUM(3)]
We reduce the 3 to a factor:
- Input: [* 4 + 5]
- Stack: [expression(1+2) ADDOP(+) factor(3)]
We reduce the Factor to a Term:
- Input: [* 4 + 5]
- Stack: [expression(1+2) ADDOP(+) term(3)]
We shift the multiplication symbol onto the stack:
- Input: [4 + 5]
- Stack: [expression(1+2) ADDOP(+) term(3) MULOP(*)]
We shift the 4 onto the stack:
- Input: [+ 5]
- Stack: [expression(1+2) ADDOP(+) term(3) MULOP(*) NUM(4)]
We reduce the 4 to a factor:
- Input: [+ 5]
- Stack: [expression(1+2) ADDOP(+) term(3) MULOP(*) factor(4)]
We reduce the term MULOP factor to a term:
- Input: [+ 5]
- Stack: [expression(1+2) ADDOP(+) term(3*4)]
We reduce the expression PLUS term to an Expression:
- Input: [+ 5]
- Stack: [expression((1+2)+(3*4))]
We shift the plus onto the stack:
- Input: []
- Stack: [expression((1+2)+(3*4)) ADDOP(+)]
We shift the number onto the stack
- Input: []
- Stack: [expression((1+2)+(3*4)) ADDOP(+) NUM(5)]
We reduce the 5 to a Factor:
- Input: []
- Stack: [expression((1+2)+(3*4)) ADDOP(+) factor(5)]
We reduce the Factor to a Term:
- Input: []
- Stack: [expression((1+2)+(3*4)) ADDOP(+) term(5)]
We reduce the espression ADDOP term to an expression:
- Input: []
- Stack: [expression(((1+2)+(3*4))+5)]
We reduce the expression to the start symbol.

Note that it seems that we made different decisions in the same context. But the context is not the same. Decisions are based on what is on the stack and what the next input tokens are.

In some ways, these parsers can be viewed as finite automata that use a stack (also known as push-down automata).

Shift reduce parsing is traditionally done with LR(k) parsers. The first L stands for left-to-right traversal of the input, the next R stands for rightmost derivation and the k stands for number of characters of lookahead.

You should be able to understand the traversal and lookahead.
The rightmost refers to the types of derivations that the grammar represents. In a rightmost top-down derivation from the start symbol, you always replace the rightmost nonterminal.
While LR parsers are bottom up, they simulate rightmost top-down derivations

LR(0) Automata

The simplest LR parsers are based on LR automata with no lookahead.

Wait! You may say, How can we build a parser with no lookahead?
It turns out the the 0 refers to the lookahead used in constructing the automata, not (in this case) to the lookahead in running them.

The states of all LR parsers are sets of extended productions (also called items), representing the states of all possible parses of the input (more or less)

Each production is extended with a position (where in the parse we may be)
Each production is may be extended with lookahead symbols (none in LR(0) automata).

We begin building an LR(0) parser by augmenting the grammar with a simple rule in which we add an end of input ($) to the start symbol.

S' ::= S $

The initial state of an LR(0) parser begins with

S' ::= . S $

This means when we begin parsing, we are ready to match an S and the end-of-input symbol

We then augment this with the other things that we might match on our way to building an S and end of input. What are those things? Anything that might build us an S. What have we seen of those things? Nothing, yet.

For our expression grammar, we might write

S' ::= . E $
E ::= . E + T
E ::= . E - T
E ::= . T

Of course, in this case, if we're waiting to see a T, then we also need to add the T rules (and then the F rules).

S' ::= . E $
E ::= . E + T
E ::= . E - T
E ::= . T
T ::= . T mulop F
T ::= . F
F ::= . id
F ::= . num
F ::= . ( E )

We make new states in the automaton by choosing a symbol (terminal or nonterminal) and advancing the here mark (period) over that symbol in all rules, and then filling in the rest.

For example, if we see an E in state 0, we could be

S' ::= E . $
E ::= E . + T
E ::= E . - T

If we see a plus sign in that state, we could only be making progress on one rule, giving us

E ::= E + . T

But now, we're ready to see a T, so we need to fill in all the items that say ready to see a T

E ::= E + . T
T ::= . T mulop F
T ::= . F
F ::= . id
F ::= . num
F ::= . ( E )

If we do indeed see a T, we advance the here mark and get to

E ::= E + T .
T ::= T . mulop F

The first part suggests that we may be at the end of an E. The second suggests that we may be in the midst of a T. How do we decide which it is? By context (and a little lookahead in some cases).

[Skip to Body]
Primary: [Front Door] [Current] [Glance] - [Honesty] [Instructions] [Links] [Search]
Groupings: [EBoards] [Examples] [Exams] [Handouts] [Homework] [Labs] [Outlines] [Project] [Readings] [Reference]
Misc: [2001S] [2002F] [SamR]

Disclaimer: I usually create these pages on the fly, which means that I rarely proofread them and they may contain bad grammar and incorrect details. It also means that I tend to update them regularly (see the history for more details). Feel free to contact me with any suggestions for changes.

This document was generated by Siteweaver on Wed May 5 11:47:05 2004.
The source to the document was last modified on Tue Jan 20 23:06:46 2004.
This document may be found at http://www.cs.grinnell.edu/~rebelsky/Courses/CS362/2004S/Outlines/outline.20.html.

You may wish to validate this document's HTML ; ; Check with Bobby

Samuel A. Rebelsky, rebelsky@grinnell.edu