# Monadic Parsing

Associated Reading: RWH chapters 10 and 16

```
module Parsers where
import Data.Char
import Control.Monad
```

## What is a Parser?

A parser is a piece of software that takes a raw `String`

(or sequence of bytes) and returns some structured object, for example, a list of options, an XML tree or JSON object, a program's Abstract Syntax Tree and so on. Parsing is one of the most basic computational tasks. *Every* serious software system has a parser tucked away somewhere inside, for example

- Shell Scripts (command-line options)
- Web Browsers (duh!)
- Games (level descriptors)
- Routers (packets)
- etc

(Indeed I defy you to find any serious system that *does not* do some parsing somewhere!)

The simplest and most accurate way to think of a parser is as a function

`type Parser = String -> StructuredObject`

## Composing Parsers

The usual way to build a parser is by specifying a grammar and using a parser generator (eg yacc, bison, antlr) to create the actual parsing function. While elegant, one major limitation of the grammar based approach is its lack of modularity. For example, suppose I have two kinds of primitive values `Thingy`

and `Whatsit`

.

```
Thingy : rule { action }
;
Whatsit : rule { action }
;
```

If you want a parser for *sequences of* `Thingy`

and `Whatsit`

we have to painstakingly duplicate the rules as

```
Thingies : Thingy Thingies { ... }
EmptyThingy { ... }
;
Whatsits : Whatsit Whatsits { ... }
EmptyWhatsit { ... }
;
```

This makes sub-parsers hard to reuse. Next, we will see how to *compose* mini-parsers for sub-values to get bigger parsers for complex values.

To do so, we will generalize the above parser type a little bit, by noting that a (sub-)parser need not (indeed, will not) consume *all* of its input, and so we can simply have the parser return the unconsumed input

`type Parser = String -> (StructuredObject, String) `

Of course, it would be silly to have different types for parsers for different kinds of objects, and so we can make it a parameterized type.

`type Parser a = String -> (a, String) `

One last generalization: the parser could return multiple results, for example, we may want to parse the string

`"2 + 3 + 4"`

either as

`Plus (Plus 2 3) 4`

or as

`Plus 2 (Plus 3 4)`

So, we can have our parsers return a *list* of possible results, where the empty list corresponds to a failure to parse.

`type Parser a = String -> [(a, String)]`

As the last step, we will wrap this type definition up in a `newtype`

, which is like a datatype that has only one constructor. This will make sure that we keep parsers distinct from other values of this type (and will allow us to make class instances).

`newtype Parser a = P (String -> [(a, String)])`

The above is simply the parser, which we will build up compositionally. The actual parsing is done when we use the parser and apply it to the input stream.

```
doParse :: Parser a -> String -> [(a,String)]
doParse (P p) s = p s
```

Let's build some parsers!

## Parse A Single character

Here's a *very* simple character parser, that returns the first `Char`

from a list if one exists

```
oneChar :: Parser Char
oneChar = P undefined
```

Let's run the parser

```
ghci> doParse oneChar "hey!"
ghci> doParse oneChar ""
```

## Parser Composition

Now we can write another parser that grabs a pair of `Char`

values

```
twoChar0 :: Parser (Char, Char)
twoChar0 = P undefined
```

Or, it is more general to write a *combinator* that takes two parsers and returns a new parser that returns a pair of values

```
pairP0 :: Parser a -> Parser b -> Parser (a,b)
pairP0 p1 p2 = P undefined
```

and use that to rewrite 'twoChar' more elegantly as

`twoChar = pairP0 oneChar oneChar `

which would run like this

```
ghci> doParse twoChar "hey!"
ghci> doParse twoChar ""
```

Now we could keep doing this, but often to go forward, it is helpful to step back and take a look at the bigger picture. Here's the *type* of a parser

`newtype Parser a = P (String -> [(a, String)])`

it should remind you of something else, remember this?

`newtype State s a = S (s -> (a, s))`

# Parser is A Monad

Indeed, a parser, like a state transformer, is a monad! if you squint just the right way. We need to define the `return`

and `>>=`

functions.

The first is very simple, we can let the types guide us. We must ignore the input string and just return the input element.

```
returnP :: a -> Parser a
returnP x = P undefined
```

The bind is a bit more tricky, but again, let's lean on the types:

```
bindP :: Parser a -> (a -> Parser b) -> Parser b
p1 `bindP` fp2 = P undefined
```

So, we need to suck the `a`

values out of the first parser and invoke the second parser with them on the remaining part of the string.

Armed with those, we can officially brand parsers as monads

```
instance Monad Parser where
(>>=) = bindP
return = returnP
```

# Parser Combinators

Since parsers are monads, we can write a bunch of high-level combinators for composing smaller parsers into bigger ones.

For example, we can use our beloved `do`

notation to rewrite the `pairP`

as

```
pairP :: Parser a -> Parser b -> Parser (a,b)
pairP p1 p2 = do x <- p1
y <- p2
return (x, y)
```

shockingly, exactly like the `pairs`

function we have seen before.

Next, let's flex our monadic parsing muscles and write some new parsers. It will be helpful to have a *failure* parser that always goes down in flames (returns `[]`

)

```
failP :: Parser a
failP = P undefined
```

Seems a little silly to write the above, but it's helpful to build up richer parsers like the following which parses a `Char`

*if* it satisfies a predicate `p`

```
satP :: (Char -> Bool) -> Parser Char
satP p = undefined
```

Note that we are working abstractly here, we've defined satP without using the "P" data constructor or "doParse". We are building up parsers programmatically, from smaller components.

With this, we can write some simple parsers for particular characters. The following parse alphabet and numeric characters respectively

```
alphaChar, digitChar :: Parser Char
alphaChar = satP isAlpha
digitChar = satP isDigit
```

```
ghci> doParse alphaChar "123"
ghci> doParse digitChar "123"
```

and this little fellow returns the first digit in a string as an `Int`

```
digitInt :: Parser Int
digitInt = do
c <- digitChar
return $ ord c - ord '0'
```

which works like so

```
ghci> doParse digitInt "92"
ghci> doParse digitInt "cat"
```

Finally, this parser will parse only a particular `Char`

passed in as input

```
char :: Char -> Parser Char
char c = undefined
```

`ghci> doParse (char 'a') "ab" `

## A Nondeterministic Choice Combinator

Next, let's write a combinator that takes two sub-parsers and nondeterministically chooses between them.

`chooseP :: Parser a -> Parser a -> Parser a`

How would we go about encoding *choice* in our parsers? Well, we want to return a succesful parse if *either* parser succeeds. Since our parsers return multiple values, we can simply return the *union* of all the results!

`p1 `chooseP` p2 = P undefined`

We can use the above combinator to build a parser that returns either an alphabet or a numeric character

`alphaNumChar = alphaChar `chooseP` digitChar`

When we run the above we get some rather interesting results

```
ghci> doParse alphaNumChar "cat"
ghci> doParse alphaNumChar "2cat"
ghci> doParse alphaNumChar "2at"
```

What is even nicer is that if *both* parsers succeed, you end up with all the results. For example, here's a parser that grabs `n`

characters from the input

```
grabn :: Int -> Parser String
grabn n | n <= 0 = return ""
| otherwise = do c <- oneChar
cs <- grabn (n-1)
return (c:cs)
```

(Challenge: can you remove the recursion from that?)

Now, we can use our choice combinator

`grab2or4 = grabn 2 `chooseP` grabn 4`

and now, we will get back *both* results if possible

`ghci> doParse grab2or4 "mickeymouse"`

and only one result if that's possible

```
ghci> doParse grab2or4 "mic"
ghci> doParse grab2or4 "m"
```

Even with the rudimentary parsers we have at our disposal, we can start doing some rather interesting things. For example, here is a little calculator. First, we parse the operation

```
intOp = plus `chooseP` minus `chooseP` times `chooseP` divide
where plus = char '+' >> return (+)
minus = char '-' >> return (-)
times = char '*' >> return (*)
divide = char '/' >> return div
```

(can you guess the type of the above parser?) And then we parse the expression

```
calc :: Parser Int
calc = do x <- digitInt
op <- intOp
y <- digitInt
return $ x `op` y
```

which, when run, will both parse and calculate

```
ghci> doParse calc "8/2"
ghci> doParse calc "8+2cat"
ghci> doParse calc "8/2cat"
ghci> doParse calc "8-2cat"
ghci> doParse calc "8*2cat"
```

## Recursive Parsing

To start parsing interesting things, we need to add recursion to our combinators. For example, it's all very well to parse individual characters (as in `char`

above) but it would a lot more swell if we could grab particular `String`

tokens.

Let's try to write it!

`string :: String -> Parser String`

```
string "" = return ""
string (c:cs) = do char c
string cs
return $ c:cs
```

Ewww! Is that explicit recursion ?! Let's try again (can you spot the pattern)

`string = undefined`

Much better!

```
ghci> doParse (string "mic") "mickeyMouse"
ghci> doParse (string "mic") "donald duck"
```

Ok, I guess that wasn't really recursive then after all! Let's try again.

## Parsing a list of Input

Let's write a combinator that takes a parser `p`

that returns an `a`

and returns a parser that returns *many* `a`

values. That is, it keeps grabbing as many `a`

values as it can and returns them as `[a]`

.

i.e. we want a parser that parses one thing, then calls itself recursively or always succeeds (consuming no input).

```
manyP :: Parser a -> Parser [a]
manyP p = liftM2 (:) p (manyP p) `chooseP` return []
```

```
{-
cons = do h <- p
tl <- (manyP p)
return (h : tl)
-}
```

Beware, the above can yield *many* results

`ghci> doParse (manyP digitInt) "123a" `

which is simply all the possible ways to extract sequences of integers at the beginning of an input string.

## Deterministic Maximal Parsing

Often we want a single result, not a set of results. For example, the more intuitive behavior of `many`

would be to return the maximal sequence of elements and not *all* the prefixes.

To do so, we need a *deterministic* choice combinator. This combinator returns no more than one result (i.e. it runs the choice parser but discards extra results).

```
(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = P undefined
```

Now, we can revisit the `manyP`

combinator and ensure that it returns a single, maximal sequence.

```
mmanyP :: Parser a -> Parser [a]
mmanyP p = undefined
```

`ghci> doParse (mmanyP digitInt) "123a" `

Let's use the above to write a parser that will return an entire natural number (not just a single digit.)

```
oneNat0 :: Parser Int
oneNat0 = do xs <- mmanyP digitChar
return $ ((read xs) :: Int)
```

*Aside*, can you spot the pattern above? We took the parser `mmanyP digitChar`

and simply converted its output using the `read`

function. This is a recurring theme, and the type of what we did gives us a clue

`(a -> b) -> Parser a -> Parser b`

Aha! a lot like `map`

. Indeed, there is a generalized version of `map`

that we have seen before (`liftM`

) and we bottle up the pattern by declaring `Parser`

to be an instance of the `Functor`

typeclass

```
instance Functor Parser where
fmap = undefined
```

after which we can rewrite

```
oneNat1 :: Parser Int
oneNat1 = read `fmap` mmanyP digitChar
```

Let's take it for a spin

```
ghci> doParse oneNat1 "123a"
ghci> doParse oneNat1 ""
```

One more twist, let's make sure that there is at least one digit before we do the read.

```
many1 :: Parser a -> Parser [a]
many1 p = undefined
```

```
oneNat :: Parser Int
oneNat = read `fmap` many1 digitChar
```

# Parsing XML

For a short digression, let's see a quick example that puts together what we know about parsers. In 15 mins, we'll build a parser for simple XML expressions.

See Xml.lhs

# Parsing Arithmetic Expressions

Now let's use the above to build a small calculator, that parses and evaluates arithmetic expressions. In essence, an expression is either binary operand applied to two sub-expressions or an integer. We can state this as

```
calc1 :: Parser Int
calc1 = binExp <|> oneNat
where binExp = do x <- oneNat
o <- intOp
y <- calc1
return $ x `o` y
```

This works pretty well!

```
ghci> doParse calc1 "1+2+33"
ghci> doParse calc1 "11+22-33"
```

but things get a bit strange with minus

`ghci> doParse calc1 "11+22-33+45"`

Huh? Well, if you look back at the code, you'll realize the above was parsed as

`11 + ( 22 - (33 + 45))`

because in each `binExp`

we require the left operand to be an integer. In other words, we are assuming that each operator is *right associative* hence the above result. Making this parser left associative is harder than it looks---we can't just swap 'oneNat' and 'calc1' above. (Why not?)

Even worse, we have no precedence, and so

`ghci> doParse calc1 "10*2+100"`

as the string is parsed as

`10 * (2 + 100)`

## Precedence

We can add both left associativity and precedence by stratifying the parser into different levels. Here, let's split our operations into addition- and multiplication-precedence.

```
addOp :: Parser (Int -> Int -> Int)
addOp = plus `chooseP` minus
where plus = char '+' >> return (+)
minus = char '-' >> return (-)
mulOp :: Parser (Int -> Int -> Int)
mulOp = times `chooseP` divide
where times = char '*' >> return (*)
divide = char '/' >> return div
```

Now, we can stratify our language into (mutually recursive) sub-languages, where each top-level expression is parsed as a *sum-of-products*

```
sumE :: Parser Int
sumE = addE <|> prodE
where addE = do x <- prodE
o <- addOp
y <- sumE
return $ x `o` y
prodE :: Parser Int
prodE = mulE <|> factorE
where mulE = do x <- factorE
o <- mulOp
y <- prodE
return $ x `o` y
factorE :: Parser Int
factorE = parenE <|> oneNat
where parenE = do char '('
n <- sumE
char ')'
return n
```

We can run this

```
ghci> doParse sumE "10*2+100"
ghci> doParse sumE "10*(2+100)"
```

Do you understand why the first parse returned `120`

? What would happen if we *swapped* the order of `prodE`

and `sumE`

in the body of `addE`

(or `factorE`

and `prodE`

in the body of `prodE`

) ? Why?

## Parsing Pattern: Chaining

There is not much point gloating about combinators if we are going to write code like the above -- the bodies of `sumE`

and `prodE`

are almost identical!

Let's take a closer look at them. In essence, a `sumE`

is of the form

`prodE + < prodE + < prodE + ... < prodE >>>`

that is, we keep chaining together `prodE`

values and adding them for as long as we can. Similarly a `prodE`

is of the form

`factorE * < factorE * < factorE * ... < factorE >>>`

where we keep chaining `factorE`

values and multiplying them for as long as we can. There is something unpleasant about the above: the addition operators are right-associative

`ghci> doParse sumE "10-1-1"`

Ugh! I hope you understand why: it's because the above was parsed as `10 - (1 - 1)`

(right associative) and not `(10 - 1) - 1`

(left associative). You might be tempted to fix that simply by flipping the order of `prodE`

and `sumE`

```
sumE = addE <|> prodE
where addE = do x <- sumE
o <- addOp
y <- prodE
return $ x `o` y
```

but this would prove disastrous. Can you see why? The parser for `sumE`

directly (recursively) calls itself *without consuming any input!* Thus, it goes off the deep end and never comes back. Instead, we want to make sure we keep consuming `prodE`

values and adding them up (rather like fold) and so we could do

```
sumE1 :: Parser Int
sumE1 = prodE1 >>= rest
where rest x = next x <|> return x
next x = do o <- addOp
y <- prodE1
rest $ x `o` y
prodE1 :: Parser Int
prodE1 = factorE1 >>= rest
where rest x = next x <|> return x
next x = do o <- mulOp
y <- factorE1
rest $ x `o` y
factorE1 :: Parser Int
factorE1 = parenE <|> oneNat
where parenE = do char '('
n <- sumE1
char ')'
return n
```

It is easy to check that the above is indeed left associative.

`ghci> doParse sumE1 "10-1-1"`

and it is also very easy to spot and bottle the chaining computation pattern: the only differences are the *base* parser (`prodE1`

vs `factorE1`

) and the binary operation (`addOp`

vs `mulOp`

). We simply make those parameters to our *chain-left* combinator

```
chainl :: Parser b -> Parser (b -> b -> b) -> Parser b
p `chainl` pop = p >>= rest
where rest x = next x <|> return x
next x = do o <- pop
y <- p
rest $ x `o` y
```

Similarly, we often want to parse bracketed expressions, so we can write a combinator

```
parenP :: Char -> Parser b -> Char -> Parser b
parenP l p r = do char l
x <- p
char r
return x
```

after which we can rewrite the grammar in three lines

```
sumE2 = prodE2 `chainl` addOp
prodE2 = factorE2 `chainl` mulOp
factorE2 = parenP '(' sumE2 ')' <|> oneNat
```

```
ghci> doParse sumE2 "10-1-1"
ghci> doParse sumE2 "10*2+1"
ghci> doParse sumE2 "10+2*1"
```

That concludes our (in-class) exploration of monadic parsing. This is merely the tip of the iceberg. Though parsing is a very old problem, and has been studied since the dawn of computing, we saw how monads bring a fresh perspective which have recently been transferred from Haskell to [many other languages][3]. There have been several exciting recent papers on the subject, that you can explore on your own. Finally, Haskell comes with several parser combinator libraries which you will play around with in your next homework assignment.

## News :

Welcome to CIS 552!

See the home page for basic
information about the course, the schedule for the lecture notes
and assignments, the resources for links to the required software
and online references, and the syllabus for detailed information about
the course policies.