Parser Combinators
43 min read
Learning Outcomes
- Understand that a parserA function or program that interprets structured input, often used to convert strings into data structures. is a program which extracts information from structured text
- Apply what we have learned about Haskell typeclasses and other functionalFunctional languages are built around the concept of composable functions. Such languages support higher order functions which can take other functions as arguments or return new functions as their result, following the rules of the Lambda Calculus. programming concepts to create solutions to real-world problems
- In particular, we learn to use parserA function or program that interprets structured input, often used to convert strings into data structures. combinatorsA higher-order function that uses only function application and earlier defined combinators to define a result from its arguments. and see how they are put together
Introduction
In this section we will see how the various Haskell language features we have explored allow us to solve real-world problems. In particular, we will develop a simple but powerful library for building parsersA function or program that interprets structured input, often used to convert strings into data structures. that is compositional through FunctorA type class in Haskell that represents types that can be mapped over. Instances of Functor must define the fmap function, which applies a function to every element in a structure. , ApplicativeA type class in Haskell that extends Functor, allowing functions that are within a context to be applied to values that are also within a context. Applicative defines the functions pure and (<*>). and MonadA type class in Haskell that represents computations as a series of steps. It provides the bind operation (»=) to chain operations and the return (or pure) function to inject values into the monadic context. interfacesA TypeScript construct that defines the shape of an object, specifying the types of its properties and methods. . Before this, though, we will learn the basics of parsing text, including a high-level understanding that parsersA function or program that interprets structured input, often used to convert strings into data structures. are state-machines which realise a context-free grammarA type of formal grammar that is used to define the syntax of programming languages and data formats. CFGs consist of a set of production rules that define how terminals and non-terminals can be combined to produce strings in the language. over a textual language.
Previously, we glimpsed a very simplistic ApplicativeA type class in Haskell that extends Functor, allowing functions that are within a context to be applied to values that are also within a context. Applicative defines the functions pure and (<*>). parserA function or program that interprets structured input, often used to convert strings into data structures. . In this chapter, a parserA function or program that interprets structured input, often used to convert strings into data structures. is still simply a function which takes a string as input and produces some structure or computation as output, but now we extend the parserA function or program that interprets structured input, often used to convert strings into data structures. with monadic “bindthe defining function which all monads must implement. ” definitions, richer error handling and the ability to handle non-trivial grammars with alternativeA type class in Haskell that extends Applicative, introducing the empty and <|> functions for representing computations that can fail or have multiple outcomes. inputs.
Parsing has a long history and parserA function or program that interprets structured input, often used to convert strings into data structures.
combinatorsA higher-order function that uses only function application and earlier defined combinators to define a result from its arguments.
are a relatively recent approach made popular by modern functionalFunctional languages are built around the concept of composable functions. Such languages support higher order functions which can take other functions as arguments or return new functions as their result, following the rules of the Lambda Calculus.
programming techniques.
A parserA function or program that interprets structured input, often used to convert strings into data structures.
combinatorA higher-order function that uses only function application and earlier defined combinators to define a result from its arguments.
is a higher-order functionA function that takes other functions as arguments or returns a function as its result.
that accepts parsersA function or program that interprets structured input, often used to convert strings into data structures.
as input and combines them somehow into a new parserA function or program that interprets structured input, often used to convert strings into data structures.
.
More traditional approaches to parsing typically involve special purpose programs called parserA function or program that interprets structured input, often used to convert strings into data structures. generators, which take as input a grammar defined in a special language (usually some derivation of BNF as described below) and generate the partial program in the desired programming language which must then be completed by the programmer to parse such input. ParserA function or program that interprets structured input, often used to convert strings into data structures. combinatorsA higher-order function that uses only function application and earlier defined combinators to define a result from its arguments. have the advantage that they are entirely written in the one language. ParserA function or program that interprets structured input, often used to convert strings into data structures. combinatorsA higher-order function that uses only function application and earlier defined combinators to define a result from its arguments. written in Haskell take advantage of the expressiveness of the Haskell language such that the finished parserA function or program that interprets structured input, often used to convert strings into data structures. can look a lot like a BNF grammar definition, as we shall see.
The parserA function or program that interprets structured input, often used to convert strings into data structures. combinatorA higher-order function that uses only function application and earlier defined combinators to define a result from its arguments. discussed here is based on one developed by Tony Morris and Mark Hibberd as part of their “System F” FunctionalFunctional languages are built around the concept of composable functions. Such languages support higher order functions which can take other functions as arguments or return new functions as their result, following the rules of the Lambda Calculus. Programming Course, which in turn is a simplified version of official Haskell parserA function or program that interprets structured input, often used to convert strings into data structures. combinatorsA higher-order function that uses only function application and earlier defined combinators to define a result from its arguments. such as parsec by Daan Leijen.
You can play with the example and the various parserA function or program that interprets structured input, often used to convert strings into data structures. bits and pieces in this on-line playground.
Context-free Grammars and BNF
Fundamental to analysis of human natural language but also to the design of programming languages is the idea of a grammar, or a set of rules for how elements of the language may be composed. A context-free grammarA type of formal grammar that is used to define the syntax of programming languages and data formats. CFGs consist of a set of production rules that define how terminals and non-terminals can be combined to produce strings in the language.
(CFG) is one in which the set of rules for what is produced for a given input (production rules) completely cover the set of possible input symbols (i.e. there is no additional context required to parse the input). Backus-Naur FormA notation for expressing context-free grammars. It is used to formally describe the syntax of programming languages.
(or BNF) is a notation that has become standard for writing CFGs since the 1960s. We will use BNF notation from now on. There are two types of symbols in a CFG: terminalIn the context of grammars, a terminal is a symbol that appears in the strings generated by the grammar. Terminals are the actual characters or tokens of the language.
and non-terminalA symbol in a grammar that can be replaced by a sequence of terminals and non-terminals according to the production rules of the grammar.
. In BNF non-terminalA symbol in a grammar that can be replaced by a sequence of terminals and non-terminals according to the production rules of the grammar.
symbols are <nameInsideAngleBrackets>
and can be converted into a mixture of terminalsIn the context of grammars, a terminal is a symbol that appears in the strings generated by the grammar. Terminals are the actual characters or tokens of the language.
and/or nonterminals by production rules:
<nonterminal> ::= a mixture of terminals and <nonterminal>s, alternatives separated by |
Thus, terminalsIn the context of grammars, a terminal is a symbol that appears in the strings generated by the grammar. Terminals are the actual characters or tokens of the language.
may only appear on the right-hand side of a production rule, non-terminalsA symbol in a grammar that can be replaced by a sequence of terminals and non-terminals according to the production rules of the grammar.
on either side. In BNF each non-terminalA symbol in a grammar that can be replaced by a sequence of terminals and non-terminals according to the production rules of the grammar.
symbol appears on the left-hand side of exactly one production rule, and there may be several possible alternativesA type class in Haskell that extends Applicative, introducing the empty and <|> functions for representing computations that can fail or have multiple outcomes.
for each non-terminalA symbol in a grammar that can be replaced by a sequence of terminals and non-terminals according to the production rules of the grammar.
specified on the right-hand side. These are separated by a |
(in this regard they look a bit like the syntaxThe set of rules that defines the combinations of symbols that are considered to be correctly structured statements or expressions in a computer language.
for algebraic data type definitions).
Note that production rules of the form above are for context-free grammarsA type of formal grammar that is used to define the syntax of programming languages and data formats. CFGs consist of a set of production rules that define how terminals and non-terminals can be combined to produce strings in the language. . As a definition by counter-example, context sensitive grammars allow terminalsIn the context of grammars, a terminal is a symbol that appears in the strings generated by the grammar. Terminals are the actual characters or tokens of the language. and more than one non-terminalA symbol in a grammar that can be replaced by a sequence of terminals and non-terminals according to the production rules of the grammar. on the left hand side.
Here’s an example BNF grammar for parsing Australian land-line phone numbers, which may optionally include a two-digit area code in brackets, and then two groups of four digits, with an arbitrary number of spaces separating each of these, e.g.:
(03) 9583 1762
9583 1762
Here’s the BNF grammar:
<phoneNumber> ::= <fullNumber> | <basicNumber>
<fullNumber> ::= <areaCode> <basicNumber>
<basicNumber> ::= <spaces> <fourDigits> <spaces> <fourDigits>
<fourDigits> ::= <digit> <digit> <digit> <digit>
<areaCode> ::= "(" <digit> <digit> ")"
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<spaces> ::= " " <spaces> | ""
So "0"
-"9"
, "("
, ")"
, and " "
are the full set of terminalsIn the context of grammars, a terminal is a symbol that appears in the strings generated by the grammar. Terminals are the actual characters or tokens of the language.
.
Now here’s a sneak peak at a simple parserA function or program that interprets structured input, often used to convert strings into data structures.
for such phone numbers. It succeeds for any input string which satisfies the above grammar, returning a 10-digit string for the full number without spaces and assumes “03” for the area code for numbers with none specified (i.e. it assumes they are local to Victoria). Our Parser
type provides a function parse
which we call like so:
GHCi> parse phoneNumber "(02)9583 1762"
Result >< "0295831762"
GHCi> parse phoneNumber "9583 1762"
Result >< "0395831762"
GHCi> parse phoneNumber "9583-1762"
Unexpected character: "-"
We haven’t bothered to show the types for each of the functions in the code below, as they are all ::Parser [Char]
- meaning a ParserA function or program that interprets structured input, often used to convert strings into data structures.
that returns a string. We’ll explain all the types and functions used in due course. For now, just notice how similar the code is to the BNF grammar definition:
phoneNumber = fullNumber <|> (("03"++) <$> basicNumber)
fullNumber = do
ac <- areaCode
n <- basicNumber
pure (ac ++ n)
basicNumber = do
spaces
first <- fourDigits
spaces
second <- fourDigits
pure (first ++ second)
fourDigits = do
a <- digit
b <- digit
c <- digit
d <- digit
pure [a,b,c,d]
areaCode = do
is '('
a <- digit
b <- digit
is ')'
pure [a,b]
Parser Type
In essence, our parserA function or program that interprets structured input, often used to convert strings into data structures. is going to be summed up by a couple of types:
type Input = String
newtype Parser a = P { parse :: Input -> ParseResult a}
We assume all Input
is a String
, i.e. Haskell’s basic builtin String
which is a list of Char
.
Then the Parser
type has one field parse
which is a function of type Input -> ParseResult a
. So it parses strings and produces parse results, where a Parse result is:
data ParseResult a = Error ParseError
| Result Input a
deriving Eq
We’ll come back to the ParseError
type - which will be returned in the case of unexpected input, but we can see that a successful Parse is going to produce a Result
which has two fields—more Input
(the part of the input remaining after we took a bit off and parsed it), and an a
, a type parameterA placeholder for a type that is specified when a generic function or class is used, allowing for type-safe but flexible code.
that we may specify for concrete Parser
instances.
The Parser
and the ParseResult
types are pretty abstract. They say nothing about what precise Input
string we are going to parse, or what type a
we are going to return in the result. This is the strength of the parserA function or program that interprets structured input, often used to convert strings into data structures.
, allowing us to build up sophisticated parsersA function or program that interprets structured input, often used to convert strings into data structures.
for different input grammars through composition using instances of FunctorA type class in Haskell that represents types that can be mapped over. Instances of Functor must define the fmap function, which applies a function to every element in a structure.
, ApplicativeA type class in Haskell that extends Functor, allowing functions that are within a context to be applied to values that are also within a context. Applicative defines the functions pure and (<*>).
and MonadA type class in Haskell that represents computations as a series of steps. It provides the bind operation (»=) to chain operations and the return (or pure) function to inject values into the monadic context.
, and the ParseResult
parameter a
allows us to produce whatever we want from the parsersA function or program that interprets structured input, often used to convert strings into data structures.
we create.
Error Handling
Error handling is a very important part of any real-world parserA function or program that interprets structured input, often used to convert strings into data structures.
. Decent error reporting allows us to quickly diagnose problems in our input.
As we saw above a ParseResult
may be either a successful Result
or an Error
, the latter containing information in a ParseError
data structure about the nature of the error.
data ParseError =
UnexpectedEof -- hit end of file when we expected more input
| ExpectedEof Input -- should have successfully parsed everything but there’s more!
| UnexpectedChar Char
| UnexpectedString String
deriving (Eq, Show)
Naturally it needs to be Show
able, and we’ll throw in an Eq
for good measure.
Instances
First an instance of Show
to pretty print the ParseResult
s:
instance Show a => Show (ParseResult a) where
show (Result i a) = "Result >" ++ i ++ "< " ++ show a
show (Error UnexpectedEof) = "Unexpected end of stream"
show (Error (UnexpectedChar c)) = "Unexpected character: " ++ show [c]
show (Error (UnexpectedString s)) = "Unexpected string: " ++ show s
show (Error (ExpectedEof i)) =
"Expected end of stream, but got >" ++ show i ++ "<"
And ParseResult
is also an instance of Functor
so that we can map functions over the output of a successful parse—or do nothing if the result is an Error
:
instance Functor ParseResult where
fmap f (Result i a) = Result i (f a)
fmap _ (Error e) = Error e
A Parser
itself is also a Functor
. This allows us to create a new Parser
by composing functionality onto the parse
function for a given Parser
:
instance Functor Parser where
fmap :: (a -> b) -> Parser a -> Parser b
fmap f (P p) = P (fmap f . p)
The applicativeA type class in Haskell that extends Functor, allowing functions that are within a context to be applied to values that are also within a context. Applicative defines the functions pure and (<*>).
pure
creates a Parser
that always succeeds with the given input, and thus forms a basis for composition. We saw it being used in the above example to return the results of a parse back into the Parser
at the end of a do
block.
The (<*>)
allows us to map functions in the Parser
over another Parser
. As with other Applicative
instances, a common use case would be composition with a Parser
that returns a data constructor as we will see in the next example.
instance Applicative Parser where
pure :: a -> Parser a
pure x = P (`Result` x)
(<*>) :: Parser (a -> b) -> Parser a -> Parser b
(<*>) p q = p >>= (<$> q)
The Monad
instance’s bindthe defining function which all monads must implement.
function (>>=)
we have already seen in use in the example above, allowing us to sequence Parser
s in do
-blocks to build up the implementation of the BNF grammar.
instance Monad Parser where
(>>=) :: Parser a -> (a -> Parser b) -> Parser b
(>>=) (P p) f = P (
\i -> case p i of
Result rest x -> parse (f x) rest
Error e -> Error e)
Parser Combinators
The most atomic function for a parserA function or program that interprets structured input, often used to convert strings into data structures.
of String
is to pull a single character off the input. The only thing that could go wrong is to find our input is empty.
character :: Parser Char
character = P parseit
where parseit "" = Error UnexpectedEof
parseit (c:s) = Result s c
The following is how we will report an error when we encounter a character we didn’t expect. This is not the logic for recognising a character, that’s already happened and failed and the unrecognised character is now the parameter. This is just error reporting, and since we have to do it from within the context of a Parser
, we create one using the P
constructor. Then we set up the one field common to any Parser
, a function which returns a ParseResult
no matter the input, hence const
. The rest creates the right type of Error
for the given Char
.
unexpectedCharParser :: Char -> Parser a
unexpectedCharParser = P . const . Error . UnexpectedChar
Now a parserA function or program that interprets structured input, often used to convert strings into data structures.
that insists on a certain character being the next one on the input. It’s using the Parser
instance of Monad
’s bindthe defining function which all monads must implement.
function (implicitly in a do
block) to sequence first the character
Parser
, then either return the correct character in the Parser
, or the Error
parserA function or program that interprets structured input, often used to convert strings into data structures.
.
is :: Char -> Parser Char
is c = do
v <- character
let next = if v == c
then pure
else const $ unexpectedCharParser v
next c
And finally we introduce the Alternative
typeclass for our Parser
for trying to apply a first Parser
, and then an alternate Parser
if the first fails. This allows us to encode the alternativesA type class in Haskell that extends Applicative, introducing the empty and <|> functions for representing computations that can fail or have multiple outcomes.
in our BNF grammar rules.
instance Alternative Parser where
empty :: Parser a
empty = Parser $ const (Error UnexpectedEof)
p1 <|> p2 = P (\i -> let f (Error _) = parse p2 i
f r = r
in f $ parse p1 i)
Nitty gritty
The last two pieces of our Phone Numbers grammar we also implement fairly straightforwardly from the BNF. In a real parserA function or program that interprets structured input, often used to convert strings into data structures. combinatorA higher-order function that uses only function application and earlier defined combinators to define a result from its arguments. library you’d do it differently, as per our exercises below.
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<spaces> ::= " " <spaces> | ""
Here’s a trivial adaptation of digit
:
digit :: Parser Char
digit = is '0' <|> is '1' <|> is '2' <|> is '3' <|> is '4' <|> is '5' <|> is '6' <|> is '7' <|> is '8' <|> is '9'
Spaces is a bit more interesting because it’s recursive, but still almost identical to the BNF:
spaces :: Parser ()
spaces = (is ' ' >> spaces) <|> pure ()
Exercises
-
make a less repetitive
digit
parserA function or program that interprets structured input, often used to convert strings into data structures. by creating a functionsatisfy :: (Char -> Bool) -> Parser Char
which returns a parserA function or program that interprets structured input, often used to convert strings into data structures. that produces a character but fails if the input is empty or the character does not satisfy the given predicate. You can use theisDigit
function fromData.Char
as the predicate. -
change the type of
spaces
toParser [Char]
and have it return the appropriately sized string of only spaces.
Solutions
We can generalise the is
parserA function or program that interprets structured input, often used to convert strings into data structures.
to handle a predicate
satisfy :: (Char -> Bool) -> Parser Char
satisfy predicate = do
c <- character
let next = if f c then pure else unexpectedCharParser
next c
digit :: Parser Char
digit = satisfy isDigit
spaces :: Parser [Char]
spaces = (do
_ <- is ' '
rest <- spaces
pure (' ' : rest)
) <|> pure []
We can do this recursively, by trying to parse as many as possible, or we can use the many function to parse many spaces.
spaces :: Parser [Char]
spaces = many (satisfy isSpace)
A Parser that returns an ADT
The return type of the phone number parserA function or program that interprets structured input, often used to convert strings into data structures.
above was [Char]
(equivalent to String
). A more typical use case for a parserA function or program that interprets structured input, often used to convert strings into data structures.
though is to generate some data structure that we can then process in other ways. In Haskell, this usually means a parserA function or program that interprets structured input, often used to convert strings into data structures.
which returns an Algebraic Data Type (ADT). Here is a very simple example.
Let’s imagine we need to parse records from a vets office. It treats only three types of animals. As always, lets start with the BNF:
<Animal> ::= "cat" | "dog" | "camel"
So our simple grammar consists of three terminalsIn the context of grammars, a terminal is a symbol that appears in the strings generated by the grammar. Terminals are the actual characters or tokens of the language.
, each of which is a straightforward string token (a constant string that makes up a primitive word in our language). To parse such a token, we’ll need a parserA function or program that interprets structured input, often used to convert strings into data structures.
which succeeds if it finds the specified string next in its input. We’ll use our is
parserA function or program that interprets structured input, often used to convert strings into data structures.
from above (which simply confirms a given character is next in its input). The type of is was Char -> Parser Char
. Since Parser
is an instance of Applicative
, we can simply traverse
the is
parserA function or program that interprets structured input, often used to convert strings into data structures.
across the given String
(list of Char
) to produce another String
in the Parser
applicativeA type class in Haskell that extends Functor, allowing functions that are within a context to be applied to values that are also within a context. Applicative defines the functions pure and (<*>).
context.
string :: String -> Parser String
string = traverse is
Now let’s define an ADT for animals:
data Animal = Cat | Dog | Camel
deriving Show
A parserA function or program that interprets structured input, often used to convert strings into data structures.
for “cat” is rather simple. If we find the string "cat"
we produce a Cat
:
cat :: Parser Animal
cat = string "cat" >> pure Cat
Let’s test it:
> parse cat "cat"
Result >< Cat
Ditto dogs and camels:
dog, camel :: Parser Animal
dog = string "dog" >> pure Dog
camel = string "camel" >> pure Camel
And now a parserA function or program that interprets structured input, often used to convert strings into data structures. for our full grammar:
animal :: Parser Animal
animal = cat <|> dog <|> camel
Some tests:
> parse animal "cat"
Result >< Cat
> parse animal "dog"
Result >< Dog
> parse animal "camel"
Result >< Camel
What’s really cool about this is that obviously the strings “cat” and “camel” overlap at the start. Our alternativeA type class in Haskell that extends Applicative, introducing the empty and <|> functions for representing computations that can fail or have multiple outcomes.
parserA function or program that interprets structured input, often used to convert strings into data structures.
(<|>)
effectively backtracks when the cat
parserA function or program that interprets structured input, often used to convert strings into data structures.
fails before eventually succeeding with the camel
parserA function or program that interprets structured input, often used to convert strings into data structures.
. In an imperativeImperative programs are a sequence of statements that change a programs state. This is probably the dominant paradigm for programming languages today. Languages from Assembler to Python are built around this concept and most modern languages still allow you to program in this style.
style program this kind of logic would result in much messier code.
Exercises
- Make a parserA function or program that interprets structured input, often used to convert strings into data structures.
stringTok
which uses thestring
parserA function or program that interprets structured input, often used to convert strings into data structures. to parse a given string, but ignores anyspaces
before or after the token. - Write some messy imperativeImperative programs are a sequence of statements that change a programs state. This is probably the dominant paradigm for programming languages today. Languages from Assembler to Python are built around this concept and most modern languages still allow you to program in this style. -style JavaScript (no higher-order functionsA function that takes other functions as arguments or returns a function as its result. allowed) to parse cat, dog or camel and construct a different class instance for each.
- Now add “dolphin” to the grammar, and use a stopwatch to time yourself extending your messy imperativeImperative programs are a sequence of statements that change a programs state. This is probably the dominant paradigm for programming languages today. Languages from Assembler to Python are built around this concept and most modern languages still allow you to program in this style.
code. I bet it takes longer than extending the
animal
parserA function or program that interprets structured input, often used to convert strings into data structures. combinatorA higher-order function that uses only function application and earlier defined combinators to define a result from its arguments. . - Modify the grammar and the ADT to have some extra data fields for each of the animal types, e.g.
humpCount
,remainingLives
,barkstyle
, etc. - Extend your parserA function or program that interprets structured input, often used to convert strings into data structures. to produce these records.
Solutions
- To create
stringTok
, we can make use of<<
or>>
to ignore parts of the result:
stringTok :: String -> Parser String
stringTok s = spaces >> string s << spaces
- Messy imperativeImperative programs are a sequence of statements that change a programs state. This is probably the dominant paradigm for programming languages today. Languages from Assembler to Python are built around this concept and most modern languages still allow you to program in this style. JavaScript to parse animals and construct appropriate class instances:
// Define the classes for Cat, Dog, and Camel
class Cat {
constructor() {
this.type = 'Cat';
}
}
class Dog {
constructor() {
this.type = 'Dog';
}
}
class Camel {
constructor() {
this.type = 'Camel';
}
}
class Dolphin {
constructor() {
this.type = 'Dolphin';
}
}
// Imperative parser function
function parseAnimal(input) {
let animal = null;
if (input === 'cat') {
animal = new Cat();
} else if (input === 'dog') {
animal = new Dog();
} else if (input === 'dolphin') {
animal = new Dolphin();
} else if (input === 'camel') {
animal = new Camel();
} else {
throw new Error('Invalid input');
}
return animal;
}
// Example usage
try {
const animal1 = parseAnimal('cat');
console.log(animal1); // Cat { type: 'Cat' }
const animal2 = parseAnimal('dog');
console.log(animal2);
}
Creating a Parse Tree
Programs are usually parsed into a tree structure called an Abstract Syntax TreeA tree representation of the abstract syntactic structure of a string of text. Each node in the tree represents a construct occurring in the text. (AST), more generally known as a parse tree. Further processing ultimately into an object file in the appropriate format (whether it’s some sort of machine code directly executable on the machine architecture or some sort of intermediate format—e.g. Java bytecode) then essentially boils down to traversal of this tree to evaluate the statements and expressions there in the appropriate order.
We will not implement a parserA function or program that interprets structured input, often used to convert strings into data structures. for a full programming language, but to at least demonstrate what this concept looks like in Haskell we will create a simple parserA function or program that interprets structured input, often used to convert strings into data structures. for simple arithmetic expressions. The parserA function or program that interprets structured input, often used to convert strings into data structures. generates a tree structure capturing the order of operations, which we may then traverse to perform a calculation.
To start with, here is a BNF grammar for a simple calculator with three operations *
, +
and -
, with *
having higher precedence than +
or -
:
<expr> ::= <term> | <expr> <addop> <term>
<term> ::= <number> | <number> "*" <number>
<addop> ::= "+" | "-"
An expression <expr>
consists of one or more <term>
s that may be combined with an <addop>
(an addition operation, either "+"
or "-"
). A <term>
involves one or more numbers, multiplied together.
The dependencies between the non-terminalA symbol in a grammar that can be replaced by a sequence of terminals and non-terminals according to the production rules of the grammar. expressions makes explicit the precedence of multiply operations needing to occur before add (and subtract).
The data structure we will create uses the following Algebraic Datatype:
data Expr = Plus Expr Expr
| Minus Expr Expr
| Times Expr Expr
| Number Integer
deriving Show
Our top-level function will be called parseCalc
:
parseCalc :: String -> ParseResult Expr
parseCalc = parse expr
And an example use might look like:
> parseCalc " 6 *4 + 3- 8 * 2"
Result >< Minus (Plus (Times (Number 6) (Number 4)) (Number 3)) (Times (Number 8) (Number 2))
Here’s some ASCII art to make the tree structure of the ParseResult Expr
more clear:
Minus
├──Plus
| ├──Times
| | ├──Number 6
| | └──Number 4
| └──Number 3
└──Times
├──Number 8
└──Number 2
Exercises
- make an instance of
show
forExpr
which pretty prints such trees - Make a function which performs the calculation specified in an
Expr
tree like the one above.
Obviously we are going to need to parse numbers, so let’s start with a simple parserA function or program that interprets structured input, often used to convert strings into data structures.
which creates a Number
.
Note that whereas our previous parserA function or program that interprets structured input, often used to convert strings into data structures.
had type phoneNumber :: Parser [Char]
—i.e. it produced strings—this, and most of the parsersA function or program that interprets structured input, often used to convert strings into data structures.
below, produces an Expr
.
number :: Parser Expr
number = spaces >> Number . read . (:[]) <$> digit
We keep things simple for now, make use of our existing digit
parserA function or program that interprets structured input, often used to convert strings into data structures.
, and limit our input to only single digit numbers.
The expression Number . read . (:[])
is fmapped over the Parser Char
returned by digit
.
We use the PreludeThe default library loaded in Haskell that includes basic functions and operators.
function read :: Read a => String -> a
to create the Int
expected by Number
. Since read
expects a string, we apply (:[])
to turn the Char
into [Char]
, i.e. a String
.
Next, we’ll need a parserA function or program that interprets structured input, often used to convert strings into data structures.
for the various operators (*
,+
and -
). There’s enough of them that we’ll make it a general purpose Parser Char
parameterised by the character we expect:
op :: Char -> Parser Char -- parse a single char operator
op c = do
spaces
is c
pure c
As before, spaces
ignores any number of ' '
characters.
Here’s how we use op
for *
; note that it returns only the Times
constructor. Thus, our return type is an as-yet unapplied binary function (and we see now why (<*>)
is going to be useful).
times :: Parser (Expr -> Expr -> Expr)
times = op '*' >> pure Times
And for +
and -
a straightforward implementation of the <addop>
non-terminalA symbol in a grammar that can be replaced by a sequence of terminals and non-terminals according to the production rules of the grammar.
from our grammar:
addop :: Parser (Expr -> Expr -> Expr)
addop = (op '+' >> pure Plus) <|> (op '-' >> pure Minus)
And some more non-terminalsA symbol in a grammar that can be replaced by a sequence of terminals and non-terminals according to the production rules of the grammar. :
expr :: Parser Expr
expr = chain term addop
term :: Parser Expr
term = chain number times
These use the chain
function to handle repeated chains of operators (*
, -
, +
) of unknown length. We could make each of these functions recursive with a <|>
to provide an alternativeA type class in Haskell that extends Applicative, introducing the empty and <|> functions for representing computations that can fail or have multiple outcomes.
for the base case end-of-chain (as we did for spaces
, above), but we can factor the pattern out into a reusable function, like so:
chain :: Parser a -> Parser (a->a->a) -> Parser a
chain p op = p >>= rest
where
rest :: a -> Parser a
rest a = (do
f <- op
b <- p
rest (f a b)
) <|> pure a
But, how does chain
work?
p >>= rest
: The parserA function or program that interprets structured input, often used to convert strings into data structures.
p
is applied, and we pass this parsed value, to the function call rest
rest a
: Within the rest function, the parserA function or program that interprets structured input, often used to convert strings into data structures.
op is applied to parse an operator f
, and the parserA function or program that interprets structured input, often used to convert strings into data structures.
p
is applied again to parse another value b
. The result is then combined using the function f
applied to both a
and b
to form a new value. The rest function is then called recursively, with this new value.
Recursive calls: The recursive calls continue until there are no more operators op
to parse, at which point the parserA function or program that interprets structured input, often used to convert strings into data structures.
returns the last value a
. This is achieved using the pure a
expression. This makes the function tail recursive
This gives us a way to parse expressions of the form “1+2+3+4+5” by parsing “1” initially, using p
then repeatedly parsing something of the form “+2”, where op
would parse the “+” and the p
would then parse the “2”. These are combined using our Plus
constructor to be of form Plus 1 2
, this will then recessively apply the p
and op
parserA function or program that interprets structured input, often used to convert strings into data structures.
over the rest of the string: “+3+4+5”.
But, can we re-write this using a fold?
chain :: Parser a -> Parser (a -> a -> a) -> Parser a
chain p op = foldl applyOp <$> p <*> many (liftA2 (,) op p)
where
applyOp :: a -> (a->a->a, a) -> a
applyOp x (op, y) = op x y
foldl applyOp <$> p
: This part uses the FunctorA type class in Haskell that represents types that can be mapped over. Instances of Functor must define the fmap function, which applies a function to every element in a structure.
instances to combine the parsed values and apply the operators in a left-associative manner. foldl applyOp
is partially applied to p
, creating a parserA function or program that interprets structured input, often used to convert strings into data structures.
that parses an initial value (p
) and then applies the left-associative chain of operators and values.
many ((,) <$> op <*> p)
: This part represents the repetition of pairs (op, p) using the many combinatorA higher-order function that uses only function application and earlier defined combinators to define a result from its arguments.
. The tuple structure here just allows us to store the pairs of op
and p
. We use liftA2 to lift both parse results in to the tuple constructor. We run this many times, to parse many pairs of op
and p
, and create a list of tuples. As a result, it creates a parserA function or program that interprets structured input, often used to convert strings into data structures.
that parses an operator (op) followed by a value (p) and repeats this zero or more times.
applyOp x (op, y)
: This function is used by foldl
to combine the parsed values and operators. It takes an accumulated value x
, an operator op
, and a new value y
, and applies the operator to the accumulated value and the new value.
Exercises
- Similar to
chain
, factor out the recursion ofspaces
into a function which returns a parserA function or program that interprets structured input, often used to convert strings into data structures. that continues producing a list of values from a given parserA function or program that interprets structured input, often used to convert strings into data structures. , i.e.list :: Parser a -> Parser [a]
.
Parsing Rock-Paper-Scissors
A common use-case for parsing is deserialising data stored as a string. Of course, there are general data interchange formats such as JSON and XML for which most languages have parsersA function or program that interprets structured input, often used to convert strings into data structures. available. However, sometimes you want to store data in your own format for compactness or readability, and when you do, deserialising the data requires a custom parserA function or program that interprets structured input, often used to convert strings into data structures. (this example is contributed by Arthur Maheo).
We will explore a small game of Rock-Paper-Scissors using a memory. The play function will have the following type:
data RockPaperScissors = Rock | Paper | Scissors
-- | Play a round of RPS given the result of the previous round.
play
:: Maybe (RockPaperScissors, RockPaperScissors, String)
-- ^ Result of the previous round as: (your choice, opponent choice, your memory)
-> (RockPaperScissors, String) -- ^ (Choice, new memory)
We will build a simple player which will keep track of the opponent’s previous choices and try to counter the most common one.
How to build a memory
We will convert to string using a simple Show
instance:
instance Show RockPaperScissors where
show :: RockPaperScissors -> String
show Rock = "R"
show Paper = "P"
show Scissors = "S"
(Note, we could also define a Read
instance to deserialise such a simple type but we are going to define a ParserCombinator
for interest and extensibility to much more complex scenarios).
The straightforward way to create the memory is to just store a list of all the choices made by the opponent. So, for example, if the results from the previous three rounds were:
(Rock, Paper), (Rock, Scissors), (Paper, Scissors)
Then, a compact memory representation will be: "PSS"
.
Note: We only store single characters, so we do not need separators, but if you have more complex data, you will want separators.
Reading the memory
Now, we want to define a Parser RockPaperScissors
which will turn a string into a choice.
First, we will define a parserA function or program that interprets structured input, often used to convert strings into data structures.
for each of the three choices:
rock :: Parser RockPaperScissors
rock = is 'R' >> pure Rock
scissors :: Parser RockPaperScissors
scissors = is 'S' >> pure Scissors
paper :: Parser RockPaperScissors
paper = is 'P' >> pure Paper
This will give:
>>> parse rock "R"
Result >< R
>>> parse rock "RR"
Result >R< R
>>> parse rock "P"
Unexpected character: "P"
To combine those parsersA function or program that interprets structured input, often used to convert strings into data structures.
, we will use the option parserA function or program that interprets structured input, often used to convert strings into data structures.
(<|>)
.
choice :: Parser RockPaperScissors
choice = rock <|> paper <|> scissors
And, to be able to read a list of choices, we need to use the list
parserA function or program that interprets structured input, often used to convert strings into data structures.
:
>>> parse choice "PSS"
Result >SS< P
>>> parse (list choice) "PSCS"
Result >CS< [P,S]
>>> parse (list choice) "PSS"
Result >< [P,S,S]
Playing the game
Our decision function will take a list of RockPaperScissors
and return the move that would win against most of them.
One question remains: how do we get the memory out of the parserA function or program that interprets structured input, often used to convert strings into data structures.
?
The answer is: pattern-matching.
getMem :: ParseResult a -> a
getMem (Result _ cs) = cs
getMem (Error _) = error "You should not do that!"
Obviously, in a longer program you want to be handling this case better.
Hint: If your parserA function or program that interprets structured input, often used to convert strings into data structures.
returns a list of elements, the empty list []
is a good default case.
Putting it all together
The first round, our player will just pick a choice at random and return an empty memory.
play Nothing = (Scissors, "") -- Chosen at random!
Now, we need to write a couple functions:
winAgainst
that determines which choice wins against a given one.mostCommon
which finds the most common occurrence in a list.
With that, we have a full play
function:
play (Just (_, opponent, mem)) = (winning whole, concatMap convert whole)
where
-- Convert the memory to a list of different choices
as_choices = getMem . parse (list choice)
-- Get the whole set of moves—all the prev. rounds + last one
whole = opponent: as_choices mem
winning = winAgainst . mostCommon
>>> play Nothing
(S,"")
>>> play (Just (Scissors, Scissors, ""))
(R,"S")
>>> play (Just (Scissors, Scissors, "RRP"))
(P,"SRRP")
Note: Here we can see the results directly because RockPaperScissors
has an instance of Show
.
If you want to do the same with a datatype without Show
, you would need to call convert
.
Going further
Now, this is a simplistic view of storing information. We are only concatenating characters because our data is so small. However, there are better ways to store that data.
One issue with this approach is that we need to process the memory sequentially at each round. Instead, we could keep track of the number of occurrences of each choice.
Exercise
Implement a memory for the following datatype.
data Played = Played {rocks, papers, scissors :: Int}
-- | Store a @Played@ as a string in format: @nC@, with @n@ the number of
-- occurrences and @C@ the choice.
convert' :: Played -> String
convert' Played{rocks, papers, scissors} =
show rocks ++ "R" ++ show papers ++ "P" ++ show scissors ++ "S"
>>> play Nothing
(S,"0R0P0S")
>>> play (Just (Scissors, Scissors, "0R0P0S"))
(R,"0R0P1S")
>>> play (Just (Scissors, Scissors, "2R1P0S"))
(P,"2R1P1S")
Glossary
Parser: A program that processes a string of text to extract structured information from it. Parsers are used in interpreting programming languages, data formats, and other structured text formats.
Context-Free Grammar: A type of formal grammar that is used to define the syntax of programming languages and data formats. CFGs consist of a set of production rules that define how terminals and non-terminals can be combined to produce strings in the language.
Backus-Naur Form: A notation for expressing context-free grammars. It is used to formally describe the syntax of programming languages.
Terminal: In the context of grammars, a terminal is a symbol that appears in the strings generated by the grammar. Terminals are the actual characters or tokens of the language.
Non-Terminal: A symbol in a grammar that can be replaced by a sequence of terminals and non-terminals according to the production rules of the grammar.
Parser Combinator: A higher-order function that takes parsers as input and combines them to create new parsers. Parser combinators are used to build complex parsers in a modular and compositional way.
Abstract Syntax Tree: A tree representation of the abstract syntactic structure of a string of text. Each node in the tree represents a construct occurring in the text.