We're still under construction!

And you just found out where. We're still working on this page. How did you get in here, anyway?

Writing a Parser Combinator Library in Haskell

We write our own stripped-down version of the popular parsec library.

Parser Combinators real quick

Parser-combinators are what truly got me into Haskell. I saw a video of a professor giving a lecture on building a parser from scratch. He did it with just a handful of code, over the course of an hour, explaining everything along the way.

I was at the time working with Erlang due to my interest in distributed computing and filesystems, but I was already entertaining the idea of learning haskell. There was a distributed computing library somewhat modeled after Erlang that I wanted to check out (the now-defunct cloud-haskell project, which I have carried on its spirit). This video made me jump ship entirely.

So what are parser combinators? They're a great way of building larger parsers our of bits of smaller parsers. You take one parser and combine it with another parser. Parser combinators! Quite often, you can end up with a parser that is roughly in the shape of the data it is parsing, which is a rather neat and useful structural trick.

So with that out of the way, let's get to defining our project cabal file! It's rather short, being 90% boilerplate. We only are using the base library, and we only have the one Lambda.Parser module.

With the project file done, there's nowhere to go now except to the code.

NOTE: We're going to try and keep it rather simple, and try to avoid using higher-level haskell concepts in favor of conceptual clarity.

cabal project file - lc-parser.cabal

cabal-version:      3.0
name:               lc-parser
version:            0.0.1
license:            NONE
author:             Leo
maintainer:         leo@apotheca.io
build-type:         Simple
extra-source-files: README.md

        base >= 4 && < 5
    hs-source-dirs:   src
    default-language: Haskell2010

Module and Imports

First up, we have our module declaration and imports. We're not doing anything really fancy, so there's not all that many of them

We pull in Prelude of course, and then some standard data types Bool, Char, Either and Maybe. This gives us the majority of what we need for building our parser.

We also pull in Applicative and Monad for a few important bits used for gluing parsers together - we'll cover the whole Functor-Applicative-Monad thing in depth some day.

Finally, we pull in a cheeky little first function from Bifunctor so we can map over the first element of a tuple later. We could use first from Arrow, as those two functions happen to coincide for tuples, but Arrow generalizes in a different direction than what we want so we'll stick with Bifunctor.

haskell - src/Lambda/Parser.hs

module Lambda.Parser where

import Prelude

import Data.Bool
import Data.Char
import Data.Either
import Data.Maybe

import Control.Applicative
import Control.Monad

import Data.Bifunctor (first)

Parser types

Next, we define our parser data type, plus a supporting error type.

So what is a parser anyway? In a broad sense, just it is a function that reads in some input, and produces some output, returning the result along with the rest of the remaining input. Then, the next parser can continue where the first one left off.

This corresponds to a type that looks like String -> (a, String), which works well so long as you only give it input that doesn't fail to parse. This makes it a bit nitpicky to use (not unlike partial functions), as in the real world, we don't always know if a parser is going to succeed ahead of time.

So in practice, our parser will handle the possibility of errors by wrapping the returned (a, String) with an Either ParserError. We could throw an error or exception instead of explicit error handling and keep the String -> (a, String) type clean of mentioning the possibility of errors, but that would require hiding the error using the fact that the Parser is a Monad, and would involve invoking the dreaded exception-catching machinery which is out of scope for this article.

With that explanation out of the way, our Parser data type is just a newtype wrapper around a String -> Either ParseError (a, String) function. This makes it easy to just say Parser a instead of constantly copy-pasting String -> Either ParseError (a, String) everywhere. We also define our ParseError which will tell us a bit about what went wrong in the case that things do. We also define a quick parse function for completely ignoring the error handling that we just talked about.

newtype Parser a = Parser { runParser :: String -> Either ParseError (a, String) }

data ParseError
    = EndOfInput
    | InputRemaining String
    | Unexpected String
    | Failure String
    | Empty
    deriving (Show)

parse :: Parser a -> String -> a
parse p s = case runParser p s of
    Left e          -> error $ show e
    Right (a,[])    -> a
    Right (a,s)     -> error "unexpected input remaining"

Parser class instances

Now, we have a bit of boilerplate to define. As I said earlier, we'll cover the Functor Applicative Monad stack and related concepts more fully elsewhere, but we are going to implement some conforming instances and use them, so we'll briefly touch on their concepts anyway. We've got to define this stuff because it lets us glue our parsers together easily.

Functors are abstract contexts that generalize functions and containers, and allow you to map over them. So they may 'contain' a thing in a concrete manner, like Arrays and Dictionaries, but they also may contain things in an abstract way, such as how a function 'contains' the answers to all of its results. We're using it in the second sense - our parsers 'contain' their result, even though we've not provided them any input yet! How strange!

Looking at our implmentation of map (or, fmap as it is actually named), we can see that it just takes a function f :: a -> b and applies it to the a in the (a, String), turning our Parser a into a Parser b.

Applicative is Functor with a bit extra - we start being able to 'put things in' to the to the parser with pure, and can apply functions that are 'in the functor' to data that is also 'in the functor', like how fmap lets us apply a function outside of the functor data that is 'in the functor'. Monad lets us do these things in a specific order, so we can be sure that the past happens before the future Alternative gives us a way of continuing with a second parser if the first parser fails, allowing us to differentiate many (zero or more) from some (one or more).

There's simpler ways to define some of these implementations - taking advantage of Either's Monad instance, or using list comprehensions, or do syntax - but we've kept it simple to keep it self-contained.

Congratulations! This was actually the hardest part! Although we haven't parsed a single thing yet, that's because we havent actually defined any parsers yet, although we have set up most of the machinery that we'll need to do it!

instance Functor Parser where
    fmap f p = Parser $ \ s -> case runParser p s of
        Left e          -> Left e
        Right (a, s')   -> Right (f a, s')

instance Applicative Parser where
    pure a = Parser $ \ s -> Right (a, s)
    (<*>) p q = Parser $ \ s -> case runParser p s of
        Left e          -> Left e
        Right (f, s')   -> case runParser q s' of
            Left e          -> Left e
            Right (a, s'')  -> Right (f a, s'')

instance Alternative Parser where
    empty = Parser $ \ s -> Left Empty
    (<|>) p q = Parser $ \ s -> case runParser p s of
        Left e  -> runParser q s
        a       -> a

instance Monad Parser where
    (>>=) p m = Parser $ \ s -> case runParser p s of
        Left e          -> Left e
        Right (a, s')   -> runParser (m a) s'

Char and String parsers

At last, we get to actual parsing. Where to begin?

Well, the first parser to write, the parser that we always have, is the parser that returns the first part of any input. Since our input is a List of Chars aka a String, that means we need to write the parser anyChar :: Parser Char. It simply takes the first character of the string, and returns it as the result with the rest of the string as the remaining input, else erroring with an unexpected EndOfInput if the string is empty.

We can now run parse anyChar "a", and get the result 'a'. If we try to parse an empty string, it fails with EndOfInput.

This alone isn't useful - we need to be able to match our input against a predicate to have behavior determined by the input. In other words, it helps to be able to ask what a piece of input is, or if it matches our criteria. For this, we need the parser satisfy :: (Char -> Bool) -> Parser Char. It is like the anyChar parser, except we check the input against the predicate, returning it if it passes, and returning an Unexpected error if it fails.

We can now run parse (satisfy isAlpha) and succeed on "a" but fail on "0". In fact, we can implement a slew of simple character parsers using a bunch of predicates from Data.Char.

Finally, we can also use create a string parser that matches a sequence of characters!

anyChar :: Parser Char
anyChar = Parser $ \ s -> case s of
    []      -> Left EndOfInput
    (c:cs)  -> Right (c, cs)

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser $ \ input -> case input of
    []      -> Left EndOfInput
    (c:cs)  -> if p c
        then  Right (c, cs)
        else  Left (Unexpected (c:[]))

char :: Char -> Parser Char
char c = satisfy (== c)

oneOf :: [Char] -> Parser Char
oneOf cs = satisfy (flip elem cs)

noneOf :: [Char] -> Parser Char
noneOf cs = satisfy (not . flip elem cs)

lower :: Parser Char
lower = satisfy isLower

upper :: Parser Char
upper = satisfy isUpper

letter :: Parser Char
letter = satisfy isAlpha 

digit :: Parser Char
digit = satisfy isDigit

alphaNum :: Parser Char
alphaNum = satisfy isAlphaNum

string :: String -> Parser String
string [] = pure []
string (c:cs) = (:) <$> char c <*> string cs

spaces :: Parser String
spaces = many $ oneOf " \n\r\t"


With our primitive character parsers in place, it is time to start combining them.

Just as it is nice to be able to satisfy a predicate because we aren't certain what is going to be next, sometimes we want to parse more than just a single input at a time - we want to be able to parse a list of something, or maybe we want a sequence of one thing separated by another. To that end, we've implemented some of the useful combinators from Parsec.

Give it a try! Run parse (between (char '(') (char ')') (some digit)) "(123)" and it should yield "123"!

All of this enables us to write parsers of higher and higher order, allowing us to construct more complex data types from our primatives. Try building a parser that turns a string of digits into an integer.

choice :: [Parser a] -> Parser a
choice ps = foldr (<|>) empty ps

count :: Int -> Parser a -> Parser [a] 
count n p 
    | n <= 0    = return []
    | otherwise = sequence (replicate n p)

between :: Parser a -> Parser b -> Parser c -> Parser c
between o c a = do
    a' <- a
    pure a'

surround :: Parser a -> Parser b -> Parser b
surround q a = q *> a <* q

chainl :: Parser a -> Parser (a -> a -> a) -> a -> Parser a
chainl p op a = chainl1 p op <|> return a

chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainl1 p op = p >>= loop where 
    loop x = (do
        f <- op
        y <- p
        loop (f x y)
        ) <|> return x

manySep :: Parser a -> Parser sep -> Parser [a]
manySep p sep = someSep p sep <|> return []

someSep :: Parser a -> Parser sep -> Parser [a]
someSep p sep = do
    x <- p
    xs <- many (sep >> p)
    return (x:xs)

someEnd :: Parser a -> Parser sep -> Parser [a]
someEnd p sep = some $ do
    x <- p
    return x

manyEnd :: Parser a -> Parser sep -> Parser [a]
manyEnd p sep  = many $ do
    x <- p
    return x

manySepEnd :: Parser a -> Parser sep -> Parser [a]
manySepEnd p sep = someSepEnd p sep <|> return []

someSepEnd :: Parser a -> Parser sep -> Parser [a]
someSepEnd p sep = do
    x <- p
        xs <- manySepEnd p sep
        return (x:xs)
        ) <|> return [x]

eof :: Parser ()
eof = Parser $ \ s -> case s of
    [] -> Right ((),[])
    _ -> Left (InputRemaining s)

And that's it!

Aside from a bit of nomenclature (fooBy vs manyFoo, fooBy1 vs someFoo), it's practically a drop-in replacement for Parsec in our earlier use cases!

At some point in the future, we will actually replace our use of parsec with this library. Part of the reason we kept it simple and avoided use of extended features such as even classes was so that we can implement a version of this using our lambda calculus interpreter!

Stay tuned!

TODO: Add repo / source download links.

- Leo D., Undated, revised / edited / published Aug 2022