There comes a time in every programmer’s life when a plain old string just won’t cut it anymore - you need more structure. However, the more you think about this, the more you are terrified of the prospect of having to write some sort of parser, or maybe it’s time to put on some protective clothing on and fight through a forest of regular expression operators and arcane escaping rules. Why why why, you ask yourself, does it have to be this painful?
Ok, maybe that was a little melodramatic, but I can say that parsing with Haskell is an activity that I no longer dread, in fact - I find it fun and exciting! parsec
is a parser combinator library that provides a set of primitives for parsing. As I’ve been stressing throughout this series (and will continue to!), combinators let us build complex logic out of very small, easy to reason about, pieces.
For the example today, we’ll look at parsing International Standard Recording Codes (ISRCs), a problem that I had to solve for some work on MusicBrainz. The linked Wikipedia does a good job at explaining the format of ISRCs, so lets dive right in and try and build up our own ISRC parser. First of all, we need to parse the country code. That’s just two uppercase characters, so lets write that parser:
= count 2 upper countryCode
Simple! This parser requires two uppercase characters, so we’ve combined the upper
parser with the count
combinator to run upper
twice. If this parser succeeds, it will return the two uppercase characters that it parsed. Moving on, the next section of data we need to parse is “a three character alphanumeric registrant code”:
= count 3 upperNum
regCode where upperNum = upper <|> digit
Much like our countryCode
parser, we’ve combined the upperNum
parser with the count
combinator to run it 3 times. upperNum
can be defined by parsing either an upper
character (A-Z) or a digit
(0-9). The <|>
combinator, part of the Alternative
type class, allows us to try one parser and if that fails, try another.
Next, we take the last two digits of the year of registration:
= count 2 digit regYear
And finally, we take the five digit number identifying the recording:
= count 5 digit recordingId
All that we need to do now is thread all these parsers together, and we’re done! One option could be to do the following:
data ISRC = ISRC { isrcCountryCode :: String
isrcRegCode :: String
, isrcRegYear :: Int
, isrcRecording :: Int
,deriving (Show)
}
= ISRC <$> countryCode
isrcParser <*> regCode
<*> fmap read regYear
<*> fmap read recordingId
<* eof
> parse isrcParser "" "USPR37300012"
λRight (ISRC { isrcCountryCode = "US"
= "PR3"
, isrcRegCode = 73
, isrcRegYear = 12}) , isrcRecording
Here I’ve used the Applicative
interface to build a parser that gradually builds up an ISRC
value, and also the Functor
instance to convert regCode
’s into an Int
, as ISRC
requires. I’ve finished off by using eof
to indicate that the parser should fail if there is any left over input.
However, this isn’t the only way to use parsec
, though it’s certainly a convenient way. In practice, I actually only wanted to perform validation and normalization, so I can get by just sequencing these parsers and then stitching things back together:
= mconcat <$>
isrcParser₂ sequence [ countryCode, regCode, regYear, recordingId ]
<* eof
> parse isrcParser₂ "" "USPR37300012"
λRight "USPR37300012"
This parser uses the sequence
combinator for Monad
s to run a bunch of parsers, and collect all their results - then I just mconcat
the results into a single String
.
Oh, did I just mention parsers are Monad
s? This gives us even more power - for example, we could branch out and run different parsers depending on what we’ve parsed so far. Not only that, but they are monad transformers, which gives us even more power. If we used a base monad such as IO
, it’s entirely possible to write a parser that looks up things in a database while it does the parsing. Maybe a little bit on the crazy side, but it’s all possible!
You can contact me via email at ollie@ocharles.org.uk or tweet to me @acid2. I share almost all of my work at GitHub. This post is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.