Regular Expressions as a Monad
Defining Regular ExpressionsRegex as a Haskell TypeRegex as a MonadUsing the Regex MonadDefining AtomsMatchingQuantified Atoms and MetacharactersHaskell Hurts Your BrainRegex RegexNext Steps
Defining Regular Expressions
Below is an implementation of regular expressions as a Haskell Monad. If you've never seen monads or are unfamiliar with the functions-operating-on-functions style of programming, I'd recommend finding a tutorial on that first. This will read better as follow up to your favorite Monad tutorial as an example of what makes something a Monad and how you might go about implementing one.
Why regular expressions as a monad? A monad is at its simplest, functions with some property that can be composed into larger operations with the same property. Regular expressions are functions of type String -> Match. The have the additional property that: some of the input string is consumed, the remainder is retained for future matches; and multiple Matches can exist, a failure backtracks to attempt with a different match. And two bits of regex can be joined to make a larger one, /\w+/ + /\d*/ = /\w+\d*/.
Regex as a Haskell Type
What is a regular expression? In your source code it's just a string, maybe some special rules to ignore the usual escape sequences.
data Regex = Regex String
But there's no particular reason a regex has to be a string. The perl module Regexp::English lets you construct Regex by chaining method calls,
$integer = Regexp::English -> start_of_line -> optional -> literal( '-' ) -> end -> multiple -> digit -> end
Once the compiler gets to it, there's some internal parse-tree representation.
data Regex = Regex [RegexAtom] data RegexAtom = Literal String | CharClass String | Multiple Int Int RegexAtom | ...
We could certainly represent it like that, and create a virtual machine that interprets [RegexAtom]. That may be how you'd implement a more serious regex engine, but not that good of an approach if you're using it as springboard to understand monads. And it's not what a regex actually is, i.e. a function.
.---------. "some text" --> | /regex/ | --> match? '---------'
Which is how most non-perl languages represent it,
// C#, System.Text.RegularExpressions var re = new Regex( "needle" ); var match = re.Matches( "haystack" );
And with all the tools a functional language provides that should be easy to work with,
data Regex = Regex (String -> Match)
To finish the definition, what's the result of executing a regex? In the simplest case, it can fail to match:
data Match = NoMatch
A successful match needs to include the remainder of the input (so that it can be composed with a following fragment of regex); the result of the match (in whatever format the match provides); and finally other ways the match could have happened (to backtrack and retry if a later match fails).
data Match r = NoMatch | Match String r (Match r)
Which looks familiar,
data List a = Nil | Cons a (List a)
Leaving the final type,
P<file://D:\projects\HSRE\Regex.hs#RegexType>
Regex as a Monad
Monads are defined by the >>= (a.k.a. bind) and return functions,
P<file://D:\projects\HSRE\Regex.hs#RegexMonad>
Return gives some arbitrary value the properties that makes the type a monad. For a regex, that's matching against a string, consuming the matched portion, and listing the other ways it could have matched for backtracking. For the no-op of return, we'll always match, leave the input untouched, and yield no other options to backtrack to.
P<file://D:\projects\HSRE\Regex.hs#RegexMonadReturn>
Bind takes two computations (the second of which may depend on the first) and join them together. So given two regex /a/ and /b/ what's the meaining of the combined regex /ab/? /b/ should only match against the remaining string that /a/ didn't consume. And if /b/ doesn't match, it backtracks to try another way /a/ could have matched.
The list monad reduces that to an elegant two lines: for each way the first regex can match, attempt to match the second starting from that point. Lazy evaluation takes care of the rest leaving the backtrack cases unevaluated until needed.
P<file://D:\projects\HSRE\Regex.hs#RegexMonadBind>
And then fill in the gaps; Monad implies Applicative and Functor. Both can trivially be derived using Control.Monad.
P<file://D:\projects\HSRE\Regex.hs#RegexClass>
A regex can also not match at all, and having the MonadFail instance will save some trouble later. An unconditional failure is a regex that never matches, ignores its input, and has no other ways to match.
P<file://D:\projects\HSRE\Regex.hs#RegexMonadFail>
Using the Regex Monad
Defining Atoms
Matching a single character consumes the next head of the input string. If there's nothing left in the input string it trivially can't match. If it's a character we want, that's a match and continue on with the remainder of the string.
anyChar :: Regex Char anyChar = Regex anyChar' where anyChar' [] = [] anyChar' (x:xs) = [MatchResult xs x] literal :: Char -> Regex Char literal c = Regex literal' where literal' [] = [] literal' (x:xs) | x == c = [MatchResult xs c] | otherwise = [] charClass :: [Char] -> Regex Char charClass cc = Regex charClass' where charClass' [] = [] charClass' (x:xs) | x `elem` cc = [MatchResult xs x] | otherwise = []
And that looks like a pattern, abstract it away,
P<file://D:\projects\HSRE\Regex.hs#RegexAtoms>
Which is enough to start using the library.
Matching
Most of the time we ask if a regex matches, we just want the first time it matches anywhere in the string. I.e., try it starting at the beginning and advance to the end until a match is found.
P<file://D:\projects\HSRE\Regex.hs#RegexFindFirst>
Alternatively, some libraries have a formulation that behaves like /^a$/, only returning a match if the entire string is consumed. No need to retry at later positions in the string. But we do need to assert that the end of the input has been reached.
P<file://D:\projects\HSRE\Regex.hs#RegexEOF> P<file://D:\projects\HSRE\Regex.hs#RegexFindExact>
*Main> findFirst "a1b2" $ sequence [ digit, anyChar ] Just "1b" *Main> findFirst "a1b2" $ sequence [ digit, literal 'a' ] Nothing
Quantified Atoms and Metacharacters
/a|b/
One last poke at the internals of the Regex type is required here to define the backtracking logic when two or more regex fragments may match. This matches the first fragment of regex, if backtracking exhausts all the ways it could match, attempts to match the next fragment.
P<file://D:\projects\HSRE\Regex.hs#RegexAlt>
/a?/
And that's all that's needed to implement the quantifiers. /a?/ can either match a, or nothing at all, i.e. /a|/.
P<file://D:\projects\HSRE\Regex.hs#RegexOpt>
/a*/
/a*/ can be described as matching /a?/ until an empty match. And /a+/ transformed to /aa*/.
P<file://D:\projects\HSRE\Regex.hs#RegexMany>
Haskell Hurts Your Brain
Regex Regex
The justification of Regex a earlier was a bit of a hand-wave. Isn't the result of a Regex match always just the portion of the string matched? Sure you could think of saving some time by parsing and validating within the regex. Or the Perl6/Raku functionality of embedding logic in the middle of the regex,
P<file://D:\projects\HSRE\Regex.hs#Ip4>
*Main> findExact "127.0.0.1" ipAddress Just (127,0,0,1) *Main> findExact "256.0.0.1" ipAddress Nothing
A parser has the same type we've given Regex, taking a string and producing some type of structure. Regular expressions can parse any regular grammar, for instance, regular expressions. There's no reason why the a in Regex a couldn't be Regex right?
compile :: String -> Regex (Regex String)
The Haskell implementation will look very similar to the grammar as if we were writing any other parser. Only instead of building up a syntax tree or some other intermediate structure, the parser will return fragments of ready-to-execute Regex code.
regex ::= qualified-atom+ | qualified-atom+ '|' regex qualified-atom ::= atom qualifier? atom ::= char-class | group | escape | any-char | literal qualifier ::= '?' | '*' | '+' | '{' digit+ '}' | '{' digit+ ',' digit* '}'
Nothing unusual with the atoms. The various regex symbols are translated into the single-character matchers defined above. To keep the types consistent liftM (:[]) turns the character-matchers into string-matchers.
P<file://D:\projects\HSRE\Regex.hs#CompileAtom>
The quantifiers really show off functional programming's strengths. The input is parsed to its literal meaning as a modifier on a fragment of regex, Regex String -> Regex String.
P<file://D:\projects\HSRE\Regex.hs#CompileQuant>
Apply the quantifier to the atom. And then bind all those individual bits into the the complete regex.
P<file://D:\projects\HSRE\Regex.hs#Compile>
We're Perl now,
P<file://D:\projects\HSRE\Regex.hs#Perl>
*Main> "Ping statistics for 23.239.13.224:" =~ "\\d+(\\.\\d+){3}" Just "23.239.13.224" *Main> "Packets: Sent = 4, Received = 4, Lost = 0 (0% loss)," =~ "\\d+(\\.\\d+){3}" Nothing
Next Steps
From here on, it's mostly just adding state and threading it through the bind function. Which in turn is mostly just re-implementing the State Monad to allow backtracking. Nothing particularly interesting from a monad perspective (in fact, my implementation changes little in the code above other than to replace Regex a with Regex s a.
There are some interesting things you can do with type classes to transparently support numbered captures vs. named captures. Or allow the regex to match against a stream of anything not just strings.
class RegexState'Basic s a where peek'input :: a -> Maybe s take'input :: a -> (Maybe s, a) at'start :: a -> Bool at'end :: a -> Bool class RegexState'Capture k g a where set'capture :: k -> g -> a -> a get'capture :: k -> a -> Maybe g
Maybe some other time.