Regular Expressions as a Monad

Defining Regular Expressions Regex as a Haskell Type Regex as a Monad Using the Regex Monad Defining Atoms Matching Quantified Atoms and Metacharacters Haskell Hurts Your Brain Regex Regex Next Steps

Defining Regular Expressions

Below is an implementation of regular expressions as a Haskell Monad. If you've never seen monads or are unfamiliar with the functions-operating-on-functions style of programming, I'd recommend finding a tutorial on that first. This will read better as follow up to your favorite Monad tutorial as an example of what makes something a Monad and how you might go about implementing one.

Why regular expressions as a monad? A monad is at its simplest, functions with some property that can be composed into larger operations with the same property. Regular expressions are functions of type String -> Match. The have the additional property that: some of the input string is consumed, the remainder is retained for future matches; and multiple Matches can exist, a failure backtracks to attempt with a different match. And two bits of regex can be joined to make a larger one, /\w+/ + /\d*/ = /\w+\d*/.

Regex as a Haskell Type

What is a regular expression? In your source code it's just a string, maybe some special rules to ignore the usual escape sequences.

data Regex = Regex String

But there's no particular reason a regex has to be a string. The perl module Regexp::English lets you construct Regex by chaining method calls,

$integer = Regexp::English
            -> start_of_line
            -> optional
               -> literal( '-' )
               -> end
            -> multiple
               -> digit
               -> end

Once the compiler gets to it, there's some internal parse-tree representation.

data Regex = Regex [RegexAtom]

data RegexAtom
  = Literal String
  | CharClass String
  | Multiple Int Int RegexAtom
  | ...

We could certainly represent it like that, and create a virtual machine that interprets [RegexAtom]. That may be how you'd implement a more serious regex engine, but not that good of an approach if you're using it as springboard to understand monads. And it's not what a regex actually is, i.e. a function.

                .---------.
"some text" --> | /regex/ | --> match?
                '---------'

Which is how most non-perl languages represent it,

// C#, System.Text.RegularExpressions

var re = new Regex( "needle" );
var match = re.Matches( "haystack" );

And with all the tools a functional language provides that should be easy to work with,

data Regex = Regex (String -> Match)

To finish the definition, what's the result of executing a regex? In the simplest case, it can fail to match:

data Match = NoMatch

A successful match needs to include the remainder of the input (so that it can be composed with a following fragment of regex); the result of the match (in whatever format the match provides); and finally other ways the match could have happened (to backtrack and retry if a later match fails).

data Match r
    = NoMatch
    | Match String r (Match r)

Which looks familiar,

data List a
    = Nil
    | Cons a (List a)

Leaving the final type,

P<file://D:\projects\HSRE\Regex.hs#RegexType>

Regex as a Monad

Monads are defined by the >>= (a.k.a. bind) and return functions,

P<file://D:\projects\HSRE\Regex.hs#RegexMonad>

Return gives some arbitrary value the properties that makes the type a monad. For a regex, that's matching against a string, consuming the matched portion, and listing the other ways it could have matched for backtracking. For the no-op of return, we'll always match, leave the input untouched, and yield no other options to backtrack to.

P<file://D:\projects\HSRE\Regex.hs#RegexMonadReturn>

Bind takes two computations (the second of which may depend on the first) and join them together. So given two regex /a/ and /b/ what's the meaining of the combined regex /ab/? /b/ should only match against the remaining string that /a/ didn't consume. And if /b/ doesn't match, it backtracks to try another way /a/ could have matched.

The list monad reduces that to an elegant two lines: for each way the first regex can match, attempt to match the second starting from that point. Lazy evaluation takes care of the rest leaving the backtrack cases unevaluated until needed.

P<file://D:\projects\HSRE\Regex.hs#RegexMonadBind>

And then fill in the gaps; Monad implies Applicative and Functor. Both can trivially be derived using Control.Monad.

P<file://D:\projects\HSRE\Regex.hs#RegexClass>

A regex can also not match at all, and having the MonadFail instance will save some trouble later. An unconditional failure is a regex that never matches, ignores its input, and has no other ways to match.

P<file://D:\projects\HSRE\Regex.hs#RegexMonadFail>

Using the Regex Monad

Defining Atoms

Matching a single character consumes the next head of the input string. If there's nothing left in the input string it trivially can't match. If it's a character we want, that's a match and continue on with the remainder of the string.

anyChar :: Regex Char
anyChar = Regex anyChar'
  where
    anyChar' []     = []
    anyChar' (x:xs) = [MatchResult xs x]

literal :: Char -> Regex Char
literal c = Regex literal'
  where
    literal' []   = []
    literal' (x:xs)
      | x == c    = [MatchResult xs c]
      | otherwise = []

charClass :: [Char] -> Regex Char
charClass cc = Regex charClass'
  where
    charClass' []     = []
    charClass' (x:xs)
      | x `elem` cc   = [MatchResult xs x]
      | otherwise     = []

And that looks like a pattern, abstract it away,

P<file://D:\projects\HSRE\Regex.hs#RegexAtoms>

Which is enough to start using the library.

Matching

Most of the time we ask if a regex matches, we just want the first time it matches anywhere in the string. I.e., try it starting at the beginning and advance to the end until a match is found.

P<file://D:\projects\HSRE\Regex.hs#RegexFindFirst>

Alternatively, some libraries have a formulation that behaves like /^a$/, only returning a match if the entire string is consumed. No need to retry at later positions in the string. But we do need to assert that the end of the input has been reached.

P<file://D:\projects\HSRE\Regex.hs#RegexEOF>
P<file://D:\projects\HSRE\Regex.hs#RegexFindExact>

*Main> findFirst "a1b2" $ sequence [ digit, anyChar ] Just "1b" *Main> findFirst "a1b2" $ sequence [ digit, literal 'a' ] Nothing

Quantified Atoms and Metacharacters

/a|b/

One last poke at the internals of the Regex type is required here to define the backtracking logic when two or more regex fragments may match. This matches the first fragment of regex, if backtracking exhausts all the ways it could match, attempts to match the next fragment.

P<file://D:\projects\HSRE\Regex.hs#RegexAlt>

/a?/

And that's all that's needed to implement the quantifiers. /a?/ can either match a, or nothing at all, i.e. /a|/.

P<file://D:\projects\HSRE\Regex.hs#RegexOpt>

/a*/

/a*/ can be described as matching /a?/ until an empty match. And /a+/ transformed to /aa*/.

P<file://D:\projects\HSRE\Regex.hs#RegexMany>

Haskell Hurts Your Brain

Regex Regex

The justification of Regex a earlier was a bit of a hand-wave. Isn't the result of a Regex match always just the portion of the string matched? Sure you could think of saving some time by parsing and validating within the regex. Or the Perl6/Raku functionality of embedding logic in the middle of the regex,

P<file://D:\projects\HSRE\Regex.hs#Ip4>

*Main> findExact "127.0.0.1" ipAddress Just (127,0,0,1) *Main> findExact "256.0.0.1" ipAddress Nothing

A parser has the same type we've given Regex, taking a string and producing some type of structure. Regular expressions can parse any regular grammar, for instance, regular expressions. There's no reason why the a in Regex a couldn't be Regex right?

compile :: String -> Regex (Regex String)

The Haskell implementation will look very similar to the grammar as if we were writing any other parser. Only instead of building up a syntax tree or some other intermediate structure, the parser will return fragments of ready-to-execute Regex code.

regex           ::= qualified-atom+
                  | qualified-atom+ '|' regex

qualified-atom  ::= atom qualifier?

atom            ::= char-class
                  | group
                  | escape
                  | any-char
                  | literal

qualifier       ::= '?'
                  | '*'
                  | '+'
                  | '{' digit+ '}'
                  | '{' digit+ ',' digit* '}'

Nothing unusual with the atoms. The various regex symbols are translated into the single-character matchers defined above. To keep the types consistent liftM (:[]) turns the character-matchers into string-matchers.

P<file://D:\projects\HSRE\Regex.hs#CompileAtom>

The quantifiers really show off functional programming's strengths. The input is parsed to its literal meaning as a modifier on a fragment of regex, Regex String -> Regex String.

P<file://D:\projects\HSRE\Regex.hs#CompileQuant>

Apply the quantifier to the atom. And then bind all those individual bits into the the complete regex.

P<file://D:\projects\HSRE\Regex.hs#Compile>

We're Perl now,

P<file://D:\projects\HSRE\Regex.hs#Perl>

*Main> "Ping statistics for 23.239.13.224:" =~ "\\d+(\\.\\d+){3}" Just "23.239.13.224" *Main> "Packets: Sent = 4, Received = 4, Lost = 0 (0% loss)," =~ "\\d+(\\.\\d+){3}" Nothing

Next Steps

From here on, it's mostly just adding state and threading it through the bind function. Which in turn is mostly just re-implementing the State Monad to allow backtracking. Nothing particularly interesting from a monad perspective (in fact, my implementation changes little in the code above other than to replace Regex a with Regex s a.

There are some interesting things you can do with type classes to transparently support numbered captures vs. named captures. Or allow the regex to match against a stream of anything not just strings.

class RegexState'Basic s a where
  peek'input  :: a -> Maybe s
  take'input  :: a -> (Maybe s, a)
  at'start    :: a -> Bool
  at'end      :: a -> Bool

class RegexState'Capture k g a where
  set'capture :: k -> g -> a -> a
  get'capture :: k -> a -> Maybe g

Maybe some other time.

Programming

Games

Tools