Index

Programming

Imperative view of Monads

Regex as a Monad

Games

Shunt

Confused Carols

+・

Hope Descending

Pong in the Dark

Compoundle

HoloYacht

Mixed Music

Vlajorne

Triptych

Mifra

Tools

MangaDex Viewer

Hololive Music Lookup

Regular Expressions as a Monad

Defining Regular ExpressionsRegex as a Haskell TypeRegex as a MonadUsing the Regex MonadDefining AtomsMatchingQuantified Atoms and MetacharactersHaskell Hurts Your BrainRegex RegexNext Steps

Defining Regular Expressions

Below is an implementation of regular expressions as a Haskell Monad. If you've never seen monads or are unfamiliar with the functions-operating-on-functions style of programming, I'd recommend finding a tutorial on that first. This will read better as follow up to your favorite Monad tutorial as an example of what makes something a Monad and how you might go about implementing one.

Why regular expressions as a monad? A monad is at its simplest, functions with some property that can be composed into larger operations with the same property. Regular expressions are functions of type String -> Match. The have the additional property that: some of the input string is consumed, the remainder is retained for future matches; and multiple Matches can exist, a failure backtracks to attempt with a different match. And two bits of regex can be joined to make a larger one, /\w+/ + /\d*/ = /\w+\d*/.

Regex as a Haskell Type

What is a regular expression? In your source code it's just a string, maybe some special rules to ignore the usual escape sequences.

data Regex = Regex String

But there's no particular reason a regex has to be a string. The perl module Regexp::English lets you construct Regex by chaining method calls,

$integer = Regexp::English
            -> start_of_line
            -> optional
               -> literal( '-' )
               -> end
            -> multiple
               -> digit
               -> end

Once the compiler gets to it, there's some internal parse-tree representation.

data Regex = Regex [RegexAtom]

data RegexAtom
  = Literal String
  | CharClass String
  | Multiple Int Int RegexAtom
  | ...

We could certainly represent it like that, and create a virtual machine that interprets [RegexAtom]. That may be how you'd implement a more serious regex engine, but not that good of an approach if you're using it as springboard to understand monads. And it's not what a regex actually is, i.e. a function.

                .---------.
"some text" --> | /regex/ | --> match?
                '---------'

Which is how most non-perl languages represent it,

// C#, System.Text.RegularExpressions

var re = new Regex( "needle" );
var match = re.Matches( "haystack" );

And with all the tools a functional language provides that should be easy to work with,

data Regex = Regex (String -> Match)

To finish the definition, what's the result of executing a regex? In the simplest case, it can fail to match:

data Match = NoMatch

A successful match needs to include the remainder of the input (so that it can be composed with a following fragment of regex); the result of the match (in whatever format the match provides); and finally other ways the match could have happened (to backtrack and retry if a later match fails).

data Match r
    = NoMatch
    | Match String r (Match r)

Which looks familiar,

data List a
    = Nil
    | Cons a (List a)

Leaving the final type,

data Regex r = Regex (String -> [Match r])

data Match r = Match String r

runRegex (Regex re) = re

Regex as a Monad

Monads are defined by the >>= (a.k.a. bind) and return functions,

instance Monad Regex where
  (>>=)   = bindRegex
  return  = returnRegex

Return gives some arbitrary value the properties that makes the type a monad. For a regex, that's matching against a string, consuming the matched portion, and listing the other ways it could have matched for backtracking. For the no-op of return, we'll always match, leave the input untouched, and yield no other options to backtrack to.

returnRegex a = Regex $ \s -> [Match s a]

Bind takes two computations (the second of which may depend on the first) and join them together. So given two regex /a/ and /b/ what's the meaining of the combined regex /ab/? /b/ should only match against the remaining string that /a/ didn't consume. And if /b/ doesn't match, it backtracks to try another way /a/ could have matched.

The list monad reduces that to an elegant two lines: for each way the first regex can match, attempt to match the second starting from that point. Lazy evaluation takes care of the rest leaving the backtrack cases unevaluated until needed.

bindRegex re'a fre'b = Regex $ \s -> do
  -- For each way re'a can match s
  (Match s' a) <- runRegex re'a s
  -- Match re'b against the remaining text s'
  runRegex (fre'b a) s'

And then fill in the gaps; Monad implies Applicative and Functor. Both can trivially be derived using Control.Monad.

instance Functor Regex where
  fmap = liftM

instance Applicative Regex where
  pure  = return
  (<*>) = liftM2 ($)

A regex can also not match at all, and having the MonadFail instance will save some trouble later. An unconditional failure is a regex that never matches, ignores its input, and has no other ways to match.

instance MonadFail Regex where
  fail s = Regex $ const []

Using the Regex Monad

Defining Atoms

Matching a single character consumes the next head of the input string. If there's nothing left in the input string it trivially can't match. If it's a character we want, that's a match and continue on with the remainder of the string.

anyChar :: Regex Char
anyChar = Regex anyChar'
  where
    anyChar' []     = []
    anyChar' (x:xs) = [MatchResult xs x]

literal :: Char -> Regex Char
literal c = Regex literal'
  where
    literal' []   = []
    literal' (x:xs)
      | x == c    = [MatchResult xs c]
      | otherwise = []

charClass :: [Char] -> Regex Char
charClass cc = Regex charClass'
  where
    charClass' []     = []
    charClass' (x:xs)
      | x `elem` cc   = [MatchResult xs x]
      | otherwise     = []

And that looks like a pattern, abstract it away,

matchChar :: (Char -> Bool) -> Regex Char
matchChar p = Regex matchChar'
  where
    matchChar' []     = []
    matchChar' (x:xs)
      | p x           = [Match xs x]
      | otherwise     = []

anyChar :: Regex Char
anyChar = matchChar $ const True

literal :: Char -> Regex Char
literal = matchChar . (==)

charClass :: [Char] -> Regex Char
charClass = matchChar . (flip elem)

inverseCharClass :: [Char] -> Regex Char
inverseCharClass = matchChar . (flip notElem)

digit = matchChar isDigit
whitespace = matchChar isSpace

-- and so on...

Which is enough to start using the library.

Matching

Most of the time we ask if a regex matches, we just want the first time it matches anywhere in the string. I.e., try it starting at the beginning and advance to the end until a match is found.

findFirst [] re = case runRegex re [] of
  []              -> Nothing
  ((Match _ r):_) -> Just r
findFirst xs re = case runRegex re xs of
  []              -> findFirst (tail xs) re
  ((Match _ r):_) -> Just r

Alternatively, some libraries have a formulation that behaves like /^a$/, only returning a match if the entire string is consumed. No need to retry at later positions in the string. But we do need to assert that the end of the input has been reached.

lookAhead :: Regex String
lookAhead = Regex $ \s -> [Match s s]

matchEnd :: Regex ()
matchEnd = do
  s <- lookAhead
  if null s
    then return ()
    else fail ""

findExact xs re = case runRegex re' xs of
    []                    -> Nothing
    ((Match _ r):_) -> Just r
  where
    re' = do
      r <- re
      matchEnd
      return r

*Main> findFirst "a1b2" $ sequence [ digit, anyChar ] Just "1b" *Main> findFirst "a1b2" $ sequence [ digit, literal 'a' ] Nothing

Quantified Atoms and Metacharacters

/a|b/

One last poke at the internals of the Regex type is required here to define the backtracking logic when two or more regex fragments may match. This matches the first fragment of regex, if backtracking exhausts all the ways it could match, attempts to match the next fragment.

alternate :: [Regex a] -> Regex a
alternate res = Regex $ \s -> concatMap (flip runRegex s) res

/a?/

And that's all that's needed to implement the quantifiers. /a?/ can either match a, or nothing at all, i.e. /a|/.

optional :: Regex a -> Regex (Maybe a)
optional re = alternate [liftM Just $ re, return Nothing]

/a*/

/a*/ can be described as matching /a?/ until an empty match. And /a+/ transformed to /aa*/.

many :: Regex a -> Regex [a]
many re = do
  r <- optional re
  case r of
    Nothing -> return []
    Just r' -> liftM (r':) $ many re

many1 :: Regex a -> Regex [a]
many1 re = do
  r <- re
  rs <- many re
  return $ r : rs

Haskell Hurts Your Brain

Regex Regex

The justification of Regex a earlier was a bit of a hand-wave. Isn't the result of a Regex match always just the portion of the string matched? Sure you could think of saving some time by parsing and validating within the regex. Or the Perl6/Raku functionality of embedding logic in the middle of the regex,

octet :: Regex Int
octet = do
  digs <- many1 digit
  let num = read digs
  if num < 256
    then return num
    else fail "value out of range"

ipAddress = do
  a1 <- octet
  literal '.'
  a2 <- octet
  literal '.'
  a3 <- octet
  literal '.'
  a4 <- octet
  return (a1,a2,a3,a4)

*Main> findExact "127.0.0.1" ipAddress Just (127,0,0,1) *Main> findExact "256.0.0.1" ipAddress Nothing

A parser has the same type we've given Regex, taking a string and producing some type of structure. Regular expressions can parse any regular grammar, for instance, regular expressions. There's no reason why the a in Regex a couldn't be Regex right?

compile :: String -> Regex (Regex String)

The Haskell implementation will look very similar to the grammar as if we were writing any other parser. Only instead of building up a syntax tree or some other intermediate structure, the parser will return fragments of ready-to-execute Regex code.

regex           ::= qualified-atom+
                  | qualified-atom+ '|' regex

qualified-atom  ::= atom qualifier?

atom            ::= char-class
                  | group
                  | escape
                  | any-char
                  | literal

qualifier       ::= '?'
                  | '*'
                  | '+'
                  | '{' digit+ '}'
                  | '{' digit+ ',' digit* '}'

Nothing unusual with the atoms. The various regex symbols are translated into the single-character matchers defined above. To keep the types consistent liftM (:[]) turns the character-matchers into string-matchers.

compileAtom :: Regex (Regex String)
compileAtom = do
  ch <- anyChar
  case ch of
    '['  -> compileCharClass
    '('  -> do
      re <- compileRegex
      literal ')'
      return re
    '\\' -> compileEscape
    '.'  -> return $ liftM (:[]) anyChar
    _    -> if not $ isAlphaNum ch
              then fail "Illegal metacharacter"
              else return $ liftM (:[]) $ literal ch

compileCharClass :: Regex (Regex String)
compileCharClass = do
  cc <- many $ matchChar (/=']')
  literal ']'

  case cc of
    []        -> fail "Empty character class"
    ['^']     -> return $ liftM (:[]) $ literal '^'
    ('^':cc') -> return $ liftM (:[]) $ inverseCharClass cc'
    _         -> return $ liftM (:[]) $ charClass cc

compileEscape :: Regex (Regex String)
compileEscape = do
  esc <- anyChar
  case esc of
    'd'       -> return $ liftM (:[]) $ matchChar isDigit
    'w'       -> return $ liftM (:[]) $ matchChar isAlphaNum
    's'       -> return $ liftM (:[]) $ matchChar isSpace
    'D'       -> return $ liftM (:[]) $ matchChar $ not . isDigit
    'W'       -> return $ liftM (:[]) $ matchChar $ not . isAlphaNum
    'S'       -> return $ liftM (:[]) $ matchChar $ not . isSpace
    'x'       -> do
        digs <- exactlyN 2 $ matchChar isHexDigit
        return $ liftM (:[]) $ literal $ chr $ read $ '0':'x':digs
    _ | isAlphaNum esc  -> fail "Undefined escape sequence"
      | otherwise       -> return $ liftM (:[]) $ literal esc

The quantifiers really show off functional programming's strengths. The input is parsed to its literal meaning as a modifier on a fragment of regex, Regex String -> Regex String.

compileQuantifier :: Regex (Regex String -> Regex String)
compileQuantifier = do
  quant <- optional $ charClass "?*+{"
  case quant of
    Nothing   -> return id
    Just '?'  -> return $ (liftM $ fromMaybe "") . optional
    Just '*'  -> return $ (liftM concat) . many
    Just '+'  -> return $ (liftM concat) . many1
    Just '{'  -> do
      min   <- liftM read $ many1 digit
      comma <- optional $ literal ','
      -- By making is illegible, we avoid the need for an if or case
      mmax  <- sequence $ flip fmap comma $ const $ many digit
      literal '}'

      -- Match the minimum number of times
      let matchN = (exactlyN min)

      -- Match the variable part
      let matchM = case mmax of
            -- No comma, fixed quantity
            Nothing   -> const $ return []
            -- Comma, no max limit
            Just []   -> many
            -- Maximum matches
            Just max' -> atMostN $ read max'

      return $ \re -> do
        r1 <- matchN re
        r2 <- matchM re
        return $ (concat r1) ++ (concat r2)

Apply the quantifier to the atom. And then bind all those individual bits into the the complete regex.

compileRegex :: Regex (Regex String)
compileRegex = do
    res <- compileRegex'
    return $ alternate res
  where
    compileRegex' :: Regex [Regex String]
    compileRegex' = do
      re  <- many1 compileQuantifiedAtom
      let re' = liftM concat $ sequence re
      alt <- optional $ literal '|'
      if isNothing alt
        then return [re']
        else liftM (re':) compileRegex'

compileQuantifiedAtom :: Regex (Regex String)
compileQuantifiedAtom = do
  atom <- compileAtom
  quant <- compileQuantifier
  return $ quant atom

We're Perl now,

t =~ re = findFirst t $ fromJust $ findExact re compileRegex

*Main> "Ping statistics for 23.239.13.224:" =~ "\\d+(\\.\\d+){3}" Just "23.239.13.224" *Main> "Packets: Sent = 4, Received = 4, Lost = 0 (0% loss)," =~ "\\d+(\\.\\d+){3}" Nothing

Next Steps

From here on, it's mostly just adding state and threading it through the bind function. Which in turn is mostly just re-implementing the State Monad to allow backtracking. Nothing particularly interesting from a monad perspective (in fact, my implementation changes little in the code above other than to replace Regex a with Regex s a.

There are some interesting things you can do with type classes to transparently support numbered captures vs. named captures. Or allow the regex to match against a stream of anything not just strings.

class RegexState'Basic s a where
  peek'input  :: a -> Maybe s
  take'input  :: a -> (Maybe s, a)
  at'start    :: a -> Bool
  at'end      :: a -> Bool

class RegexState'Capture k g a where
  set'capture :: k -> g -> a -> a
  get'capture :: k -> a -> Maybe g

Maybe some other time.