Saturday, January 24, 2009

Fast incremental regular expression matching with monoids

The Problem


Consider this problem: Fix a regular expression R. Suppose you have a string of length N. There's not much you can do about it, you'll likely have to scan all N characters to test so see if the string matches R. But once you've performed the test, how fast can you test the string again if a small edit is made to it? It seems that in general you'd have to rescan the entire string, or at least rescan from where the edit was made. But it turns out that you can do regular expression matching incrementally so that for many changes you might make to the string, you only require O(log N) time to recompute whether the string matches. This is true even if characters at opposite ends of the string interact to make a succcessful match. What's more, it's remarkably straightforward to implement if we make use of fingertrees and monoids.

I'm going to assume a bit of background for which resources can be found on the web: understanding some basics about monoids, understanding apfelmus's inspiring introduction to fingertrees or the original paper, and you'll need to be completely comfortable with the idea of compiling regular expressions to finite state machines.

As usual, this is literate Haskell. But you need to have the fingertree package installed and we need a bunch of imports.


> {-# LANGUAGE TypeSynonymInstances,FlexibleInstances,MultiParamTypeClasses #-}
> import qualified Data.Array as B
> import Data.Array.Unboxed as U
> import Data.Foldable
> import Data.Monoid
> import Data.FingerTree hiding (fromList)
> import qualified Data.List as L


A Finite State Machine


So let's start with an example regular expression: .*(.*007.*).*. We're looking for "007" enclosed between parentheses, but the parentheses could be millions of characters apart.

A standard technique for finding regular expressions is to compile them to a finite state automaton. It takes quite a bit of code to do that, but it is completely standard. So rather than do that, here's a finite state machine I constructed by hand for this regular expression:

I've used the convention that an unlabelled edge corresponds to any input that isn't matched by another labelled edge. We can express the transitions as a function fsm.


> fsm 0 '(' = 1
> fsm 0 _ = 0

> fsm 1 '0' = 2
> fsm 1 _ = 1

> fsm 2 '0' = 3
> fsm 2 _ = 1

> fsm 3 '7' = 4
> fsm 3 '0' = 3
> fsm 3 _ = 1

> fsm 4 ')' = 5
> fsm 4 _ = 4

> fsm 5 _ = 5


The initial state is 0 and a match corresponds to state 5.

We can test a string in the standard way using


> matches s = Prelude.foldl fsm 0 s==5


Try matches "(00 7)" and matches "He(007xxxxxxxxxxxx)llo".

We can thing of the inputs as being functions acting on the automaton. Each input character is a function that maps the automaton from one state to another. We could use the Haskell composition function, (.), to compose these functions. But (.) doesn't really do anything, f . g is just a closure that says "when the time comes, apply g and then f.

On the other hand we could tabulate our transition functions as follows:


> tabulate f = array (0,5) [(x,f x) | x <- range (0,5)]


We have one such tabulated function for each letter in our alphabet:


> letters = array (' ','z') [(i,tabulate (flip fsm i)) | i <- range (' ','z')]


Given two tabulated functions we can easily form the table of the composition function. In fact, our tabulated functions form a monoid with mappend for composition. I used unboxed arrays for performance:


> type Table = UArray Int Int

> instance Monoid Table where
> mempty = tabulate id
> f `mappend` g = tabulate (\state -> (U.!) g ((U.!) f state))


Note that we've cheated a bit. An object of type Table could be any array of Ints indexed by Int. But if we promise to only build arrays indexed by elements of [0..5] and containing elements of the same range then our claim to monoidhood is valid.

Given any string, we can compute whether it matches our regular expression by looking up the corresponding Table in our letters array, composing them, and then checking if the tabulated function maps the initial state 0 to the final state 5:


> matches' s = table!0==5 where
> table = mconcat (map ((B.!) letters) s)


This is slower and more complex than our original implementation of matches. But what we've now done is 'tease out' a monoid structure from the problem. If we store a string as a sequence of characters represented by a fingertree, we can store in each subtree the element of Table corresponding to the substring it represents. Every time the tree is rebalanced we need to recompute the corresponding Tables. But that's fine, it typically involves only O(log N) operations, and we don't need to write any code, the fingertree will do it for us automatically. Once we've done this, we end up with a representation of strings with the property that we always know what the corresponding Table is. We can freely split and join such trees knowing that the Table will always be up to date.

The only slight complication is that I want to be able to randomly access the nth character of the tree. apfelmus explains that in his post. The change I need to make is that I'm going to use both the Size monoid and the Table monoid, so I need the product monoid.


> data Elem a = Elem { getElem :: a } deriving Show
> data Size = Size { getSize :: Int } deriving (Eq,Ord,Show)

> instance Monoid Size where
> mempty = Size 0
> Size m `mappend` Size n = Size (m+n)


We need to implement measure as in the fingertree paper:


> instance Measured (Size,Table) (Elem Char) where
> measure (Elem a) = (Size 1,(B.!) letters a)


A Fingertree


And now we can define our strings as:


> type FingerString = FingerTree (Size,Table) (Elem Char)


The insertion routine is more or less what's in the paper:


> insert :: Int -> Char -> FingerString -> FingerString
> insert i c z = l >< (Elem c <| r) where (l,r) = split (\(Size n,_) -> n>i) z


Note how I project out the size from the product monoid in order to insert at the correct position.

Here's an example string. Adjust the length to suit your memory and CPU horsepower:


> fromList = L.foldl' (|>) empty
> string = fromList (map Elem $ take 1000000 $ cycle "the quick brown fox jumped over the lazy dog")


(I use a strict form of fromList to ensure the tree actually gets built.)

The actual match function simply projects out the second component of the monoid and again tests to see if it maps the initial state to the final state:


> matches007 string = ((U.!) (snd (measure string)) 0)==5


An Interactive Loop


I recommend compiling with optimisation, something like ghc --make -O5 -o regexp regexp.lhs:


> loop string = do
> print $ "matches? " ++ show (matches007 string)
> print "(Position,Character)"
> r <- getLine
> let (i,c) = read r
> loop $ insert i c string

> main = do
> loop string


Now you can run this interactively. Input values like (100,'f') to insert an 'f' at position 100. It can take a good few seconds to compute the initial tree, but after that the matching process is instantaneous. (Actually, the second match might take a few seconds, that's because despite the foldl' the tree hasn't been fully built.)

A suitable sample input is:

(100,'(')
(900000,')')
(20105,'0')
(20106,'0')
(20107,'7')


Discussion


Note there is quite an overhead for this example. I'm storing an entire Table for each character. But you can easily store chunks of string (like in a rope). This means that some chunks will be rescanned when a string is edited - but rescanning a 1K chunk, say, is a lot less expensive than scanning a gigabyte file in its entirety. Working in blocks will probably speed up the initial scan too, a much smaller tree needs to be built.

When Hinze and Patterson originally wrote the fingertree paper they were motivated by parallel prefix sum methods. Just about any parallel prefix algorithm can be converted to an incremental algorithm using fingertrees. This article is based on the idea of doing this with the parallel lexing scheme described by Hillis and Steele in their classic Connection Machine paper.

So why would you want to match against a fixed regular expression like this? Well this method extends to a full blown incremental lexer. This will lex quickly even if placing a character in a string changes the type of lexemes billions of characters away. See the Hillis and Steele paper for details.

Note there's nothing especially Haskelly about this code except that Haskell made it easy to prototype. You can do this in C++, say, using mutable red-black trees.

10 comments:

apfelmus said...

Great post, as usual. :-)

In other words, deterministic finite state automata and thus regular expressions are actually monoid morphisms from the free monoid over the alphabet to a finite monoid, namely the monoid of endomorphisms on the states.

I finally remember this connection from a past automata & languages lesson. For instance, the converse is true as well, every homomorphism to a finite monoid gives rise to a finite state automaton. (Construction: take the monoid itself as state space, every element corresponds to an endomorphism via left multiplication). I think this can and is used to minimize finite state automata.

apfelmus said...

Great post, as usual. :-)

In other words, deterministic finite state automata and thus regular expressions are actually monoid morphisms from the free monoid over the alphabet to a finite monoid, namely the monoid of endomorphisms on the states.

I finally remember this connection from a past automata & languages lesson. For instance, the converse is true as well, every homomorphism to a finite monoid gives rise to a finite state automaton. (Construction: take the monoid itself as state space, every element corresponds to an endomorphism via left multiplication). I think this can and is used to minimize finite state automata.

sigfpe said...

apfelmus,

Fingertrees basically cache homomorphisms from the free monoid.

There's an extra twist you can squeeze out of this formalism. You can label the transitions of the automaton with output values that live in a monoid. Then you can do things like counting and minimisation while matching - eg. to count the number of lexems, or find the longest one, or find the longest string between a certain pair and so on.

Anonymous said...

Minor nitpick (and I could be wrong): doesn't the self loop of the 3rd node of your fsm allow it to actually match /.*(.*00*7.*).*/?

Aaron Denney said...

apflemus: I don't suppose you remember how it can be used to minimize FSAs?

Edward Kmett said...

Anonymous, nope. I think you meant + not *. It is needed in order to handle cases where you have more than two zeros preceding the 7.

It lets you match:

/.*(.*000*7.*).*/?

but that matches the same inputs as

/.*(.*007.*).*/?

because

.*000* and .*00 match exactly the same inputs.

If the 0* in .*000* matched any characters then we know that the .* could have munched more characters.

If the .* in .*00 matches something that ends in a run of 0's then .*000* could match all but the first two in the 0* portion.

leithaus said...

Dan,

i haven't read the fingertree papers, yet. Is there an account of how the finger tree representation/analysis relates to quantaloids?

Best wishes,

--greg

jkff said...

Dan, I just published an article that expands on the ideas from this blogpost and builds a Java library for incremental matching of multiple regular expressions: http://jkff.info/articles/ire (it was actually published in the Russian functional programming journal http://fprog.ru/ but I recently got around to finish the translation)

Alec Taylor said...

Interesting results.

Two things:

1. Can you do some benchmarks against the Rope implementation? - http://jkff.info/articles/ire

2. You mention Red/Black trees for C++ at the end there. Wouldn't that adversely affect the complexities? - Why wouldn't you use Finger Trees in C++? (there is a Masters theses floating around on the interwebs with an implementation of the 2-3 tree variant)

Michael Roger said...

"We can thing of the inputs " typo

Blog Archive