Saturday, January 20, 2007

Fox's Ubiquitous Free Derivative pt. 2

As promised earlier, here are some Haskell definitions of derivatives that can be applied to regular expressions. They all give operators that might occasionally be of practical use. But what's more interesting to me is that we now have enough of an intuition pump going between types and regular expression that we can siphon off ideas from papers on types and use them with regexps.

First define a regular expression type:

> import List
> data RE a = Symbol a | Star (RE a) | Sum [RE a] | Prod [RE a]
> deriving (Eq,Show,Ord)

I use lists instead of binary operators to represent products and sums because
addition and multiplication of regular expressions is associative.

> a <*> b = Prod [a,b]
> a <+> b = Sum [a,b]

> zero = Sum []
> one = Prod []

Here's how to make a simple regular expression from a string

> re x = Prod $ map Symbol x

Exercise: Write a proper parser for these things so we can use conventional regexp notation!

Here's a test to see whether a regular expression matches the empty string:

> acceptsEmpty (Symbol _) = False
> acceptsEmpty (Star _) = True
> acceptsEmpty (Sum l) = or $ map acceptsEmpty l
> acceptsEmpty (Prod l) = and $ map acceptsEmpty l

I've decided to name these derivatives after people. So first I define the McBride derivative which is a generalisation of the dissection operator introduced in Clowns to the left.... The key idea is that if d = mcbride c j a then d (x <*> y) = (d x <*> j y) <+> (c x <*> d y). This needs to respect the associativity of <*> and one way to achieve this is by choosing c and j that are multiplicative homomorphisms so that, for example, c(x <*> y) = c x <*> c y.

> mcbride _ _ a (Symbol a') = if a==a' then one else zero
> mcbride c j a (Sum p) = Sum $ map (mcbride c j a) p
> mcbride _ _ _ (Prod []) = zero
> mcbride c j a (Prod x) = Sum $ map f (splits x) where
> f (x,y:ys) = Prod $ map c x ++ [mcbride c j a y] ++ map j ys
> mcbride c j a (Star p) = c (Star p) <*> mcbride c j a p <*> j (Star p)

And now I can define all the Leibniz, Fox and Brzozowski derivatives as special cases of the mcbride derivative. The latter each come in left- and right-handed pairs.

> delta p = if acceptsEmpty p then one else zero

> leibniz,fox_l,fox_r,brzozowski_l :: (Ord a,Eq a) => a -> RE a -> RE a
> leibniz a p = simplify $ mcbride id id a p
> fox_l a p = simplify $ mcbride id (const (one)) a p
> fox_r a p = simplify $ mcbride (const (one)) id a p
> brzozowski_l a p = simplify $ mcbride delta id a p
> brzozowski_r a p = simplify $ mcbride id delta a p

brzozowski_l is the usual Brzozowski derivative. If r matches some language, brzozowski_l a t matches the sublanguage of strings beginning with a and with that a removed. We can define regular expression matching in the usual way in terms of it:

> matches re s = acceptsEmpty (matches' re s) where
> matches' re [] = re
> matches' re (a:as) = matches' (brzozowski_l a re) as

Now we can start matching.

ex1 would normally be written as "abc|b*c".

> ex1 = (re "abc") <+> (Star (re "b") <*> Symbol 'c')
> test1 = map (matches ex1) ["abc","bbc","bbbbc","abbbbc"]

brzozowski_r chews off a symbol from the right:

> test2 = map (matches (brzozowski_r 'c' ex1)) ["abc","bbc","ab","bbbbb"]

leibniz a allows you to leave out one a from anywhere in the regexp.

ex3 matches "abracadabra" as it might be shouted by a wizard who might lengthen the second last 'a' to make sure the spirits can hear. ex4 matches such strings with precisely one missing 'r'.

ex3 would normally be written as "abracada*bra".

> ex3 = re "abracad" <*> Star (re "a") <*> re "bra"
> test3 = map (matches ex3) ["abracadabra","abracadaaaaabra"]
> ex4 = leibniz 'r' ex3
> test4 = map (matches ex4) ["abracadabra","abacadabra","abracaaaaadaba","abacadaba"]

fox_l a makes regular expressions accept any prefix ending just before an a. Combining fox_l and fox_r allows you to search for all elements of a language just between a chosen pair of symbols.

In ex5 we build a regexp for certain types of statement in a simple language and then pull out from it the part corresponding to substrings within parentheses. The undifferentiated part of ex5 would normally be written as "x = ([0-9]+[0-9]); *".

> ex5 = simplify $ fox_r '(' $ fox_l ')' $
> let digits = Star (Sum (map Symbol "0123456789")) in
> re "x = (" <*> digits <*> re "+" <*> digits <*> re ");" <+> Star (re " ")
> test5 = map (matches ex5) ["1+2","7-4"]

Open Question: There are more derivatives that can be constructed from the parts I've given. What do they do? What other multiplicative homomorphisms can you define? What derivatives do they give? (Hint: consider the property "R matches only a finite language").

Oh...I almost forgot the whole point of this. It's now possible to see that McBride's l-operator is a version of the Brzozowski derivative. In fact, the theorems he proves about l are pretty similar to the theorems that get proved about the Brzozowski derivative such as Theorem 4.4 here.

Some helper functions:

splits x returns all the ways of splitting x into pairs u and v such that u ++ v == x and v is non-empty.

> splits [] = []
> splits (a:as) = ([],a:as) : map f (splits as) where
> f (x,y) = (a:x,y)

Some (incomplete) simplification rules for regular expressions

> simplify a = let b = simplify' a in if a==b then b else simplify b where
> simplify' (Prod [a]) = a
> simplify' (Prod a) | zero `elem` a = zero
> simplify' (Prod (a:b)) = case a of
> Prod x -> Prod $ x ++ map simplify b
> a -> Prod $ filter (/= one) $ map simplify (a:b)
> simplify' (Sum [a]) = a
> simplify' (Sum (a:b)) = case a of
> Sum x -> Sum $ x ++ map simplify b
> a -> Sum $ nub $ sort $ filter (/= zero) $ map simplify (a:b)
> simplify' (Star a) = Star (simplify a)
> simplify' a = a

Labels: ,


Post a Comment

<< Home