Sunday, February 18, 2007

Modular arithmetic with regular expressions

Problem: Find a regular expression, compatible with GNU grep, that recognises strings of 1's and 0's that form the binary expansion of multiples of 7. (Consider the empty string to be zero.)

Solution:
Here's one (though not necessarily a good one):

^\(0\|1\(1\(01*00\)*01*010\|\(0\|1\(01*00
\)*01*011\)\(10\(01*00\)*01*011\|\(0\|11\
)1\)*\(10\(01*00\)*01*010\|\(0\|11\)0\)\)
*\(1\(01*00\)*1\|\(0\|1\(01*00\)*01*011\)
\(10\(01*00\)*01*011\|\(0\|11\)1\)*10\(01
*00\)*1\)\)*$


The standard approach to solving this problem is probably to construct a finite state automaton to compute the remainder modulo 7 of a binary string and then use the proof of Kleene's theorem to convert the automaton to a regular expression. But as I have some regular expression code left over from a previous post I thought I'd try a purely algebraic approach. Although it's ultimately equivalent to deriving a finite state automaton I thought it might be more fun to use intuitions about linear algebra to get to the solution. And anyway, drawing diagrams with HTML is hard...

Consider first recognising multiples of 3. Assume we have three regular expressions, A, B and C. A corresponds to multiples of 3, B corresponds to numbers that are 1 modulo 3 and C matches numbers that are 2 modulo 3. Then we know

A = A0+B1+1

B = C0+A1

C = B0+C1


First a note about notation: I'm using bold 0 and 1 to represent binary digits but using regular 0 to represent the regular expression matching nothing and regular 1 to represent the regular expression matching the empty string. a+b means match a or b. So we can interpret the equation for A, say, as saying that a multiple of 3 is the empty string, or a multiple of 3 followed by a 0, or a 1 after one more than a multiple of 3. Similar interpretations apply for the other two equations. We can rewrite this as

x = xM+v

where

M=
0 1 0
1 0 0
0 0 1

and

x = [A B C], v = [1 0 0]


We can now use our ordinary intuitions about solving linear systems. We want to solve for A so we need to eliminate B and C. Start by eliminating C. We have

C = B0+C1


If we were doing ordinary linear algebra we could say

C(1-1) = B0

C = B0(1-1)-1

but for regular expressions we can neither subtract nor divide. However, arguing informally now, (1-a)-1=1+a+a²+a³+…=a* so now we can write

C = B01*


Anyway, those are all the principles we need. Just eliminate each variable in turn by solving for it in terms of the other remaining variables and substituting the solution back into the remaining equations. You should be able to finish the job and solve for A. But you'll notice the work gets tedious for larger integers. That's where the code comes in.

It starts as before but we can delete a whole lot of material:


> import List
> data RE a = Symbol a | Star (RE a) | Sum [RE a] | Prod [RE a]
> deriving (Eq,Show,Ord)

> a <*> b = Prod [a,b]
> a <+> b = Sum [a,b]

> zero = Sum []
> one = Prod []


My simplification rules are still really poor. (Buggy, in fact, but not buggy in a way that leads to incorrect results.) That means we don't expect the final regular expressions to be in any way efficient. But I'm not trying to get the best solution, just a solution.


> simplify a = let b = simplify' a in if a==b then b else simplify b where
> simplify' (Prod [a]) = simplify a
> simplify' (Prod a) | zero `elem` a = zero
> simplify' (Prod (a:b)) = case simplify a of
> Prod x -> Prod $ x ++ map simplify b
> a -> Prod $ filter (/= one) $ map simplify (a:b)

> simplify' (Sum [a]) = a
> simplify' (Sum (a:b)) = case simplify a of
> Sum x -> Sum $ x ++ map simplify b
> a -> Sum $ nub $ sort $ filter (/= zero) $ map simplify (a:b)

> simplify' (Star (Sum [])) = one
> simplify' (Star (Prod [])) = one
> simplify' (Star a) = Star (simplify a)
> simplify' a = a


I'm representing the vector v and matrix M above as simple associative arrays whose elements are found using lookup. Elements not represented are assumed to be 0. fetch handles the sparse lookup:


> type Matrix s a = [((s,s),RE a)]
> type Vector s a = [(s,RE a)]

> fetch i x = maybe zero id $ lookup i x


solve implements exactly the solution method described above. The use of Star corresponds to my abuse of the binomial series for (1-a)-1 above. However rather than manipulate equations I'm always working in terms of matrices and vectors over the regular expressions:


> solve range i (m,v) = (
> [(j,simplify $ fetch (i,j) m <*> Star (fetch (i,i) m)) |
> j <- range, j/=i],
> simplify $ fetch i v <*> Star (fetch (i,i) m)
> )


substituteFor substitutes a solution to an equation given by solve into the remainder of the equations.


> substituteFor range k (v',c') m v = (
> [((i,j),(fetch j v' <*> fetch (i,k) m) <+> fetch (i,j) m) |
> i <- range, i/=k, j <- range, j/=k],
> [(i,(c' <*> fetch (i,k) m) <+> fetch i v) |
> i <- range, i /= k]
> )


As I describe above, a variable is eliminated by solving for it and then substituting:


> elim range s (m,v) = let (v',c') = solve range s (m,v)
> in substituteFor range s (v',c') m v



Now some code to output our regular expressions in a form compatible with grep:



> bracket n (s,n') = if n'>=n then s else "\\(" ++ s ++ "\\)"
> pretty (Symbol a) = ([a],9)
> pretty (Star a) = (bracket 8 (pretty a) ++ "*",8)
> pretty (Prod a) = (concat $ map (bracket 7) $ map pretty a,7)
> pretty (Sum a) = (foldr1 (\a b -> a ++ "\\|" ++ b) $
> map (bracket 6) $ map pretty a,6)


Now construct M and v for the equations describing the regular expressions for the classes of binary strings modulo n:


> m n = [(((2*i) `mod` n,i),Symbol '0')|i <- [0..n-1]] ++
> [(((2*i+1) `mod` n,i),Symbol '1')|i <- [0..n-1]]
> v n = [(0,one)]


To solve a complete set of equations we reduce it one equation at a time by eliminating variables:


> reduce [a] (m,v) = solve [a] a (m,v)
> reduce (a:as) (m,v) = reduce as $ elim (a:as) a (m,v)
> test n = let (_,r) = reduce [n-1,n-2..0] (m n,v n) in simplify r


Print regular expression to detect multiples of 7:


> main = putStrLn $ "^" ++ fst (pretty (test 7)) ++ "$"


Try integers besides 7. But note that the expressions get more and more complex. And of course you can use the equation solving approach for problems besides recognising multiples of integers.

PS I just noticed some of the indentation went wrong on translation to HTML. But the code appears to still compile fine with GHC.

7 comments:

  1. If I may nitpick (and this is not to detract from your wonderful articles), what grep accepts is a superset of regular languages, since it can match things like (.*)(\1) which are strings of the form ww

    or am I mistaken about grep.

    ReplyDelete
  2. (I think my previous comment disappeared during submission)

    nitpick
    doesn't grep accept a superset of regular languages, because of the ability to use the \1 operator ?

    /nitpick

    ReplyDelete
  3. What you say is true. But that doesn't detract from having solved the problem I set myself. Now Perl regular expressions might be different, I think you may be able to directly embed arithmetical expressions in those.

    ReplyDelete
  4. This doesn't work for 2?

    % runhaskell mod.lhs
    ^\(0\|01*1\)*$
    % grep '^\(0\|01*1\)*$'
    0
    0
    10
    ^D
    %

    ReplyDelete
  5. Anonymous,

    I don't have time to check right now yet but that looks like a regexp for spotting multiples of two written backwards. I wrote two versions of this code and I think I may have posted the backwards one. Oops. that's embarassing! I'll check and fix later :-)

    ReplyDelete
  6. Anonymous,

    Yup, it was reversed. I've already fixed that once, I must have accidentally undone the changes in my text edit. Thanks for alerting me! Anyway, it's now fixed.

    ReplyDelete
  7. This is a fun problem. Thanks!

    Here's my regexp. I wrote an automaton first and then sort of hand-crafted the regexp from it rather than went by Kleene's theorem (to make it less ugly and more economical):


    ^(0|1((0(01|111)*(00|110))*(1|0(01|111)*10))
    (01*0(1(10|000)*(11|0(000)*01))*
    (0|1(10|000)*0(000)*1))*1)*$

    ReplyDelete