## Saturday, June 30, 2007

### Monads from Algebra and the the Gray Code from Groups

There is a close association between algebraic structures and monads. I've mentioned this in passing before but I think it's interesting to work out some details. Among other things, it gives us a way of converting your favourite algebraic structures into domain specific languages embedded in Haskell.

First an aside. Haskell monads are in some sense only an approximation to mathematical monads. In fact, Haskell is only an approximation to mathematics. It's easy to define a Haskell function f, say, such that x == y but f x /= f y. Once you do such a thing you can no longer reason about Haskell functions with the assumption that == represents equality in the mathematical sense. (For an example, see Saizan's comment here.) So in the following I'm going to assume that we've limited ourselves to functions such that x == y implies that f x == f y, x < y and y < z imply x < z and so on. I'd love to see some way, in Haskell, to make explicit 'promises' about the properties of functions. (I guess that's what they call programming by contract.) But for the following we'll just assume it holds.

The first abstract algebraic structure that people study is usually the group, but I'm going to start with something simpler, the monoid. A monoid is a set M (called the underlying set of the monoid) with a binary operator, ·:M×M→M and identity element e such that e·x = x·e = x and x ·(y·z) = (x·y)·z for all x,y,z in M. By abuse of notation we'll often use M for the name of the monoid as well as the underlying set. Monoids are easy to come by and there are lots of examples. Any time you have a sequence of operations that can be applied, or not, in any order, to some system, you have a monoid. Sequences of operations form the underlying set of the monad and the binary operator means "do what's on the right" followed by "do what's on the left". For example the set of uniform scaling operations we can perform on a circle form a monoid. Write a scaling by a factor of x as s(x). If we double the size of a circle, and then triple it we have s(2) followed by s(3), written s(3)·s(2). Note that we have s(m)·s(n) = s(mn) and that s(1)·s(m)=s(m)·s(1)=s(m) so s(1) plays the role of the identity, e. Also note that for all m apart from 0 we have s(m)·s(1/m)=e so every scaling apart from s(0) has in inverse. Note however that this monoid is special because for all x and y in it, x·y=y·x. This doesn't always hold in monoids and when it does hold our monoid is said to be commutative.

If we have a type of algebraic structure then we can often form "free" versions of that structure. For monoids we get free monoids. There are many different ways of looking at "freeness" and I'm going to go through some of them informally:

1. Given any monoid there are equations satisfied by its elements. From the above example we have that x·y=y·x and also that x·e·x=x·x. But notice how there is a big difference between these equations. The former doesn't hold in all monads, but the latter does. A monoid is said to be free when the only equations that hold follow from equations that hold in every monoid. You can think of a free monoid as being a generic monoid. It has no special properties above and beyond being a monoid. Given a set S, the free monoid generated by S is the smallest free monoid containing the set S and with no equations relating the elements of S. Write this monoid as FS. For example, suppose S = {x,y}. Then we know that e, x and y are all in FS. We also know that x·x, x·y, y·x and y·y are all in S. Importantly we know that all 4 of these elements are distinct because we know there can be no equations connecting them beyond those that define a monoid. In fact, it's not hard to see that the elements of FS are essentially just the (possibly zero length) strings of x's and y's.
2. Given a set S, consider its elements to be 'labels' for unknown variables. In fact, just use the elements of S as variables. Then we can also consider the free monoid generated by S to be the set of "monoid expressions" in these unknowns. By "monoid expression" we just mean functions of these variables that can be written using the language of monoids, ie. the symbols e and ·. If S = {x,y} then examples of such expressions are e, x, y, x·y and so on. I hope it's not hard to see that this is simply another description of the same thing as in the previous paragraph.
3. Another way to think of free monoids requires a tiny bit more algebra. Define a monoid homomorphism from one monoid, M, to another, N, to be a function f:M→N on the underlying sets such that f(e)=e and f(x·y)=f(x)·f(y). A bijective monoid homomorphism is called an isomorphism. If there is an isomorphism between two monoids then in some sense they are the same monoid. Note that e and · are being overloaded here - we have two different monoids and these symbols have different meanings in each one. Now we can define the free monoid generated by S to be the monoid, M, such that (1) there is a function i:S→M (2) given any monoid N, and any function f:S→N, then f can be factored as f' o i where f' is a monoid homomorphism.
4. Very vaguely, the set S lives in the category of sets. But FS lives in the category of monoids. FS is the closest approximation to S in the category of monoids. The previous property gives a sense of what this means: anything that can be said about a function from S to a monoid N can be said in the language of monoids using a homomorphism from FS to N. Don't worry if this is too vague as I won't be using it below - but it may click in some people's minds.

The important thing here is that the operator F, ie. the "free monoid generated by" operator, forms a monad. It's tempting to think that monads got their name from monoids for this reason, but I don't think it is because just about any "free X generated by" operator forms a monad. Haskell monads give rise to DSLs via do-notation. So this means that algebraic structures give rise to monads, which may give rise to Haskell monads, and hence DSLs.

So now to explain why F forms a monad. Note that just about everything I say here works with other kinds of algebraic structures ranging from monoids through groups to vector spaces. Consider our set S={x,y} above. For convenience, let's drop the · and write the binary operator as multiplication in the usual way. Given an element of S we can easily get an element of FS. In fact, we have an embedding i:S→FS with i(x) = x. This is an abuse of notation. The element on the LHS is an element of S and the x on the RHS is a variable labeled with the symbol x, but we also write this as x because vx, or something like that, would be a pain in the ass to keep writing.

Now think about the elements of F(FS). These are strings of elements of FS, ie. strings of monoid expressions in x and y. We could write a typical member as (xyx)(e)(xxy) where the xyx, e and xxy are each elements of FS and I'm using parentheses to make clear which bits they are. It's tempting to say that this is simply xyxxxy, but that would be wrong. The former is an element of F(FS) and the latter is an element of FS. It would be clearer that these weren't equal if we used the v notation to write vxyxvevxxy. But in this case the oscurity is to our advantage. Even though (xyx)(e)(xxy) doesn't equal xyxxxy, the notation strongly suggests that we define a function that maps the first to the second. In fact, we can define a function m:F(FS)→FS that essentially just erases parentheses.

So we have functions i:S→FS and m:F(FS)→FS. Looks suspiciously like a monad. In fact, i and m satisfy the monad laws (exercise!) and make F into a monad.

In fact, if you repeat the above discussion with other algebraic structures that I and others have discussed you'll get other familiar monads (modulo details like being able to create infinite lists). Here's a table (with apologies for bad formatting caused by blogger.com):

 Algebraic Structure DSL Monoid Combinatorial search M-Set "Write-only" side effects Vector Space Probability theory/Quantum Mechanics Semirings Two player game tree search Modules over tropical semiring* Min-sum algorithm Group ???

(Note that the above table is approximate in the sense that sometimes you may need to restrict to instances of Eq and tweak the code slightly to make the Haskell monad behave like the mathematical one.)

Notice the gap in the "group" row. A group is a monoid in which every element has a left and right inverse. The monad is the "free group generated by" monad and I'll call it G. It doesn't correspond to any of the standard Haskell monads. So firstly - if free monoids give lists, what does the underlying datatype for free groups look like? Go back to our set S = {x,y}. GS contains e, x and y as well as all of the products of x and y that appear in FS. But additionally GS contains the inverses of x and y, x-1 and y-1. And of course you need all strings of x, y and their inverses. But do you need more? What about (xy)-1y? Well we can expand out the inverse using (xy)-1=y-1x-1. The net effect is that the free group contains precisely all (possibly empty) strings of x, y, x-1 and y-1, where substrings like xx-1 and x-1x are removed. We can model this in Haskell using the following type:
`> import Control.Monad> data Group a = G [Either a a] deriving Show`

We're using Left x to represent x and Right x to represent x-1. We implement >>= so that it uses the inverse of product formula above. It also ought to cancel out terms like xx-1 but to do that requires that we restrict the monad to instances of Eq and if I do that I risk castigation from Saizan. So I'll leave out the cancellation to get the following Haskell monad:
`> instance Monad Group where>     return x = G [Left x]>     G x >>= f = G \$ concatMap g x where>         g (Left y) = let G u = f y in u>         g (Right z) = let G u = f z in reverse (map (either Right Left) u)`

But what purpose might this serve? Try these:
`> test1 = sequence \$ replicate 4 (G [Left 0,Left 1])> test2 = sequence \$ replicate 4 (G [Left 0,Right 1])`

The first does something almost identical to what you might expect from the ordinary list monad. But note the ordering in the second. This is the Gray code monad! By using Left and Right we can control the direction in which future combinatorial searches are carried out. Admittedly not the most useful monad, but it's curious that we do get something recognisable from this abstract nonsense. And maybe you spotted the "beautiful" Haskell implementation of the power set function on reddit recently. Here's a cute twist on that:
`> powerset1 = filterM (const \$ G [Left False,Left True])> powerset2 = filterM (const \$ G [Left False,Right True])`

The first gives a result similar to the original, but the second lists the subsets in such a way that each element differs from the previous one by one element.

I wonder what other algebarically motivated monads are waiting to be discovered.

And I hope to write something about deriving comonads from coalgebras, as soon as I've read this. It looks really intriguing. Problem is, I'm having trouble making sense of it.

And sorry, this is a bit incoherent today. I only get a short period of time to write this stuff up and I blew most of mine this week in traffic jams. As usual, if the text makes no sense, you can be sure the code above works as I just tested it. But ask questions and complain about errors...

* The tropical semiring is the proper name for the (R,min,+) semiring I talked about earlier.

Update: Saizan's doing a great job of keeping me honest. He noticed that I'd omitted a step. I was originally finding the inverse of a free group element by simply reversing the elements in the product. But I was failing to flip the polarities of each of the elements (so to speak) despite having spelled out in the text exactly what I needed to do. The code is now fixed.

Labels: ,

## Saturday, June 16, 2007

### How to write tolerably efficient optimzation code without really trying...

...or the (freely generated module over the (min,+)-semiring)-monad. (But if I used that as a title you probably wouldn't be reading this.)

There's a large class of optimisation problems that can be considered to be about graph searching. This can include anything from finding the shortest distance between two cities, through finding the edit distance between two strings to finding the most plausible reconstruction of a signal corrupted by noise. In each case there's an obvious implementation of a solution in Haskell: use the list monad to generate all of the paths in the search space and then find the optimal path. The problem with this is that generating all paths can result in combinatorial explosion with exponential running times. So this post is about what I think is a novel monad to fix that problem. It gives a simple DSL for specifying a wide range of optimisation problems with polynomial instead of exponential complexity. In some cases it results in something close to the best published algorithm with little effort on the part of the user.

This is quite a neat and useful monad. So if the following is hard going, skip to the examples first for some motivation. I'd hate people to miss out on something that turns out to be fairly easy to use once it's set up.

Unfortunately, despite the expressiveness of Haskell making this possible, we have to fight Haskell on two fronts. One is the old annoyance of Haskell not supporting restricted monads. For a solution to that I'll refer you to Eric Kidd who used an Oleg trick. So let's get the first one out of the way before talking about graphs. Here's a preamble that replaces the Prelude's monad class with our own:
`> {-# LANGUAGE MultiParamTypeClasses,>            NoImplicitPrelude,>            UndecidableInstances #-}> import Debug.Trace> import Prelude hiding ((>>=),(>>),return)> import Data.Monoid> import Data.Map (toList,fromListWith)> class Monad1 m a where>  return :: a -> m a> class (Monad1 m a, Monad1 m b) => Monad2 m a b where>   (>>=) :: m a -> (a -> m b) -> m b>   (>>) :: m a -> m b -> m b>   l >> f = l >>= \_ -> f> class MonadPlus m where>    mzero :: m a>    mplus :: m a -> m a -> m a`

So here's a graph with lengths marked on the edges: Suppose we want to find the shortest path from A to F in 3 hops. Before solving that, consider a more general problem. In the following graph suppose we wish to find the shortest n-hop paths from A to the Ci. No matter what complexity is in the cloud, each of the paths must travel through one of the Bi. So we can solve the shortest paths problem by finding the shortest (n-1)-hop paths to the Bi and using:

distn(A,Ci) = minj(distn-1(A,Bj)+dji)

Here's the crucial point to notice: If we consider distn(A,Bi) to be a vector whose indices are labelled by i, and dji to be a matrix, then the above formula is an ordinary matrix multiplication except that instead of addition and multiplication we use min and addition. If you've studied graph algorithms you'll recognise this as the same multiplication that appears in the Floyd-Warshall algorithm. If we extend the real numbers with infinity, then (R∪∞,min,+) forms a semiring. (See Cormen, Leiserson & Rivest for some good discussion of semirings in this context.) Just as you can form vector spaces over a field, you can form modules over this semiring. And just as vector spaces form a monad, so do these modules. But you don't need to know much about semirings because the monad itself is a fairly straightforward object.

Here's that infinity I was talking about:
`> infinity :: Fractional a => a> infinity = 1/0`

I have to admit, I enjoyed writing that line of code :-) More importantly, Haskell does the right thing with infinities.

The monad is going to represent the length of a path to a graph vertex. Here's the definition:
`> data M l p a = M { runM :: [(a,p,l)] } deriving (Eq,Show,Ord)`

Ignore the l part for now, that's going to serve as a place to make a log, like the Writer monad. We'll need that when we want the actual shortest path, not just its length. For now we'll just use an empty list. So suppose that at some stage in our computation we've reached the node B with a shortest path of length 3 and node F with a shortest path of length 4. We represent this as
u = M [(B,3,[]),(F,4,[])] :: M Vertex Float [()]
.
Suppose we now consider what happens at the next step. We can express the options from B with
v = M [(A,1,[]),(D,3,[])]

and those from F as
w = M [(D,3,[]),(E,3,[])].

To combine these with the earlier computed paths we could replace B and F in u with v and w respectively to get:
M [(M [(A,1,[]),(D,3,[])],3,[]),(M [(D,3,[]),(E,3,[])],4,[])] :: M (M Vertex Float [()]) Float [()].

We need to reduce this back down to a M Vertex Float [()] using the matrix 'multiplication' formula above to pick the appropriate shortest path each time. But note the pattern: it's the usual M M a -> M a monad pattern. So we can write the substitution+reduction as an implementation of >>=. Here's the implementation, along with some other monadic goodness:
`> instance (Ord a,Monoid l,Num p) => Monad1 (M l p) a where>   return a = M [(a,0,mempty)]> instance (Ord a,Monoid l,Fractional p,Ord b,Ord p,Ord l) => Monad2 (M l p) a b where>   M x >>= f = let>        applyf = [(f a,p,l) | (a,p,l) <- x]>        rawresult = concat [[(a',(p+p',l `mappend` l')) | (a',p',l') <- a] | (M a,p,l) <- applyf]>        reduced = map (\(a,(p,l)) -> (a,p,l)) \$ toList \$ fromListWith min \$ rawresult>        filtered = filter (\(_,p,l) -> p/=infinity) reduced>        in M filtered> instance MonadPlus (M l p) where>    mzero = M mempty>    mplus (M a) (M b) = M (a ++ b)> guard True = return ()> guard False = mzero> choose x = M \$ map (\(a,p) -> (a,p,mempty)) x`

I've also had to provide my own implementation of MonadPlus. choose is a convenience function so we don't have to keep entering those empty lists.

So here's that graph represented in terms of a function that describes transitions from vertex to vertex:
`> data Vertex = A | B | C | D | E | F deriving (Eq,Show,Ord)> graph A = choose [(B,1),(C,2)]> graph B = choose [(A,1),(D,3)]> graph C = choose [(A,2),(D,4),(E,1)]> graph D = choose [(B,3),(C,4),(E,2),(F,3)]> graph E = choose [(C,1),(D,2),(F,3)]> graph F = choose [(D,3),(E,3)]`

And here's our first example. We compute all of the shortest 3-hop paths from A:
`> ex0 :: M [()] Float Vertex> ex0 = do>    step1 <- return A>    step2 <- graph step1>    step3 <- graph step2>    step4 <- graph step3>    return step4`

Just in case you think that's the wrong result note that these are the shortest paths to each accessible vertex that are exactly 3 hops long. If you want paths up to 3 hops long you have to add some zero length transitions from nodes to themselves. I'll leave that as a (trivial) exercise.

But suppose you just want the shortest path to F. Then you can use:
`> ex1 :: M [()] Float Vertex> ex1 = do>    step1 <- return A>    step2 <- graph step1>    step3 <- graph step2>    step4 <- graph step3>    guard \$ step4==F>    return step4`

Note how simple this code is. We just write code to walk the graph. All of the optimisation machinery is hidden in the background. But there's a dark secret here because the code isn't yet doing what a claim. But before I admit what the problem is, let's modify this code so we can actually see what the shortest paths are. record is just like tell from the Writer monad. I've implemented >>= so that it does the right thing:
`> record l = M [((),0,l)]> ex2 = do>   step1 <- choose [(A,0),(B,0)]>   record [step1]>   step2 <- graph step1>   record [step2]>   step3 <- graph step2>   record [step3]>   step4 <- graph step3>   record [step4]>   guard (step4==E || step4==F)>   return (step1,step4)`

Just for fun I made this example more complex. It finds all four shortest paths from anywhere in {A,B} to anywhere in {E,F}. As an exercise try modifying the above code so that it finds the one shortest path from {A,B} to {E,F}. It's also simple to modify it to deal with conditions like the traveller being at one of a particular set of nodes at a particular time. It's a DSL, you have considerable freedom to form whatever conditions you want.

And now it's time to reveal the secret. Although I wrote >>= so that it reduces the list of considered paths at each stage the above code still has exponential complexity as you'll rapidly discover as you increase the number of steps.

Consider the third monad law: (m >>= f) >>= g = m >>= (\x -> f x >>= g).

If we evaluate the left hand side for this monad we get a reduction after applying f and then another after applying g. But we if we evaluate the right hand side we find that we get many small (and useless) reductions applied and then at the end a large reduction is applied. We don't get a big reduction at the intermediate step. The result: exponential complexity because the code ends up exploring every possible path performing stupid trivial reductions at each step. The monad law says we get the same result, but it doesn't take the same time.

And here's the really annoying thing: when you use do-notation in the usual way the Haskell compiler builds expressions like the slow right hand side. Bummer! But there's an easy, if tedious, fix. We have to use the third monad law as a rewrite rule to convert our code to the form on the left hand side. Here's ex2 rewritten:
`> ex3 = do>   step1 <- choose [(A,0),(B,0)]>   step4 <- do>       step3 <- do>           step2 <- do>               record [step1]>               graph step1>           record [step2]>           graph step2>       record [step3]>       graph step3>   do>       record [step4]>       guard (step4==E || step4==F)>   return (step1,step4)`

Not as pretty is it? It's still not too bad but it'd be nice to make this conversion automatic. I've a hunch this can be done using Haskell rules. Anyone care to volunteer some help making this work?

Anyway, I mentioned that lots of problems are in this class, some not obviously graph theoretical. Here's one: computing the edit distance between two strings:
`> diag (a:as,b:bs) = choose [ ((as,bs),if a==b then 0 else 1) ]> diag _           = mzero> right (a:as,bs) = choose [((as,bs),1)]> right _         = mzero> down (as,b:bs) = choose [((as,bs),1)]> down _         = mzero> nop ([],[]) = return ([],[])> nop _       = mzero> munch :: (String,String) -> M [()] Float (String,String)> munch x = diag x `mplus` right x `mplus` down x `mplus` nop x> pure (M a) = length a==1> editDistance a b = until pure (>>= munch) (munch (a,b))> ex4 = editDistance "hypocrisy" "hippocracy"`

I think this is actually the dynamic programming algorithm in disguise (and slightly reordered) with some extra log time penalties because it's using a binary tree (from Data.Set) instead of arrays. But I wouldn't stake my life on it.

Anyway, here's one last example. Rather than explain it I refer you to the nicely written Wikipedia on the Viterbi algorithm. In fact, I decided to lift their example precisely. And note that I've (1) written the code twice, the second one is the one rewritten for speed, (2) used logarithms to convert the multiplications to additions and (3) 'unrolled' the loop to make absolutely explicit what's going on. If I wrote Haskell code to work on general arrays like the Python code in the Wikipedia article, it'd be too short and you might miss it if you blinked.
`> data Weather = Rainy | Sunny deriving (Eq,Show,Ord)> data Activity = Walk | Shop | Clean deriving (Eq,Show,Ord)> transition Rainy = choose [(Rainy,-log 0.7),(Sunny,-log 0.3)]> transition Sunny = choose [(Rainy,-log 0.4),(Sunny,-log 0.6)]> emission Rainy = choose [(Walk,-log 0.1),(Shop,-log 0.4),(Clean,-log 0.5)]> emission Sunny = choose [(Walk,-log 0.6),(Shop,-log 0.3),(Clean,-log 0.1)]> ex5 = do>       day1 <- choose [(Rainy,-log 0.6),(Sunny,-log 0.4)]>       record [day1]>       emission1 <- emission day1>       guard (emission1 == Walk)>       day2 <- transition day1>       record [day2]>       emission2 <- emission day2>       guard (emission2 == Shop)>       day3 <- transition day2>       record [day3]>       emission3 <- emission day3>       guard (emission3 == Clean)>       day4 <- transition day3>       record [day4]> ex6 = do>       day4 <- do>           day3 <- do>               day2 <- do>                   day1 <- choose [(Rainy,-log 0.6),(Sunny,-log 0.4)]>                   do>                       record [day1]>                       do>                           emission1 <- emission day1>                           guard (emission1 == Walk)>                   transition day1>               do>                   record [day2]>                   do>                       emission2 <- emission day2>                       guard (emission2 == Shop)>               transition day2>           do>               record [day3]>               do>                   emission3 <- emission day3>                   guard (emission3 == Clean)>           transition day3>       record [day4]`

Anyway, I haven't proved a thing above. I think (hope?) I'm right about the polynomial complexity. Certainly on paper it all works, but I've always found it hard to reason about the complexity of Haskell code.

Unfortunately this code is all still slower than I'd like. I've played with trace during the development and everything seemed to run the way I expect (at least after I discovered the third monad law problem). I think that ultimately there's a big overhead for piling up HOFs, and the lookups in Data.Set probably aren't helping. Still, I don't think that invalidates the mathematics behind the approach and I think that asymptotically the slowness is really a big constant and a bunch of logs and it's still much better than combinatorial search. I ought to see how ocaml fares.

BTW There's a deep connection between this monad and the quantum mechanics one. I've no time to write about that now, though see here.

Some acknowledgements: the strange but highly readable Wikipedia article motivated me to think about the Viterbi algorithm, and I became sure there was a monad lurking in the Python code. I spent a month or so getting nowhere. And then I spoke with Eric Kidd on the phone about probability monads. He challenged me on something and in an attempt to formulate a response the whole of the above seemed to just crystallise. Thanks for asking the right question Eric!

I think there might be a comonad for working backwards through the graph...

Anyway, one last thing: please for the love of the FSM will someone put a clean way to implement restricted monads into their Haskell compiler. There is so much neat stuff that could be implemented with it and I now officially consider Haskell broken without one.

Labels: ,