Saturday, January 31, 2009

Beyond Regular Expressions: More Incremental String Matching

In my last post I showed how to incrementally match long strings against regular expressions. I now want to apply similar methods to matching languages that can't be described by regular expressions. (Note that 'language' is just jargon for a set of strings that meet some criterion.) In particular, regular expressions can't be used to test a string for balanced parentheses. This is because we need some kind of mechanism to count how many open parentheses are still pending and a finite state machine can't represent arbitrary integers.

So let's start with a slightly more abstract description of what was going on last time so we can see the bigger picture. We were storing strings in balanced trees with a kind of 'measurement' or 'digest' of the string stored in the nodes of the tree. Each character was mapped to an element of a monoid via a function called measure, and you can think of the measurement function as acting on entire strings if you mappend together the measurements of each of the characters. So what we have is a function f :: String -> M taking strings to some type M (in the last post M was a type of array) with the properties

f (a ++ b) == f a `mappend` f b
f [] == mempty

By noticing that String is itself a monoid we can write this as

f (a `mappend` b) == f a `mappend` f b
f mempty == mempty

Anything satisfying these laws is called a monoid homomorphism, or just homomorphism for short.

So the technique I used worked like this: I found a homomorphism from String to some type with the useful property that for any string s, f s still contains all the information required to figure out if we're dealing with a member of our language. If f turns a string into something more efficient to work with then we can make our string matching more efficient.

Now I want to make the notion of "contains all the information required" more precise by considering an example. Consider strings that consist only of the characters ( and ). Our language will be the set of strings whose parentheses balance. In other words the total number of ( must match the total number of ), and as we scan from left to right we must never see more ) than (. For example, ()()() and ((()))() are in our language, but )()()( isn't. This language is called the Dyck language.

Suppose we're testing whether or not some string is in the Dyck language. If we see () as a substring then if we delete it from the string, it makes no difference to whether or not the string is in the Dyck language. In fact, if we see (()), ((())), (((()))) and so on they can all be deleted. On the other hand, you can't delete )( without knowing about the rest of the string. Deleting it from ()() makes no difference to its membership in the Dyck language, but deleting it from )(() certainly does.

So given a language L, we can say that two strings, x and y, are interchangeable with respect to L if any time we see x as a substring of another string we can replace it with y, and vice versa, without making any difference to whether the string is in the language. Interchangeable strings are a kind of waste of memory: if we're testing for membership of L there's no need to distinguish between them. So we'd like our measurement homomorphism to map all interchangeable strings to the same value. But we don't want to map any more strings to the same value, because then we'd lose information that tells us whether a string is an element of L. A homomorphism that strikes this balance perfectly is called the 'canonical homomorphism', and the image of the set of all strings under this homomorphism is called the syntactic monoid. By 'image', I simply mean all the possible values that could arise from applying the homomorphism to all possible strings.

So let's go back to the Dyck language. Any time we see () we can delete it. But if we delete every occurrence of () from a string then all we have left is a bunch of ) followed by a bunch of (. Let's say it's p of the former and q of the latter. Every string of parentheses can be distilled down to a pair of integers ≥ 0, (p,q). But does this go far enough? Could we distill any further? It's a good exercise to check that for any choice of (p,q) and any other choice of (p',q') there's always a string in the Dyck language containing )^p(^q as a substring such that replacing that substring with )^p'(^q' gives something not in the language. So you can't distill any further, which means we have our syntactic monoid and canonical homomorphism. In this case the monoid is called the bicyclic monoid and we can implement it as follows:


> {-# LANGUAGE TypeSynonymInstances,FlexibleInstances,MultiParamTypeClasses #-}
> import Data.Foldable
> import Data.Monoid
> import Data.FingerTree hiding (fromList)
> import qualified Data.List as L

> data Bicyclic = B Int Int deriving (Eq,Show)

> hom '(' = B 0 1
> hom ')' = B 1 0

> instance Monoid Bicyclic where
>     mempty = B 0 0
>     B a b `mappend` B c d = B (a-b+max b c) (d-c+max b c)


Where did that code for mappend come from? Consider concatenating )^a(^b with )^c(^d. In the middle we have (^b)^c, from which we can delete () exactly min(b,c) times, leaving either some unmatched ( (if b > c) or some unmatched ) (if c > b). Working through both cases, the result is )^(a-b+max(b,c)) (^(d-c+max(b,c)), which is exactly what the definition of mappend above computes.
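For a quick sanity check (my own example): "((" distils to B 0 2 and ")" distils to B 1 0. Their concatenation "(()" distils to a single unmatched (, i.e. B 0 1, and indeed B 0 2 `mappend` B 1 0 = B (0-2+max 2 1) (0-1+max 2 1) = B 0 1.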

Now we can more or less reproduce the code of last week and get a Dyck language tester. Once we've distilled a string down to (p,q) we only need to test whether or not p=q=0 to see whether or not our parentheses are balanced:


> matches' s = x==B 0 0 where
>     x = mconcat (map hom s)

> data Elem a = Elem { getElem :: a } deriving Show
> data Size = Size { getSize :: Int } deriving (Eq,Ord,Show)

> instance Monoid Size where
>     mempty = Size 0
>     Size m `mappend` Size n = Size (m+n)

> instance Measured (Size,Bicyclic) (Elem Char) where
>     measure (Elem a) = (Size 1,hom a)

> type FingerString = FingerTree (Size,Bicyclic) (Elem Char)

> insert :: Int -> Char -> FingerString -> FingerString
> insert i c z = l >< (Elem c <| r) where (l,r) = split (\(Size n,_) -> n>i) z

> string = empty :: FingerString

> matchesDyck string = snd (measure string)==B 0 0

> loop string = do
>     print $ map getElem (toList string)
>     print $ "matches? " ++ show (matchesDyck string)
>     print "(Position,Character)"
>     r <- getLine
>     let (i,c) = read r
>     loop $ insert i c string

> main = do
>     loop string


There's a completely different way to test membership of the Dyck language. Replace each ( with 1 and ) with -1. Now scan from left to right keeping track of (1) the sum of all the numbers so far and (2) the minimum value taken by this sum. If the final sum and the final minimum are both zero, then we have matching parentheses. But to make this incremental we need to be able to compute these quantities for substrings without scanning from the beginning in one go. That's an example of a parallel prefix sum problem and it's what I talked about here.
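Here, as an illustration, is a minimal non-incremental version of that scan (the name balancedScan is my own):

> balancedScan :: String -> Bool
> balancedScan s = finalSum == 0 && lowest == 0 where
>     steps    = map (\c -> if c == '(' then 1 else -1) s
>     sums     = scanl (+) 0 steps        -- running sums, starting from 0
>     finalSum = Prelude.last sums        -- sum over the whole string
>     lowest   = Prelude.minimum sums     -- the lowest the running sum ever gets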

So here's an extended exercise: adapt the parallel prefix sum approach to implement incremental Dyck language testing with fingertrees. You should end up with a canonical homomorphism that's similar to the one above. It'll probably be slightly different but ultimately equivalent.

And here's an even more extended exercise: protein sequences are sequences from a 20 letter alphabet. Each letter can be assigned a hydrophobicity value from certain tables. (Pick whichever table you want.) The hydrophobicity of a string is the sum of the hydrophobicities of its letters. Given a string, we can give it a score corresponding to the largest hydrophobicity of any contiguous substring in it. Use fingertrees and a suitable monoid to track this score as the string is incrementally edited. Note how widely separated substrings can suddenly combine together as stuff between them is adjusted.
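For the last part, here's one possible shape for the monoid, as a rough sketch of my own (I'm assuming hydrophobicities are Doubles and that the empty substring is allowed, scoring zero):

> data Segment = Segment
>     { total      :: Double   -- sum of hydrophobicities over the whole stretch
>     , best       :: Double   -- best contiguous sum anywhere inside
>     , bestPrefix :: Double   -- best sum of a prefix
>     , bestSuffix :: Double   -- best sum of a suffix
>     } deriving Show

> instance Monoid Segment where
>     mempty = Segment 0 0 0 0
>     Segment t1 b1 p1 s1 `mappend` Segment t2 b2 p2 s2 =
>         Segment (t1+t2)
>                 (b1 `max` b2 `max` (s1+p2))   -- best piece is wholly left, wholly right, or straddles the join
>                 (p1 `max` (t1+p2))
>                 (s2 `max` (s1+t2))

A single letter with hydrophobicity h would measure to Segment h (max h 0) (max h 0) (max h 0), and the score of the whole string is then the best field of the tree's measurement.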

If you're interested in Dyck languages with multiple types of parenthesis that need to match you need something much more fiendish.

Saturday, January 24, 2009

Fast incremental regular expression matching with monoids

The Problem


Consider this problem: fix a regular expression R. Suppose you have a string of length N. There's not much you can do about the first test: you'll likely have to scan all N characters to see if the string matches R. But once you've performed that test, how fast can you test the string again after a small edit is made to it? It seems that in general you'd have to rescan the entire string, or at least rescan from where the edit was made. But it turns out that you can do regular expression matching incrementally, so that for many changes you might make to the string you only require O(log N) time to recompute whether the string matches. This is true even if characters at opposite ends of the string interact to make a successful match. What's more, it's remarkably straightforward to implement if we make use of fingertrees and monoids.

I'm going to assume a bit of background for which resources can be found on the web: understanding some basics about monoids, understanding apfelmus's inspiring introduction to fingertrees or the original paper, and you'll need to be completely comfortable with the idea of compiling regular expressions to finite state machines.

As usual, this is literate Haskell. But you need to have the fingertree package installed and we need a bunch of imports.


> {-# LANGUAGE TypeSynonymInstances,FlexibleInstances,MultiParamTypeClasses #-}
> import qualified Data.Array as B
> import Data.Array.Unboxed as U
> import Data.Foldable
> import Data.Monoid
> import Data.FingerTree hiding (fromList)
> import qualified Data.List as L


A Finite State Machine


So let's start with an example regular expression: .*(.*007.*).*. We're looking for "007" enclosed between parentheses, but the parentheses could be millions of characters apart.

A standard technique for matching regular expressions is to compile them to a finite state automaton. It takes quite a bit of code to do that, but it is completely standard. So rather than do that, here's a finite state machine I constructed by hand for this regular expression:

I've used the convention that an unlabelled edge corresponds to any input that isn't matched by another labelled edge. We can express the transitions as a function fsm.


> fsm 0 '(' = 1
> fsm 0 _ = 0

> fsm 1 '0' = 2
> fsm 1 _ = 1

> fsm 2 '0' = 3
> fsm 2 _ = 1

> fsm 3 '7' = 4
> fsm 3 '0' = 3
> fsm 3 _ = 1

> fsm 4 ')' = 5
> fsm 4 _ = 4

> fsm 5 _ = 5


The initial state is 0 and a match corresponds to state 5.

We can test a string in the standard way using


> matches s = Prelude.foldl fsm 0 s==5


Try matches "(00 7)" and matches "He(007xxxxxxxxxxxx)llo".

We can think of the inputs as functions acting on the automaton: each input character is a function that maps the automaton from one state to another. We could use the Haskell composition function, (.), to compose these functions. But (.) doesn't really do anything; f . g is just a closure that says "when the time comes, apply g and then f".

On the other hand we could tabulate our transition functions as follows:


> tabulate f = array (0,5) [(x,f x) | x <- range (0,5)]


We have one such tabulated function for each letter in our alphabet:


> letters = array (' ','z') [(i,tabulate (flip fsm i)) | i <- range (' ','z')]


Given two tabulated functions we can easily form the table of the composition function. In fact, our tabulated functions form a monoid with mappend for composition. I used unboxed arrays for performance:


> type Table = UArray Int Int

> instance Monoid Table where
>     mempty = tabulate id
>     f `mappend` g = tabulate (\state -> (U.!) g ((U.!) f state))


Note that we've cheated a bit. An object of type Table could be any array of Ints indexed by Int. But if we promise to only build arrays indexed by elements of [0..5] and containing elements of the same range then our claim to monoidhood is valid.

Given any string, we can compute whether it matches our regular expression by looking up the corresponding Table in our letters array, composing them, and then checking if the tabulated function maps the initial state 0 to the final state 5:


> matches' s = table!0==5 where
>     table = mconcat (map ((B.!) letters) s)


This is slower and more complex than our original implementation of matches. But what we've now done is 'tease out' a monoid structure from the problem. If we store a string as a sequence of characters represented by a fingertree, we can store in each subtree the element of Table corresponding to the substring it represents. Every time the tree is rebalanced we need to recompute the corresponding Tables. But that's fine, it typically involves only O(log N) operations, and we don't need to write any code, the fingertree will do it for us automatically. Once we've done this, we end up with a representation of strings with the property that we always know what the corresponding Table is. We can freely split and join such trees knowing that the Table will always be up to date.

The only slight complication is that I want to be able to randomly access the nth character of the tree. apfelmus explains that in his post. The change I need to make is that I'm going to use both the Size monoid and the Table monoid, so I need the product monoid.


> data Elem a = Elem { getElem :: a } deriving Show
> data Size = Size { getSize :: Int } deriving (Eq,Ord,Show)

> instance Monoid Size where
>     mempty = Size 0
>     Size m `mappend` Size n = Size (m+n)


We need to implement measure as in the fingertree paper:


> instance Measured (Size,Table) (Elem Char) where
>     measure (Elem a) = (Size 1,(B.!) letters a)


A Fingertree


And now we can define our strings as:


> type FingerString = FingerTree (Size,Table) (Elem Char)


The insertion routine is more or less what's in the paper:


> insert :: Int -> Char -> FingerString -> FingerString
> insert i c z = l >< (Elem c <| r) where (l,r) = split (\(Size n,_) -> n>i) z


Note how I project out the size from the product monoid in order to insert at the correct position.

Here's an example string. Adjust the length to suit your memory and CPU horsepower:


> fromList = L.foldl' (|>) empty
> string = fromList (map Elem $ take 1000000 $ cycle "the quick brown fox jumped over the lazy dog")


(I use a strict form of fromList to ensure the tree actually gets built.)

The actual match function simply projects out the second component of the monoid and again tests to see if it maps the initial state to the final state:


> matches007 string = ((U.!) (snd (measure string)) 0)==5


An Interactive Loop


I recommend compiling with optimisation, something like ghc --make -O5 -o regexp regexp.lhs:


> loop string = do
>     print $ "matches? " ++ show (matches007 string)
>     print "(Position,Character)"
>     r <- getLine
>     let (i,c) = read r
>     loop $ insert i c string

> main = do
>     loop string


Now you can run this interactively. Input values like (100,'f') to insert an 'f' at position 100. It can take a good few seconds to compute the initial tree, but after that the matching process is instantaneous. (Actually, the second match might take a few seconds, that's because despite the foldl' the tree hasn't been fully built.)

A suitable sample input is:

(100,'(')
(900000,')')
(20105,'0')
(20106,'0')
(20107,'7')


Discussion


Note there is quite an overhead for this example: I'm storing an entire Table for each character. But you can easily store chunks of string instead (like in a rope). This means that some chunks will be rescanned when the string is edited, but rescanning a 1K chunk, say, is a lot less expensive than scanning a gigabyte file in its entirety. Working in blocks will probably speed up the initial scan too, since a much smaller tree needs to be built.
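For example, a chunked version might measure a whole block of characters at once, something along these lines (a sketch of my own, reusing letters and Table from above; a real implementation would also split chunks near the edit point):

> newtype Chunk = Chunk String

> instance Measured (Size,Table) Chunk where
>     measure (Chunk s) = (Size (length s), mconcat (map ((B.!) letters) s))

> type ChunkedString = FingerTree (Size,Table) Chunk

Editing inside a chunk then means remeasuring just that chunk, which is the rescanning cost mentioned above.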

When Hinze and Paterson originally wrote the fingertree paper they were motivated by parallel prefix sum methods. Just about any parallel prefix algorithm can be converted to an incremental algorithm using fingertrees. This article is based on the idea of doing this with the parallel lexing scheme described by Hillis and Steele in their classic Connection Machine paper.

So why would you want to match against a fixed regular expression like this? Well this method extends to a full blown incremental lexer. This will lex quickly even if placing a character in a string changes the type of lexemes billions of characters away. See the Hillis and Steele paper for details.

Note there's nothing especially Haskelly about this code except that Haskell made it easy to prototype. You can do this in C++, say, using mutable red-black trees.

Saturday, January 17, 2009

Haskell Monoids and their Uses

Haskell is a great language for constructing code modularly from small but orthogonal building blocks. One of these small blocks is the monoid. Although monoids come from mathematics (algebra in particular) they are found everywhere in computing. You probably use one or two monoids implicitly with every line of code you write, whatever the language, but you might not know it yet. By making them explicit we find interesting new ways of constructing those lines of code. In particular, ways that are often easier to both read and write. So the following is an intro to monoids in Haskell. I'm assuming familiarity with type classes, because Haskell monoids form a type class. I also assume some familiarity with monads, though nothing too complex.

This post is literate Haskell so you can play with the examples directly.

Defining Monoids


In Haskell, a monoid is a type with a rule for how two elements of that type can be combined to make another element of the same type. To be a monoid there also needs to be an element that you can think of as representing 'nothing' in the sense that when it's combined with other elements it leaves the other element unchanged.

A great example is lists. Given two lists, say [1,2] and [3,4], you can join them together using ++ to get [1,2,3,4]. There's also the empty list []. Using ++ to combine [] with any list gives you back the same list, for example []++[1,2,3,4]==[1,2,3,4].

Another example is the type of integers, Integer. Given two elements, say 3 and 4, we can combine them with + to get 7. We also have the element 0 which when added to any other integer leaves it unchanged.

So here is a possible definition for the monoid type class:

class Monoid m where
    mappend :: m -> m -> m
    mempty :: m

The function mappend is the function we use to combine pairs of elements, and mempty is the 'nothing' element. We can make lists an instance like this:


instance Monoid [a] where
    mappend = (++)
    mempty = []


Because we want mempty to do nothing when combined with other elements we also require monoids to obey these two rules

a `mappend` mempty = a

and

mempty `mappend` a = a.


Notice how there are two ways to combine a and b using mappend. We can write a `mappend` b or b `mappend` a. There is no requirement on a monoid that these be equal to each other. (But see below.) But there is another property that monoids are required to have. Suppose we start with the list [3,4]. And now suppose we want to concatenate it with [1,2] on the left and [5,6] on the right. We could do the left concatenation first to get [1,2]++[3,4] and then form ([1,2]++[3,4])++[5,6]. But we could do the right one first and get [1,2]++([3,4]++[5,6]). Because we're concatenating at opposite ends the two operations don't interfere and it doesn't matter which we do first. This gives rise to the third and last requirement we have of monoids:

(a `mappend` b) `mappend` c == a `mappend` (b `mappend` c)

and you can summarise it with the slogan 'combining on the left doesn't interfere with combining on the right'. Notice how the integers, combined with +, also have this property. It's such a useful property it has a name: associativity.

That's a complete specification of what a monoid is. Haskell doesn't enforce the three laws I've given, but anyone reading code that uses a monoid will expect these laws to hold.

Some Uses of Monoids


But given that we already have individual functions like ++ and +, why would we ever want to use mappend instead?

One reason is that with a monoid we get another function, mconcat, for free. mconcat takes a list of values in a monoid and combines them all together. For example mconcat [a,b,c] is equal to a `mappend` (b `mappend` c). Any time you have a monoid you have this quick and easy way to combine a whole list together. But note that there is some ambiguity in the idea behind mconcat: to compute mconcat [a,b,...,c,d], which order should we work in? Should we work from left to right and compute a `mappend` b first? Or should we start with c `mappend` d? That's one place where the associativity law comes in: it makes no difference.
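In fact the default definition of mconcat in Data.Monoid is essentially just a fold:

mconcat :: Monoid m => [m] -> m
mconcat = foldr mappend mempty

So, for example, with the list monoid mconcat [[1,2],[3],[4,5]] gives [1,2,3,4,5], and mconcat [] gives mempty, i.e. [].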

Another place where you might want to use a monoid is in code that is agnostic about how you want to combine elements. Just as mconcat works with any monoid, you might want to write your own code that works with any monoid.

Explicitly using the Monoid type class for a function also tells the reader of your code what your intentions are. If a function has signature [a] -> b you know it takes a list and constructs an object of type b from it, but it has considerable freedom in what it can do with your list. If instead you see a function of type (Monoid a) => a -> b, then even if it is only ever used with lists you know what kind of things the function can do with them. For example, it might append things to your list, but it's never going to pull any elements out of it.

The same type can give rise to a monoid in different ways. For example, I've already mentioned that the integers form a monoid. So we could define:

instance Monoid Integer where
    mappend = (+)
    mempty = 0

But there's a good reason not to do that: there's another natural way to make integers into a monoid:


instance Monoid Integer where
    mappend = (*)
    mempty = 1

We can't have both of these definitions at the same time. So the Data.Monoid library doesn't make Integer into a Monoid directly. Instead, it wraps them with Sum and Product. It also does so more generally so that you can make any Num type into a monoid in two different ways. We have both

Num a => Monoid (Sum a)

and

Num a => Monoid (Product a)

To use these we wrap our values in the appropriate wrapper and we can then use the monoid functions. For example mconcat [Sum 2,Sum 3,Sum 4] is Sum 9, but mconcat [Product 2,Product 3,Product 4] is Product 24.

Using Sum and Product looks like a complicated way to do ordinary addition and multiplication. Why do things that way?

The Writer Monad


You can think of monoids as being accumulators. Given a running total, n, we can add in a new value a to get a new running total n' = n `mappend` a. Accumulating totals is a very common design pattern in real code so it's useful to abstract this idea. This is exactly what the Writer monad allows. We can write monadic code that accumulates values as a "side effect". The function to perform the accumulation is (somewhat confusingly) called tell. Here's an example where we're logging a trace of what we're doing.


> import Data.Monoid
> import Data.Foldable
> import Control.Monad.Writer
> import Control.Monad.State

> fact1 :: Integer -> Writer String Integer
> fact1 0 = return 1
> fact1 n = do
>     let n' = n-1
>     tell $ "We've taken one away from " ++ show n ++ "\n"
>     m <- fact1 n'
>     tell $ "We've called f " ++ show m ++ "\n"
>     let r = n*m
>     tell $ "We've multiplied " ++ show n ++ " and " ++ show m ++ "\n"
>     return r


This is an implementation of the factorial function that tells us what it did. Each time we call tell we combine its argument with the running log of all of the strings that we've 'told' so far. We use runWriter to extract the results back out. If we run


> ex1 = runWriter (fact1 10)


we get back both 10! and a list of what it took to compute this.

But Writer allows us to accumulate more than just strings. We can use it with any monoid. For example, we can use it to count how many multiplications and subtractions were required to compute a given factorial. To do this we simply tell a value of the appropriate type. In this case we want to add values, and the monoid for addition is Sum. So instead we could implement:


> fact2 :: Integer -> Writer (Sum Integer) Integer
> fact2 0 = return 1
> fact2 n = do
>     let n' = n-1
>     tell $ Sum 1
>     m <- fact2 n'
>     let r = n*m
>     tell $ Sum 1
>     return r



> ex2 = runWriter (fact2 10)


There's another way we could have written this, using the state monad:


> fact3 :: Integer -> State Integer Integer
> fact3 0 = return 1
> fact3 n = do
>     let n' = n-1
>     modify (+1)
>     m <- fact3 n'
>     let r = n*m
>     modify (+1)
>     return r

> ex3 = runState (fact3 10) 0


It works just as well, but there is a big advantage to using the Writer version: its type signature is fact2 :: Integer -> Writer (Sum Integer) Integer. We can immediately read from this that our function has a side effect that involves accumulating a number in a purely additive way. It's never going to, for example, multiply the accumulated value. The type information tells us a lot about what is going on inside the function without us having to read a single line of the implementation. The version written with State is free to do whatever it likes with the accumulated value, so it's harder to discern its purpose.

Data.Monoid also provides an Any monoid. This is the Bool type with the disjunction operator, better known as ||. The idea behind the name is that if you combine together any collection of elements of type Any then the result is Any True precisely when at least one of the original elements is Any True. If we think of these values as accumulators then they provide a kind of one-way switch. We start accumulating with mempty, ie. Any False, and we can think of this as the switch being off. Any time we accumulate Any True into our running 'total' the switch is turned on. This switch can never be switched off again by accumulating more values. This models a pattern we often see in code: a flag that we want to switch on, as a side effect, if a certain condition is met at any point.


> fact4 :: Integer -> Writer Any Integer
> fact4 0 = return 1
> fact4 n = do
>     let n' = n-1
>     m <- fact4 n'
>     let r = n*m
>     tell (Any (r==120))
>     return r

> ex4 = runWriter (fact4 10)


At the end of our calculation we get n!, but we are also told if at any stage in the calculation two numbers were multiplied to give 120. We can almost read the tell line as if it were English: "tell my caller if any value of r is ever 120". Not only do we get the plumbing for this flag with a minimal amount of code, the type of fact4 also tells us exactly what's going on: we can read off immediately that this function, as a "side effect", computes a flag that can be turned on but never turned off. That's a lot of useful information from just a type signature. In many other programming languages we might expect to see a boolean in the type signature, but we'd be forced to read the code to get any idea of how it will be used.

Commutative Monoids, Non-Commutative Monoids and Dual Monoids



Two elements of a monoid, x and y, are said to commute if x `mappend` y == y `mappend` x. The monoid itself is said to be commutative if all of its elements commute with each other. A good example of a commutative monoid is the type of integers. For any pair of integers, a+b==b+a.

If a monoid isn't commutative, it's said to be non-commutative. Non-commutativity means that for some x and y, x `mappend` y isn't the same as y `mappend` x, so mappend and flip mappend are not the same function. For example [1,2] ++ [3,4] is different from [3,4] ++ [1,2]. This has the interesting consequence that we can make another monoid in which the combination function is flip mappend. We can still use the same mempty element, so the first two monoid laws hold, and it's a nice exercise to prove that the third monoid law still holds too. This flipped monoid is called the dual monoid and Data.Monoid provides the Dual type constructor to build the dual of a monoid. We can use this to reverse the order in which the writer monad accumulates values. For example the following code collects the execution trace in reverse order:


> fact5 :: Integer -> Writer (Dual String) Integer
> fact5 0 = return 1
> fact5 n = do
>     let n' = n-1
>     tell $ Dual $ "We've taken one away from " ++ show n ++ "\n"
>     m <- fact5 n'
>     tell $ Dual $ "We've called f " ++ show m ++ "\n"
>     let r = n*m
>     tell $ Dual $ "We've multiplied " ++ show n ++ " and " ++ show m ++ "\n"
>     return r

> ex5 = runWriter (fact5 10)


The Product Monoid



Suppose we want to accumulate two side effects at the same time. For example, maybe we want to both count instructions and leave a readable trace of our computation. We could use monad transformers to combine two writer monads. But there is a slightly easier way - we can combine two monoids into one 'product' monoid. It's defined like this:

instance (Monoid a,Monoid b) => Monoid (a,b) where
    mempty = (mempty,mempty)
    mappend (u,v) (w,x) = (u `mappend` w,v `mappend` x)


Each time we use mappend on the product we actually perform a pair of mappends on each of the elements of the pair. With these small helper functions:


> tellFst a = tell $ (a,mempty)
> tellSnd b = tell $ (mempty,b)


we can now use two monoids simultaneously:


> fact6 :: Integer -> Writer (String,Sum Integer) Integer
> fact6 0 = return 1
> fact6 n = do
>     let n' = n-1
>     tellSnd (Sum 1)
>     tellFst $ "We've taken one away from " ++ show n ++ "\n"
>     m <- fact6 n'
>     let r = n*m
>     tellSnd (Sum 1)
>     tellFst $ "We've multiplied " ++ show n ++ " and " ++ show m ++ "\n"
>     return r

> ex6 = runWriter (fact6 5)


If we had simply implemented our code using one specific monoid, like lists, our code would be very limited in its application. But by using the general Monoid type class we ensure that users of our code can use not just any individual monoid, but even multiple monoids. This can make for more efficient code because it means we can perform multiple accumulations while traversing a data structure once. And yet we still ensure readability because our code is written using the interface to a single monoid making our algorithms simpler to read.

Foldable Data



One last application to mention is the Data.Foldable library. This provides a generic approach to walking through a datastructure, accumulating values as we go. The foldMap function applies a function to each element of our structure and then accumulates the return values of each of these applications. An implementation of foldMap for a tree structure might be:


> data Tree a = Empty | Leaf a | Node (Tree a) a (Tree a)

> instance Foldable Tree where
>     foldMap f Empty = mempty
>     foldMap f (Leaf x) = f x
>     foldMap f (Node l k r) = foldMap f l `mappend` f k `mappend` foldMap f r


We can now use any of the monoids discussed above to compute properties of our trees. For example, we can use the function (== 1) to test whether each element is equal to 1 and then use the Any monoid to find out if any element of the tree is equal to 1. Here are a pair of examples: one to compute whether any element is equal to 1, and another to test whether every element is greater than 5:


> tree = Node (Leaf 1) 7 (Leaf 2)

> ex7 = foldMap (Any . (== 1)) tree
> ex8 = foldMap (All . (> 5)) tree


Note, of course, that these expressions can be used, unmodified, with any foldable type, not just trees.

I hope you agree that this expresses our intentions in a way that is easy to read.

That suggests another exercise: write something similar to find the minimum or maximum element in a tree. You may need to construct a new monoid along the lines of Any and All. Try finding both in one traversal of the tree using the product monoid.
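If you want a hint, here's one way you might shape a 'minimum' monoid (a sketch of my own; I'm using Maybe to get a mempty for the empty case, and ex9 is my own addition):

> newtype Minimum a = Minimum (Maybe a) deriving Show

> instance Ord a => Monoid (Minimum a) where
>     mempty = Minimum Nothing
>     Minimum Nothing `mappend` y = y
>     x `mappend` Minimum Nothing = x
>     Minimum (Just a) `mappend` Minimum (Just b) = Minimum (Just (min a b))

> ex9 = foldMap (Minimum . Just) tree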

The foldable example also illustrates another point. The implementor of foldMap for the tree doesn't need to worry about whether the left tree should be combined with the central element before the right tree. Associativity means it can be implemented either way and give the same results.

Recap


Monoids provide a general approach to combining and accumulating values. They allow us to write code that is agnostic about the method we will use to combine values, and that makes our code more reusable. By using named monoids we can write type signatures that express our intentions to people reading our code: for example by using Any instead of Bool we make it clear just how our boolean value is to be used. And we can combine the monoid-based building blocks provided by Haskell libraries to build useful and readable algorithms with a minimum of effort.

Some final notes: mathematicians often refer to mappend as a 'binary operator' and often it's called 'multiplication'. Just like in ordinary algebra, it's often written with abuttal or using the star operator, ie. ab and a*b might both represent a `mappend` b. You can read more about monoids at Wikipedia. And I wish I had time to talk about monoid morphisms, and why the list monoid is free (and what consequences that might have for how you write your code), and how compositing gives you monoids and a whole lot more.


Friday, January 02, 2009

Rewriting Monadic Expressions with Template Haskell

The goal today is to implement an impossible Haskell function. But as this is a literate Haskell post we need to get some boilerplate out of the way:


> module Test where

> import Language.Haskell.TH
> import Control.Monad
> import Control.Monad.Reader
> import Control.Monad.Cont
> import IO

> infixl 1 #


So back to the impossible function: all monads m come equipped with a function of type a -> m a. But it's well known that you can't "extract elements back out of the monad" because there is no function of type m a -> a. So my goal today will be to write such a function. Clearly the first line of code ought to be:


> extract :: Monad m => m a -> a


The rest of the post will fill in the details.

The type declaration tells us what role it plays syntactically, ie. it tells us how we can write code with extract that type checks. But what should the semantics be?

For the IO monad an answer is easy to guess: 1 + extract (readLn :: IO Int) should execute readLn by reading an integer from stdin, strip off the IO part of the return value and then add 1 to the result. In fact, Haskell already has a function that does exactly that, unsafePerformIO. The goal here is to implement extract in a way that works with any monad.

What might we expect the value of 1 + extract [1,2,3] to be? The value of extract [1,2,3] surely must be 1, 2 or 3. But which one? And what happens if the list is empty? There really isn't any way of answering this while remaining *purely* functional. But if we were running code on a suitable machine we could fork three threads, returning 1, 2 and 3 respectively, and then collect the results together in a list. In other words, we'd expect the final result to be [2,3,4]. This would make extract a lot like McCarthy's ambiguous operator.

So it's clear that we can't interpret extract as a pure function, though we could try implementing it on a new abstract machine. But there's another approach we could take. A couple of times I've talked about a function of type ~~a -> a where ~a is the type a -> Void and Void is the type with no elements. This corresponds to double negation elimination in classical logic. The Curry-Howard isomorphism tells us no such function can be implemented in a pure functional language, but we can translate expressions containing references to such a function into expressions that are completely pure. This is the so-called CPS translation. Anyway, I had this idea that we could do something similar with monads so that we could translate expressions containing extract into ordinary Haskell code. Turns out there's already a paper on how to do this: Representing Monads by Andrzej Filinski.

To translate all of Haskell this way would be a messy business. But just for fun I thought I'd implement a simpler translation for a small subset of Haskell. It's simply this:

For a choice of monad m denote the translation of both types and values by T. So x::a becomes T(x)::T(a). The translation is simply:


T(a) = m a on types
T(f x) = T(f) `ap` T(x)
T(extract x) = join T(x)


The important thing is that extract of type m a -> a is replaced by join of type m (m a) -> m a.
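For example, working the translation through by hand in the list monad (my own illustration; the list literal [1,2] translates via the list rules below to liftM2 (:) (return 1) (liftM2 (:) (return 2) (return [])), which simplifies to return [1,2]):

extract [1,2] + extract [10,20]

becomes

liftM2 (+) (join (return [1,2])) (join (return [10,20]))
  = liftM2 (+) [1,2] [10,20]
  = [11,21,12,22]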

A great way to translate Haskell code is with Template Haskell. So here's some code:


> (#) x y = liftM2 AppE x y

> rewrite :: Exp -> ExpQ

> rewrite (AppE f x) = do
>     e <- [| extract |]
>     if f==e
>       then [| join |] # rewrite x
>       else [| ap |] # rewrite f # rewrite x


Most of the rest is support for some forms of syntactic sugar. First the infix operators:


> rewrite (InfixE (Just x) f Nothing) =
>     [| fmap |] # return f # rewrite x

> rewrite (InfixE (Just x) f (Just y)) =
>     [| liftM2 |] # return f # rewrite x # rewrite y

> rewrite (InfixE Nothing f (Just y)) =
>     [| fmap |] # ([| flip |] # return f) # rewrite y


And list operations. For example [a,b,c] is sugar for a : b : c : [].


> rewrite (ListE []) = [| return [] |]

> rewrite (ListE (x:xs)) =
>     [| liftM2 |] # [| (:) |] # rewrite x # rewrite (ListE xs)

> rewrite x =
>     [| return |] # return x

> test :: ExpQ -> ExpQ
> test = (>>= rewrite)


extract itself is just a placeholder that is supposed to be translated away:


> extract = undefined


If the above is placed in a file called Test.lhs then you can try compiling the following code.


{-# LANGUAGE TemplateHaskell #-}

import Test
import Control.Monad
import Control.Monad.Reader
import Control.Monad.Cont

ex1 = $(test [|
extract [1,2] + extract [10,20]
|])
ex2 = runReader $(test [|
extract ask + 7
|]) 10
ex3 = $(test [|
1+extract (readLn :: IO Int)
|] )


The big omission is the lack of translation for lambda abstractions. I think that to get this right might require translating all of the code using extract from the ground up, not just isolated expressions like those above. And as with CPS, you lose referential transparency and the order in which expressions are evaluated makes a difference.

Anyway, this is a partial answer to the question posed here on "automonadization".