Search This Blog

Saturday, October 31, 2009

Buffon's Needle, the Easy Way

Buffon's needle is a popular probability problem. Rule lines on the floor a distance d apart and toss a needle of length l<d onto it. What is the probability that the needle crosses a line? A solution is described at wikipedia but it involves a double integral and some trigonometry. Nowhere does it mention that there is a less familiar but much simpler proof, though if you follow the links you'll find it. In addition, the usual solution involves π but gives little intuition as to why π appears. The simpler proof reveals that it appears naturally as a ratio of the circumference of a circle to its diameter. I've known this problem since I was a kid and yet I hadn't seen the simpler proof until a friend sold me his copy of Introduction to Geometric Probability for $5 a few days ago.

So instead of solving Buffon's needle problem we'll solve what appears to be a harder problem: when thrown, what is the expectation of the number of times a rigid curved (in a plane) wire length l (no restriction on l) crosses one of our ruled lines d apart? Here's an example of one of these 'noodles'. It crosses the ruled lines three times:

Expectation is linear in the sense that E(A+B) = E(A)+E(B). So if we imagine the wire divided up into N very short segments of length l/N the expectation for the whole wire must be the sum of the expectations for all of the little pieces. If the wire is well behaved, for N large enough the segments are close to identical straight line segments. Here's a zoomed up view of a piece of our noodle:

For a small straight line segment the expectation must simply be a function of the length of the segment. The expectation for the whole wire is the expectation for one segment multiplied by the number of segments. In other words, the expectation is proportional to the length of the wire and we can write E(l)=kl for some constant k.

Now we know it's proportional to the length, we need to find the constant of proportionality, k. We need to 'calibrate' by thinking of a noodle shape where we know in advance exactly how many times it will cross the ruled lines. The following picture gives the solution:

A circle of diameter d will almost always cross the lines in two places. The length of this wire is πd so E(πd)=2 and k=2/πd.

The expected number of crossings for a wire of length l is 2l/πd. A needle of length l<d can intersect only zero or one times. So the expected value is in fact the probability of intersecting a line. The solution is 2l/πd.

No integrals needed.

The expected number of crossings is an example of an invariant measure, something I've talked about before. There are only a certain number of functions of a noodle that are additive and invariant under rotations and just knowing these facts is almost enough to pin down the solution.

Puzzle


Now I can leave you with a puzzle to solve. In the UK, a 50p coin is a 7 sided curvilinear polygon of constant width. Being constant width means a vending machine can consistently measure its width no matter how the coin is oriented in its plane. Can you use a variation of the argument above to compute the circumference of a 50p coin as a function of its width?


Tuesday, October 13, 2009

"What Category do Haskell Types and Functions Live In?"

The question in my title is one that is often raised by Haskell programmers and it's a difficult one to answer rigorously and satisfyingly. But you may notice that I've put the question in quotes. This is because I'm not asking the question myself. Instead I want to argue that often there's a better question to ask.

Superficially Haskell looks a lot like category theory. We have types that look like objects and functions that look like arrows. Given two functions we can compose them just how arrows compose in a category. We also have things that look like products, coproducts, other kinds of limit including infinite ones, natural transformations, Kan extensions, monads and quite a bit of 2-categorical structure.

So what goes wrong? (Besides the obvious problem that on a real computer, composing two working functions might result in a non-working function because you run out of memory.)

Among other things, Haskell functions can fail to terminate because of things like infinite loops. Computer scientists often use the notation ⊥ to represent a non-terminating computation. So when we talk of the Haskell integers, say, we don't just mean the values 0, 1, 2, ... but we also have to include ⊥. Unfortunately, when we do this we break a few things. For one thing we no longer have coproducts. But people find it useful to talk about algebraic datatypes as constructing types using products and coproducts and that would be a hard thing to give up.

So we could restrict ourselves to considering only the category theory of computable functions. But that's not a trivial thing to do either, and it doesn't reflect what real Haskell programs do.

But even if we did manage to tweak this and that to get a bona fide category out of Haskell, all we'd get is a custom tailored category that serves just one purpose. One theme running through much of my blog is that Haskell can be used to gain an understanding of a nice chunk of elementary category theory in general. Showing that Haskell simply gives us one example of a category really isn't that interesting. When I talked about the Yoneda Lemma I felt like I was talking about more than just one property of some obscure category that I can't actually define and that most category theorists have never even heard of.

So what's going on? Why does it feel like Haskell is so naturally category theoretical while the details are so messy?

Going back to my Yoneda lemma code, consider my definition of check

> check a f = fmap f a

It's straightforward to translate this into standard category theoretical notation that applies to any category. Even though the code is implemented in a specific programming language there's nothing about it that prevents it being translated for use in any category. So it doesn't matter what category Haskell corresponds to. What matters is that this bit of code is written in language suitable for any category. And the proof I give can be similarly translated.

Consider this standard problem given to category theory students: prove that (A×B)×C is isomorphic to A×(B×C). In Haskell we could construct the isomorphism as:

> iso :: ((a,b),c) -> (a,(b,c))
> iso ((x,y),z) = (x,(y,z))

But now we hit a problem. We can straightforwardly translate this into mathematical notation and it will give a valid isomorphism in the category of sets, Set. But iso is written to accept arguments which are elements of some type. Not all objects in categories have elements, and arrows might not correspond to functions. And even if they did, if we were working with (certain types of) topological spaces we'd be giving a construction for the isomorphism, and our proof would show the underlying function had an inverse, but we'd be failing to show it's continuous. It looks like writing Haskell code like this only tells us about a particularly limited type of category.

But not so. Type cabal install pointfree to install pointfree and then run pointfree 'iso ((x,y),z) = (x,(y,z))' and it responds with

> iso = uncurry (uncurry ((. (,)) . (.) . (,)))

pointfree rewrites a function in point-free style. There are no x's, y's or z's in the written version, only uncurry, composition (.), and the product function (,). These exist in all Cartesian closed categories (CCC). So our original function definition, despite apparently referring to elements, can be mechanically turned into a definition valid for any CCC. We can now reinterpret the meaning of x, y and z in the original definition as not referring to elements at all, but as labels indicating how a bunch of fairly general categorically defined primitives are to be assembled together.

(Incidentally, my first foray into pure functional programming was to write a SASL compiler. It was little more than a bunch of rewrite rules to convert SASL code into point-free compositions of S, K and I, among other combinators.)

What we have here is an example of an internal language at work. I'm not sure what a precise definition of "internal language" is, but it's something like this: take a formal system and find a way to translate it to talk about categories in such a way that true propositions in one are turned into true propositions in the other. The formal system now becomes an internal language for those categories.

The best known example is topos theory. A topos is a category that has a bunch of properties that make it a bit like Set. We take a subset of the language of set theory that makes use of just these properties. Our propositions that look like set theory can now be mechanically translated into statements valid of all toposes. This means we can happily write lots of arguments referring to elements of objects in a topos and get correct results.

In their book Introduction to Higher-Order Categorical Logic, Lambek and Scott showed that "pure typed λ-calculus" is the internal language of CCCs. Even though expressions in the λ-calculus contain named variables these can always be eliminated and replaced by point-free forms. Theorems about typed λ-calculus are actually theorems about CCCs. When we write Haskell code with 'points' in it, we don't need to interpret these literally.

So despite not knowing which category Haskell lives in, much of the code I've written in these web pages talks about a wide variety of categories because Haskell is essentially a programming language based on an internal language (or a bunch of them). Despite the fact that even a function like iso might have quite complex semantics when run on a real computer, the uninterpreted programs themselves often represent completely rigorous, and quite general pieces of category theory.

So the question to ask isn't "what category does Haskell live in?" but "what class of category corresponds to the internal language in which I wrote this bit of code?". I partly answer this question for do-notation (a little internal language of its own) in an earlier post. Haskell (and various subsets and extensions) is essentially a way to give semantics to internal languages for various classes of category. However complicated and messy those semantics might get on a real world computer, the language itself is a thing of beauty and more general than might appear at first.

BTW This trick of reinterpreting what look like variables as something else was used by Roger Penrose in his abstract index notation. Just as we can sanitise variables by reinterpreting them as specification for plumbing in some category, Penrose reinterpreted what were originally indices into arrays of numbers as plumbing in another category. Actually, this isn't just an analogy. With a little reformatting abstract index notation is very close to the way I've been using monads to work with vector spaces so that abstract index notation can be viewed as a special case of an internal language for categories with monads.



Saturday, October 10, 2009

Vectors, Invariance, and Math APIs

Many software libraries, especially those for physics or 3D graphics, are equipped with tools for working with vectors. I'd like to point out how in these libraries the functions for manipulating vectors sometimes have special and useful properties that make it worthwhile to separate them out into their own interface.

Two types of Function


Suppose an object of mass m is moving with velocity v and we apply force f to it for time t. What is the final velocity? This is given by g:

g(m,t,f,v) = v+(t/m)*f

Now suppose that R is a rotation operation, typically represented by a matrix. What happens if we apply it to both of the vector arguments of g?

g(m,t,Rf,Rv) = Rv+(t/m)*Rf = Rg(m,t,f,v)

In other words, rotating the vector arguments is the same as rotating the vector result.

Another example: Consider the function that gives the force on an electric charge as a function of its velocity and the magnetic field:

f(e,v,B) = ev×B

It's essentially just the cross product. If you rotate both of the arguments to the cross product then the result is rotated too. The result is that

f(e,Rv,RB) = Rf(e,v,B)


On the other hand, many 3D APIs come with a function to perform componentwise multiplication of vectors. Write vectors x as triples (x0,x1,x2), and so on, we can write such a function as:

f(x,y) = (x0y0,x1y1,x2y2)

You can show that this doesn't have a similar property.

Rotational Invariance


To make things easy, let's restrict ourselves to functions of scalars and vectors. And when I say vector, I'm talking strictly about vectors representing magnitude and direction, but not positions. Examples of such vectors are velocities, accelerations, angular velocities, magnetic fields, and the difference between two positions. A function is said to be rotationally invariant if applying a rotation R to all of its vector arguments results in the same thing as applying R to all of the vectors in its value. This allows you to have a function that returns multiple vectors, like a tuple or array.

The first two functions I described above were rotationally invariant but the third wasn't. Notice how the first two examples also described physical processes. This is the important point: as far as we know, all of the laws of physics are rotationally invariant. If you write down an equation describing a physical process then replacing all of the vectors in it by their rotated counterparts must also result in a valid equation. So if you're describing a physical process with a computer program, and you end up with a function that isn't rotationally invariant, you've made a mistake somewhere.

Vector APIs


Vector APIs frequently come with all manner of functions. Some have the invariance property and some don't. If you write code that you'd like to be rotationally invariant, but it turns out that it isn't, you usually have to examine the code to find the bug. But if you separate the invariant functions into their own interface, and then write code using just that interface, the code is guaranteed to be invariant. If your programming language has reasonably strict types then you may even be able to arrange things so that the type signature of the function alone is enough to tell you that the function is invariant. In effect you are able to make the compiler prove that your function is invariant.

(As an aside, this is an example of why a good type system does much more than you might at first have guessed. They don't just stop you making typos, they can do things like prove that your programs satisfy certain geometrical properties.)

So what functions would you have in such an API? Among the essential rotationally invariant functions are:

1. Multiplcation of a vector by a scalar
2. Addition of vectors
3. Dot product
4. Cross product

In terms of these you can build functions such as

1. Vector length
2. Vector normalization
3. Rotation of one vector around an axis specified by another vector
4. Linear interpolation between vectors

What kinds of functions would be excluded?

1. Constructing a vector from three scalars, ie. f(x,y,z) = (x,y,z).
2. Constructing a vector form a single scalar, ie. f(x) = (x,x,x).
3. Extracting the ith component of a vector, ie. f(i,(x0,x1,x2)) = xi.
4. Pointwise multiplication of vectors.
5. Computing the elementwise cosine of a vector.

On seeing the first excluded example above you might ask "how am I supposed to construct vectors?" The point is that you don't program exclusively with an invariant API, you simply use it whenever you need to prove invariance.

Coordinate System Invariance


One purpose of writing to a particular interface is that it allows the API to hide implementation details from the user. Using a rotationally invariant API has a role to serve here. For example, many 3D renderers allow you to write shaders. These are essentially functions that compute the colour of a piece of geometry that needs rendering. You write a shader and the renderer then calls your shader as needed when a fragment of geometry passes through its pipeline. Frequently these are used for lighting calculations but there are all kinds of other things that may be computed in shaders.

In a 3D renderer different parts of the computation are often performed in different coordinate systems. For example it may be convenient to perform lighting calculations in a coordinate system oriented with the direction of the light. But the author of a renderer doesn't want to be committed to a particular choice. In order to do this, it is essential to be able to write shaders that are agnostic about which coordinate system is being used. If we work with rotationally invariant functions, our shaders are guaranteed to be agnostic in this way (assuming that the only kind of coordinate change that takes place is a rotation).

Lots More Types


I've concentrated on just one type of invariance, rotational invariance. If we consider more types of invariance then more types of interface naturally emerge. It would take too long to cover all of the details here so I'm just going to briefly sketch the beginnings of the more general case. So just read this section as a list of pointers to further reading.

For example, some functions are invariant under translations. These can be thoght of as functions of points in space. If we allow more general linear transformations then we find that some common functions transform 'oppositely' to vectors. In particular, normals to surfaces transform in this way. In fact, Pixar's Renderman has three distinct types, vectors, points and normals that captures these different invariances.

If we go back to rotations again but now extend these by allowing reflections then we find an interesting new phenomenon. For example, consider the result of reflecting in the x-y-plane, followed by reflecting in the y-z-plane followed by reflecting in the x-z-plane. This simply multiplies vectors by -1. Dot product is invariant under this: (-x)·(-y)=x·y. But cross product isn't because (-x)×(-y)=x×y. Even though the cross product is apparently vector, it doesn't get multiplied by -1. When we start considering invariance under reflection we find that some vectors behave differently. These are the pseudovectors and in effect they have their own separate type and interface. Interestingly, nature likes to keep pseudovectors and vectors separate except in parity violating phenomena. There are even pseudoscalars.

Incidentally, if you consider invariance under scaling you're led to the idea of encoding dimensions in your types.

Conclusion


If you're writing a vector API think about the invariance properties that your functions may have. If any are useful then it may be worth placing those in a separate interface. The more distinct types you have, the more properties you can make your compiler prove. Obviously this needs to be balanced against practicality, complexity for users and what you actually need. To some extent, many existing APIs make some of these distinctions with varying degrees of strictness. The main point I want to make clear is that these distinctions are based on invariance properties, something that not all developers of such APIs are aware of.

At some point I hope to return to this topic and enumerate all of the common vector-like types in a single framework. Unfortunately it's a big topic and I've only been able to scratch the surface here. In particular there are some subtle interplays between dimensions and types.

On a deeper level, I think there must be some type theoretical framework in which these invariance properties are free theorems.

Update: I believe some of this material is covered in Jimm Blinn's Vectors and Geometry and Objects, Oh My!, but I don't have access to that. I suspect that there is one big difference in my presentation: I'm not so interested here in vectors (or normals or whatever) in themselves but as defining interfaces to functions with invariance properties. Like the way category theorists shift the focus from objects to arrows. It makes a difference because it immediately gives theorems that our code is guaranteed to satisfy. It's the invariance property of the cross product (say) that is useful here, not the fact that the components of a vector transform a certain way when we change coordinates (because I might not even want to refer to coordinates in my code).

Example Code



To show that I'm talking about something very simple, but still powerful, here's some Haskell code:


> data Triple = T Float Float Float deriving Show

> class Vector v where
> (.+) :: v -> v -> v
> (.*) :: Float -> v -> v
> dot :: v -> v -> Float
> cross :: v -> v -> v

> instance Vector Triple where
> T x y z .+ T u v w = T (x+u) (y+v) (z+w)
> a .* T x y z = T (a*x) (a*y) (a*z)
> dot (T x y z) (T u v w) = x*u+y*v+z*w
> cross (T x y z) (T u v w) = T
> (y*w-v*z)
> (z*u-x*w)
> (x*v-y*u)


You can freely apply the four primitive functions to elements of type Triple but if you have a function of, say, signature


> f :: Vector v => (v,v,Float) -> [(v,v)]


you are guaranteed it is invariant.

Tuesday, September 29, 2009

test, ignore


\int_{0}^{1}\frac{x^{4}\left(1-x\right)^{4}}{1+x^{2}}dx
=\frac{22}{7}-\pi


You know, I think I won't delete this. It's a celebration of my having embedded some TeX using the advice at Bot Cyborg.

Saturday, September 26, 2009

Finite Differences of Types

Finite Differences of Real-Valued Functions


Conor McBride's discovery that you can differentiate container types to get useful constructions like zippers has to be one of the most amazing things I've seen in computer science. But seeing the success of differentiation suggests the idea of taking a step back and looking at finite differences.

Forget about types for the moment and consider functions on a field R. Given a function f:R→R we can define Δf:R×R→R by


Δf(x,y) = (f(x)-f(y))/(x-y)


Δ is the finite difference operator. But does it make any kind of sense for types? At first it seems not because we can't define subtraction and division of types. Can we massage this definition into a form that uses only addition and multiplication?

First consider Δc where c is a constant function. Then Δc(x,y)=0.

Now consider the identity function i(x)=x. Then Δi(x,y)=1.

Δ is linear in the sense that if f and g are functions, Δ(f+g) = Δf+Δg.

Now consider the product of two functions, f and g.


Δ(fg)(x,y) = (f(x)g(x)-f(y)g(y))/(x-y)

= (f(x)g(x)-f(x)g(y)+f(x)g(y)-f(y)g(y))/(x-y)

= f(x)Δg(x,y)+g(y)Δf(x,y)


So now we have a Leibniz-like rule. We can compute finite differences of polynomials without using subtraction or division! What's more, we can use these formulae to difference algebraic functions defined implicitly by polynomials. For example consider f(x)=1/(1-x). We can rewrite this implicitly, using only addition and multiplication, as

f(x) = 1+x f(x)

Differencing both sides we get

Δf(x,y) = x Δf(x,y)+f(y)

That tells us that Δf(x,y) = f(x)f(y).

Finite Differences of Types


We're now ready to apply our operator to types. Instead of functions on the reals we work with functors on the set of types. A good first example container is the functor F(X)=XN for an integer N. This is basically just an array of N elements of type X. We could apply the Leibniz rule repeatedly, but we expect to get the same result as if we'd worked over the reals. So setting f(x)=xN we get


Δf(x,y) = (xN-yN)/(x-y) = xN-1+xN-2y+xN-3y2+...+yN-1

So we know that on types, ΔF(X,Y) = XN-1+XN-2Y+...+YN-1.

There's a straightforward interpretation we can give this. Differentiating a type makes a hole in it. Finite differencing makes a hole in it, but everything to the left of the hole is of one type and everything on the right is another. For example, for F(X)=X3, ΔF(X,Y)=X2+XY+Y2 can be drawn as:

If you've been reading the right papers then at this point it should all become familiar. Finite differencing is none other than dissection, as described by Conor in his Jokers and Clowns paper. I don't know if he was aware that he was talking about finite differences - the paper itself talks about this being a type of derivative. It's sort of implicit when he writes the isomorphism:

right :: p j + (Δp c j , c) → (j , Δp c j ) + p c

With a little rearrangement this becomes the definition of finite difference.

Now that we've recognised dissection as finite difference we can reason informally about dissection using high school algebra. For example, we already know that lists, defined by L(X) = 1+X L(X) can be informally thought of as L(X)=1/(1-X). So using the example I gave above we see that ΔL(X,Y)=1/((1-X)(1-Y)) = L(X)L(Y). So the dissection of a list is a pair of lists, one for the left elements, and one for the right elements. Just what we'd expect.

Another example. Consider the trees defined by T(X)=X+T(X)2. Informally we can interpret this as T(X)=(1+√(1-4X))/2. A little algebraic manipulation, using (√x-√y)(√x+√y) = x-y shows that

ΔT(X,Y) = 1/(1-(T(X)+T(Y))


In other words, a dissection of a tree is a list of trees, each of which is a tree of X or a tree of Y. This corresponds to the fact that if you dissect a tree at some element, and then follow the path from the root to the hole left behind, then all of the left branches (in blue) are trees of type X and all of the right branches (in red) are trees of type Y.

If you're geometrically inclined then you can think of types with holes in them as being a kind of tangent to the space of types. Along those lines, dissections become secants. I think this geometric analogy can be taken a lot further and that in fact a non-trivial piece of differential geometry can be made to work with types. But that's for another day.

Oh, I almost forgot. Derivatives are what you get when you compute finite differences for points really close to each other. So I hope you can see that Δf(x,x)=df/dx giving us holes in terms of dissections. Conor mentions this in his paper.

We should also be able to use this approach to compute finite differences in other algebraic structures that don't have subtraction or division.

I can leave you with some exercises:

1. What does finite differencing mean when applied to both ordinary and exponential generating functions?

2. Can you derive the "chain rule" for finite differences? This can be useful when you compute dissections of types defined by sets of mutually recursive definitions.

3. Why is right, defined above, a massaged version of the definition of finite difference? (Hint: define d=((f(x)-f(y))/(x-y). In this equation, eliminate the division by a suitable multiplication and eliminate the subtraction by a suitable addition. And remember that (,) is Haskell notation for the product of types.)

Sunday, September 13, 2009

More Parsing With Best First Search


> {-# LANGUAGE NoMonomorphismRestriction,GeneralizedNewtypeDeriving #-}


I have three goals in this post:

1. Refactoring the technique in my previous post so that building the search tree is entirely separate from searching the tree.
2. Making it work with real-valued weights, not just integers
3. Applying it to an ambiguous parsing problem, making use of a type class to define an abstract grammar.


> import Control.Arrow
> import Control.Monad
> import Control.Monad.Instances
> import Control.Monad.State
> import Data.Either
> import Data.Function
> import Random
> import qualified Data.List as L
> import qualified Data.Map as M


Search Trees


The idea is that I want to search a tree of possibilities where each edge of the tree is marked with a weight. The goal will be to search for leaves that minimise the sum of the weights of the edges down to the leaf.

Here's an example tree:

The minimum weight leaf is at C. If we're working with probabilities then we'll use minus the log of the probability of a branch as the weight. That way multiplication of probabilities becomes additions of weights, and the likeliest leaf has the minimum weight path.

So here's the definition of a search tree. I've given both leaves and edges weights:


> data Search c a = Leaf { lb::c, leaf::a}
> | Choice { lb::c, choices::[Search c a] } deriving Show


(Compare with this.) lb is short for 'lower bound'. It provides a lower bound for the total weight of any option in this subtree (assuming non-negative weights). The tree in the diagram would look like:


> ex1 = Choice 0 [
> Choice (-log 0.1) [
> Leaf (-log 0.5) 'A',
> Leaf (-log 0.5) 'B'],
> Choice (-log 0.2) [
> Leaf (-log 0.6) 'C',
> Leaf (-log 0.4) 'D']]


This tree is a container in a straightforward way and so we can make it an instance of Functor:


> instance Functor (Search c) where
> fmap f (Leaf c a ) = Leaf c $ f a
> fmap f (Choice c as) = Choice c $ map (fmap f) as


But it's also a monad. >>= maps all of the elements of a tree to trees in their own right, and then grafts those trees into the parent tree:


> instance Num c => Monad (Search c) where
> return = Leaf 0
> a >>= f = join $ fmap f a where
> join (Leaf c a ) = Choice c [a]
> join (Choice c as) = Choice c $ map join as


It's easy to make trees into a MonadPlus by simply grafting trees into a new root. MonadPlus is meant to be a monoid, but this operation, as written below, isn't precisely associative. But it's 'morally' associative in that two terms that are meant to be equal describe equivalent search trees. So I'm not going to lose any sleep over it:


> instance Num c => MonadPlus (Search c) where
> mzero = Choice 0 []
> a `mplus` b = Choice 0 [a,b]


For our searching we'll need a priority queue. I'll use a skew tree with code I lifted from somewhere I've forgotten:


> data Ord a => Tree a = Null | Fork a (Tree a) (Tree a) deriving Show

> isEmpty :: Ord a => Tree a -> Bool
> isEmpty Null = True
> isEmpty (Fork x a b) = False

> minElem :: Ord a => Tree a -> a
> minElem (Fork x a b) = x

> deleteMin :: Ord a => Tree a -> Tree a
> deleteMin (Fork x a b) = merge a b

> insert :: Ord a => a -> Tree a -> Tree a
> insert x a = merge (Fork x Null Null) a

> merge :: Ord a => Tree a -> Tree a -> Tree a
> merge a Null = a
> merge Null b = b
> merge a b
> | minElem a <= minElem b = connect a b
> | otherwise = connect b a

> connect (Fork x a b) c = Fork x b (merge a c)


At each stage in the search we'll pick the unexplored branch with the lowest total weight so far. So when we compare trees we'll compare on their lower bounds. So we need an ordering on the trees as follows:


> instance (Num c) => Eq (Search c a) where
> (==) = (==) `on` lb

> instance (Num c,Ord c) => Ord (Search c a) where
> compare = compare `on` lb


The real cost of a choice isn't just the weight immediately visible in the tree but the cost of the journey you took to get there. We use the bumpUp function to put that extra cost into the part of the tree we're currently looking at:


> bumpUp delta (Leaf c a) = Leaf (delta+c) a
> bumpUp delta (Choice c as) = Choice (delta+c) as


The only tricky aspect to this code is that we need to be able to handle infinite trees. We can't have our code simply go off and return when it's found the next match because it might not be possible to do so in a finite time. Instead, the code needs to perform one operation at a time and report what it found at each point, even if that report is just stalling for time. We do this by returning a (possibly infinite) list containing elements that are either (1) the next item found or (2) a new update giving more information about the lower bound of the cost of any item that might be yet to come. This allows the caller to bail out of the search once the cost has passed a certain threshold.

(Returning a useless looking constructor to stall for time is a common design pattern in Haskell. It's an example of how programs that work with codata need to keep being productive and you get something similar with the -|Skip|- in Stream Fusion. First time I write the code I failed to do this and kept wondering why my infinite searches would just hang, despite my great efforts to make it as lazy as possible.)


> runSearch :: (Num c,Ord c) => Tree (Search c a) -> [Either c a]
> runSearch Null = []
> runSearch queue = let
> m = minElem queue
> queue' = deleteMin queue
> in case m of
> Leaf c a -> Left c : Right a : runSearch queue'
> Choice c as -> Left c : (runSearch $ foldl (flip insert) queue' $ map (bumpUp c) as)


A quick test of an infinite search: finding Pythagorean triples by brute force. We give each integer as cost one more than the previous one:

I guess this is actually Dijkstra's algorithm, but on a tree rather than a general graph.


> integers m = Choice 1 [Leaf 0 m,integers (m+1)]

> test = do
> a <- integers 1
> b <- integers 1
> c <- integers 1
> guard $ a*a+b*b==c*c
> return (a,b,c)

> test1 = runSearch (insert test Null)


If you run test1 you'll notice how the output is noisy because of all those Left w terms. If you'e not worried about non-termination you could just throw out redundant output like so:


> reduce [] = []
> reduce (Left a : Left b : bs) = reduce (Left b : bs)


Might as well convert weights to probabilities while we're at it:


> reduce (Left a : bs) = Left (exp (-a)) : reduce bs
> reduce (Right a : bs) = Right a : reduce bs


This version should be a lot less chatty:


> test2 = reduce test1


Grammar


Now that searching works I can turn to an application - a more sophisticated example of what I briefly looked at previously), parsing with ambiguous grammars. So let me first build types to represent parsed sentences in a toy grammar:


> data Noun = Noun String deriving (Show,Eq,Ord)
> data Verb = Verb String deriving (Show,Eq,Ord)
> data Adj = Adj String deriving (Show,Eq,Ord)
> data Prep = Prep String deriving (Show,Eq,Ord)


The following two are noun phrase and prepositional phrase:


> data NP = NP [Adj] Noun deriving (Show,Eq,Ord)
> data PP = PP Prep Noun deriving (Show,Eq,Ord)


And entire sentences:


> data Sentence = Sentence [NP] Verb [NP] [PP] deriving (Show,Eq,Ord)


We want to be able to print parsed sentences so here's a quick 'unparse' type class to recover the underlying string:


> class UnParse a where
> unParse :: a -> String

> instance UnParse Noun where
> unParse (Noun a) = a

> instance UnParse Verb where
> unParse (Verb a) = a

> instance UnParse Adj where
> unParse (Adj a) = a

> instance UnParse Prep where
> unParse (Prep a) = a

> instance UnParse NP where
> unParse (NP a b) = concatMap unParse a ++ unParse b

> instance UnParse PP where
> unParse (PP a b) = unParse a ++ unParse b

> instance UnParse Sentence where
> unParse (Sentence a b c d) = concatMap unParse a ++ unParse b ++ concatMap unParse c ++ concatMap unParse d


Now I'm going to approach the problem of parsing ambiguous sentences in two ways. One will be efficient, and one will be inefficient but represent the 'ground truth' against which we'll compare. (This reflects standard practice in graphics publications where authors compare their fancy new algorithm with an ultra-slow but reliable Monte Carlo ray-tracer.)

I'm going to assume that sentences in my language are described by a "context free" probability distribution so that a noun phrase, say, has a fixed probability of being made up of each possible combination of constituents regardless of the context in which it appears.

I need an English word for something that takes a grammar and does something with it but I'm at a loss to think of an example. I'll use 'transducer', even though I don't think that's right.

So a transducer is built from either terminal nodes of one character, or it's one of a choice of transducers, each with a given probability:


> class Transducer t where
> char :: Char -> t Char
> choose :: [(Float,t a)] -> t a


And here's our toy grammar. It's nothing like an actual natural language because real grammars take a long time to get right. Note I'm just giving the first couple of type signatures to show that the grammar uses only the Monad and Transducer interfaces:


> string :: (Monad t, Transducer t) => [Char] -> t [Char]
> string "" = return ""
> string (c:cs) = do {char c; string cs; return (c:cs)}


So, for example, a noun has a 50% chance of being the string ab and a 50% chance of being the string ba:


> noun :: (Monad t, Transducer t) => t Noun
> noun = do
> a <- choose [(0.5,string "ab"),(0.5,string "ba")]
> return $ Noun a

> verb :: (Monad t, Transducer t) => t Verb
> verb = do
> a <- choose [(0.5,string "aa"),(0.5,string "b")]
> return $ Verb a

> adjective :: (Monad t, Transducer t) => t Adj
> adjective = do
> a <- choose [(0.5,string "ab"),(0.5,string "aa")]
> return $ Adj a

> parsePrep = do
> a <- choose [(0.5,string "a"),(0.5,string "b")]
> return $ Prep a


Some of our "parts of speech" allow sequences of terms. We need some kind of probabilistic model of how many such terms we can expect. I'm going to assume the probability falls off exponentially with the number of items:


> many :: (Monad t, Transducer t) => Float -> t a -> t [a]
> many p t = choose [
> (p,return []),
> (1-p,do
> a <- t
> as <- many p t
> return $ a:as)]


I also have a convenience function for sequences of length at least 1:


> many1 p t = do
> a <- t
> as <- many p t
> return (a:as)


And now the rest of the grammar:


> parseNP = do
> a <- many 0.5 adjective
> b <- noun
> return $ NP a b

> parsePP = do
> a <- parsePrep
> b <- noun
> return $ PP a b

> sentence = do
> a <- many 0.5 parseNP
> b <- verb
> c <- many 0.5 parseNP
> d <- many 0.5 parsePP
> return $ Sentence a b c d


We're going to use this grammar with two instances of type Transducer. The first will use the rules of the grammar as production rules to generate random sentences. The second will parse strings using the grammar. So we get two uses from one 'transducer'. This is pretty powerful: we have described the grammar in an abstract way that doesn't asuume any particular use for it.


> newtype Generator a = Generator { unGen :: State StdGen a } deriving Monad
> newtype Parser a = Parser { runParse :: (String -> Search Float (a,String)) }


Let's implement the generation first:


> instance Transducer Generator where
> char a = return a
> choose p = do
> r <- Generator (State random)
> case (L.find ((>=r) . fst) $ zip (scanl1 (+) (map fst p)) (map snd p)) of
> Just opt -> snd opt


We can test it by generating a bunch of random sentences:


> gen = mkStdGen 12343210
> generate n partOfSpeech = (unGen $ sequence (replicate n partOfSpeech)) `evalState` gen

> test3 = mapM_ print $ generate 10 sentence


We can now use generate-and-test to estimate what proportion of randomly generated sentences match a given sentence:


> generateAndTest n partOfSpeech chars = do
> a <- generate n sentence
> guard $ unParse a == chars
> return a

> collectResults n partOfSpeech chars = M.fromListWith (+) $ map (flip (,) 1) $
> generateAndTest n partOfSpeech chars
> countResults n partOfSpeech chars = mapM_ print $ L.sortBy (flip compare `on` snd) $
> M.toList $ collectResults n partOfSpeech chars

> test4 = countResults 100000 (noun :: Parser Noun) "abab"


On the other hand we can build a parser, based on Hutton's, just like in my previous post except using this new tree search monad:


> instance Monad Parser where
> return a = Parser (\cs -> return (a,cs))
> p >>= f = Parser (\cs -> do
> (a,cs') <- runParse p cs
> runParse (f a) cs')

> instance MonadPlus Parser where
> mzero = Parser (\cs -> mzero)
> p `mplus` q = Parser (\cs -> runParse p cs `mplus` runParse q cs)

> instance Transducer Parser where
> char c = Parser $ char' where
> char' "" = mzero
> char' (a:as) = if a==c then return (a,as) else mzero
> choose p = foldl1 mplus $ map (\(p,x) -> prob p >> x) p where
> prob p = Parser (\cs -> Leaf (-log p) ((),cs))

> goParse (Parser f) x = runSearch $ insert (f x) Null

> end = Parser (\cs -> if cs=="" then return ((),"") else mzero)

> withEnd g = do
> a <- g
> end
> return a

> normalise results = let total = last (lefts results)
> in map (\x -> case x of
> Left a -> a / total
> Right b -> b
> ) results

> findParse chars = mapM_ print $ reduce $ runSearch $
> insert (runParse (withEnd sentence) chars) Null


Results


And now we can try running both methods on the same string:


> main = do
> let string = "ababbbab"
> findParse string
> print "-------------------"
> countResults 1000000 (sentence :: Parser Sentence) string


You should see the parsings from countResults in roughly the same proportion as the relative probabilities given by findParse. Remember that the relative probability of a given parsing is the last Left p term before that parsing. Try playing with string, the number of Monte Carlo runs and the seed. Remember that there is going to be some variation in the randomised algorithm, especially with hard to parse strings, but raising the number of runs will eventually give reasonable numbers. Of course ultimately we don't care about the Monte Carlo method so it's allowed to be slow.

Anyway, none of this is a new algorithm. You can find similar things in papers such as Probabilistic tree transducers and A Generalization of Dijkstra's Algorithm. But what is cool is how easily Haskell allows us to decouple the tree building part from the searching part. (And of course the tree is never fully built, it's built and destroyed lazily as needed.) All of the published algorithms have the parsing and searching hopelessly interleaved so it's hard to see what exactly is going on. Here the search algorithm doesn't need to know anything about grammars, or even that it is searching for parsings.

Semiring Parsing is also easy to implement this way.

BTW If you think my "ab" language is a bit to contrived, check out the last picture here for an example of some natural language that is in a similar spirit :-)

Saturday, July 04, 2009

A Monad for Combinatorial Search with Heuristics

Haskell provides a great way to perform combinatorial searching with backtracking: the list monad. Do-notation provides a nice DSL that makes it easy to express the trying out of different possibilities. But the list monad only performs a simple-minded walk through all of the alternatives giving little opportunity to direct that walk. In particular, it's not easy to provide heuristics to say things like "try this alternative first but if it starts going badly consider this alternative too". This post contains a monad that gives a simple scheme to allow programmers to direct searches in this way.

First the Haskell administrativia...


> import Data.Char
> import Control.Monad
> import Data.Monoid
> import Data.List


When using the list monad, a list is interpreted as a list of candidates in a search. The join function for this monad takes a list of lists of candidates and flattens it into a list of candidates. This is all the list monad really does: you write code that generates new candidates from old, and the >>= function applies this code to all of the candidates it knows about and then flattens this back out to a single list of candidates. Importantly it does this in a lazy way so that you only need look at candidates as they are generated.

This new monad will keep slightly more information: each candidate will have a 'penalty' value attached to it saying how attractive a candidate it is. Candidates with score 0 will be tried first, and those with score n will be tried after those with lower scores. We can represent a collection of candidates and their scores simply as a list of lists. The first list in the list will have those with score 0, the second will have those with score 1 and so on. We'll call these lists penalty lists and the positions within those lists slots.

Here's the definiton of the penalty list type:


> data PList a = P { unO :: [[a]] } deriving (Show,Eq)


It's a functor in a straightforward way:


> instance Functor PList where
> fmap f (P xs) = P (fmap (fmap f) xs)


The rule we'll adopt is that if you're trying a combination of two candidates then the penalty associated with the combination is the sum of the penalties of the individual objects. To implement this we need an alternative version of the join operation. If we have a penalty list of penalty lists and we have an element in the mth slot in the nth penalty sublist then we want it to end up in the (m+n)th slot in the final penalty list. Within a slot we can just order the elements just like in the original list monad.


> headm :: Monoid m => [m] -> m
> headm (a:as) = a
> headm [] = mempty

> tailm :: Monoid m => [m] -> [m]
> tailm (a:as) = as
> tailm [] = []

> zipm :: Monoid m => [[m]] -> [m]
> zipm ms | all null ms = []
> zipm ms = let
> heads = map headm ms
> tails = map tailm ms
> h = mconcat heads
> t = zipm (filter (not . null) tails)
> in h : t

> instance Monad PList where
> return x = P [[x]]
> x >>= f = let P xs = (fmap (unO . f) x) in P (join xs) where
> join [] = []
> join (m:ms) = let
> part1 = zipm m
> part2 = join ms
> in headm part1 : zipm [tailm part1,part2]


Explaining how join is implemented would take many words so I hope this picture of the computation of an example will do instead. I used the Monoid class simply to avoid directly referring to one level of nesting of brackets. It is intended to be a proper implementation of a Monad satisfying the three monad laws but I haven't proved this and it's possible that it occasionally leaves trailing empty lists around - which have no impact on search results.



> instance MonadPlus PList where
> mzero = P []
> mplus (P xs) (P ys) = P (zipm [xs,ys])


We can use this much like the list monad. First it will search for possibilities with zero penalty. When these are exhausted it'll backtrack to the last place where it can start finding possibilities with penalty 1. Then it'll try penalty 2 and so on. Importantly it manages to do this lazily so that we don't explore penalty n+1 until we've finished penalty n.

So now we can start using it. We'll hunt for Pythagorean triples by simply hunting through all of the triples of integers. But we'll try to find solutions where the sum of the integers is as small as possible. So as list of candidate integers we use P [[1],[2],[3]...]. In other words, the integer n has penalty n-1. Here's the code:


> ex1 = do
> x <- P $ map (\x -> [x]) [1..]
> y <- P $ map (\y -> [y]) [1..]
> z <- P $ map (\z -> [z]) [1..]
> guard $ x*x+y*y==z*z
> return $ (x,y,z)


Of course we wouldn't really search for Pythagorean triples this way. This is just an illustration of how to use the code. But note, crucially, that the equivalent code using the regular list monad would give us back no solutions. It'd start with x=1 and y=1 and then go off to infinity finding candidates for z. So as a side effect the penalty list allows us to tame some infinite searches.

Anyway, that was a simple numerical example. But this monad can be used with much more complex kinds of search. In fact it almost serves as a drop-in replacement for the list monad. This is a really nice example of the way separation of concerns is easy in Haskell. The task of generating candidates for search can easily be separated from the task of selecting from those candidates, even though the operations are highly interleaved during execution.

So here's a more complex example: writing a parser that can tolerate errors without running into combinatorial explosion. The idea is that we associate a penalty with each error. The penalty will make the parser run on the assumption of no errors until it can no longer parse, and then it'll backtrack on the assumption of one error until that assumption is no longer tenable and so on. We can liberally sprinkle 'erroneous' parsings throughout our code confident that these branches will only be taken in the event that an error-free parsing can't be found.

Firstly, here's a penalty list that we can use to introduce a penalty of just 1.


> penalty :: PList ()
> penalty = P [[],[()]]


If we stick that in the code path then anything following acquires a penalty of 1.

Now we can write a parser. We can implement Hutton's parser in his monad parsers paper with very little modification. We simply replace the usual list with the penalty list and do away with the +++ operator to allow it to be a bit more liberal about backtracking. Here's the parser type:


> newtype Parser a = Parser (String -> PList (a,String))


We could have parameterised that with the underlying monad so that we could have parsers with a choice of search strategy.

The rest is a lot like in Hutton's paper:


> parse (Parser f) x = f x

> instance Monad Parser where
> return a = Parser (\cs -> P [[(a,cs)]])
> p >>= f = Parser (\cs -> do
> (a,cs') <- parse p cs
> parse (f a) cs')

> instance MonadPlus Parser where
> mzero = Parser (\cs -> mzero)
> p `mplus` q = Parser (\cs -> parse p cs `mplus` parse q cs)

> item :: Parser Char
> item = Parser (\cs -> case cs of
> "" -> mzero
> (c:cs) -> P [[(c,cs)]])

> sat :: (Char -> Bool) -> Parser Char
> sat p = do
> c <- item
> if p c then return c else mzero

> char :: Char -> Parser Char
> char c = sat (c ==)


Now for a simple parsing problem. We'll parse simple arithmetical expressions a lot like in Hutton's paper. But I'm going to tolerate two kinds of error:
1. The shift key doesn't always work so occasionally a shifted or unshifted version of a character may appear and
2. parentheses are occasionally left out by the clumsy user.

Now we can code up a simple grammar for this. First the mapping between shifted and unshifted characters (on a Mac US keyboard):


> lowers = "1234567890-=/"
> uppers = "!@#$%^&*()_+?"
> lower x = lookup x (zip uppers lowers)
> upper x = lookup x (zip lowers uppers)

> upperChar x = case upper x of
> Nothing -> mzero
> Just y -> char y >> return x

> lowerChar x = case lower x of
> Nothing -> mzero
> Just y -> char y >> return x


A version of penalty wrapped for the parser monad:


> avoid :: Parser ()
> avoid = Parser $ \cs -> do
> penalty
> return ((),cs)


Reading keys on the assumption that the shift key may have failed:


> keyChar x = char x `mplus` (avoid >> upperChar x) `mplus` (avoid >> lowerChar x)

> digit = do
> x <- foldl mplus mzero (map keyChar "0123456789")
> return (fromIntegral (ord x-ord '0'))

> number1 :: Integer -> Parser Integer
> number1 m = return m `mplus` do
> n <- digit
> number1 (10*m+n)

> number :: Parser Integer
> number = do
> n <- digit
> number1 n

> chainl :: Parser a -> Parser (a -> a -> a) -> a -> Parser a
> chainl p op a = (p `chainl1` op) `mplus` return a
> chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
> p `chainl1` op = do {a <- p; rest a}
> where
> rest a = (do
> f <- op
> b <- p
> rest (f a b)) `mplus` return a


Optional parentheses:


> shouldHave c = keyChar c `mplus` (avoid >> return c)


And the main part of the expression grammar:


> expr = term `chainl1` addop
> term = monomial `chainl1` mulop
> monomial = factor `chainl1` powop
> factor = number `mplus` do {shouldHave '('; n <- expr; shouldHave ')'; return n}
> powop = keyChar '^' >> return (^)
> addop = do {keyChar '+'; return (+)} `mplus` do {keyChar '-'; return (-)}
> mulop = do {keyChar '*'; return (*)} `mplus` do {keyChar '/'; return (div)}


Match the end of a string:


> end :: Parser ()
> end = Parser $ \cs ->
> if null cs then P [[((),"")]] else mzero


We can test it out with:


> completeExpr = do
> n <- expr
> end
> return n

> ex2 = parse completeExpr "2^(1+3"


When we run this we get no error-free parsing but we do get 3 readings with one error. One comes from reading the '(' as 9, one comes from inserting the missing ')' at the end and one comes from inserting ')' after '1'. Note that even for complex expressions we'll quickly find a 1- or 2-error parsing. For the regular list monad we might never get a parsing because there are an infinite number of ways of inserting parentheses.

Anyway, that was just a toy parsing problem. But a more complex application comes to my mind. Some written languages are tricky to parse because their orthography doesn't fully capture the phonetics of the original language, because there are few or no indicators of sentence or even word breaks, and because they have numerous optional orthographic and grammatical rules and use a script whose individual characters are occasionally hard to reliably identify. In such a situation it's good to have a parser driven by heuristics about what is likely to be intended and the penalty list monad might serve well. Here's an example of such a language.

Update: I forgot to add some connections to previous monads I've talked about:

  1. PList is a variation of the convolution monad I described here. It deals with the "wrong category" aspect so it is a true Haskell monad. Penalty lists form some kind of dual to the convolution comonad.
  2. It has much in common with this monad. That monad doesn't do anything smart about ordering searches but it does have the neat ability to 'fuse' different branches of a search so that different ways to arrive at the same place don't add to the combinatorial explosion. It's good for searches where you want to know what the minimum penalty is to get somewhere, but don't care what the best path actually is.

Also, in response to a comment on #haskell I've made the join example more complex so it's easier to generalise from it.

Saturday, June 20, 2009

Automata and the A-D-E classification.

Introduction


The A-D-E classification is a strange ubiquitous pattern that appears in many branches of mathematics. Typically it appears when you try to classify certain types of mathematical construction. If the A-D-E classification applies then you end up with two infinite sequences of cryptically named objects (A1,A2,A3,...) and (D1,D2,D3,...) as well as three leftover objects called E6, E7 and E8. Unfortunately, most of these objects and their classifications are tricky to define using only elementary mathematics. However, there is one type of object that is classified in this way that can be given a relatively straightforward computational description involving a little linear algebra and assuming you know a tiny bit about automata.

But first: why care about the A-D-E classification? Well I tried to say a little bit about how symmetries relate to nature a while back. Certain types of possible symmetry of particle physics can be classified the A-D-E way. The symmetry group corresponding to E8 is the now famous exceptional group E8. I won't be able to get to an explanation of how groups are involved. But at least I'll be able to give a hint about the bigger picture that E8 is part of.

Non-deterministic Finite State Automata


Here's a diagram representing a very simple non-deterministic finite state automaton (NDA):

It can be in one of two states. When in state A it can transition to state B and in state B it can only transition back to state B, but it can do so in two different ways.

Vector Automata


Now I'll introduce a more general kind of automaton: a vector automaton (VA). (I made that term up, it's not meant to correspond with anyone else's terminology.) Every vector automaton is built from an NDA. But each state corresponds to a finite dimensional vector space and each transition corresponds to a linear function mapping from the vector space of the source state to the vector space of the destination state. We could turn the above example into a VA by assigning a 1D vector space VA to A, a 2D vector space to VB and defining linear functions:

f : (x) -> (x,0)
g : (x,y) -> (-y,x)
h : (x,y) -> (y,-x)

A VA is just like an NDA in that it transitions from state to state according to the given transitions. But additionally it keeps track of a vector in the vector space corresponding to the current state. Each time it makes a transition the linear function corresponding to that function is applied to the vector. So in the example above, the NDA might start in state A with a scalar value x (ie. a 1D vector). When it makes its first transition its vector becomes the 2D vector (x,0) and after that each transition rotates the vector through 90 degrees clockwise or anticlockwise.

There's a lot of freedom in defining a VA given its underlying NDA. For each node you can pick any vector space you like of any finite dimension, and for each transition you can pick any linear function you like mapping between the source and target vector spaces.

Let's make this a little more formal. An NDA is a finite set of states combined with a finite set of transitions. Each transition has a source and destination, each of which is a state. That's it. You're allowed any finite number of transitions between states and a transition can have the same state as source and destination.

A VA is an NDA combined with a finite dimensional vector space attached to each state and a linear function for each transition such that the function maps from the vector space of the of the source to the vector space of the target.

Now consider this really simple NDA:

When we build a VA from this NDA, for convenience I'll call the vector spaces corresponding to A and B, A and B. And I'll call the linear function from A to B, f. How many VAs can be made from this VA? Clearly an infinite number. But a lot of them are very similar.

Suppose we assume A and B are 2-dimensional and write their elements as pairs (x,y). Suppose f(1,0)=u and f(0,1)=v and that u is not a multiple of v and both are non-zero. Then we can use u and v as a basis for B. If we write f as f' in this new basis we get f'(1,0)=(1,0) and f'(0,1)=(0,1). So by relabelling the basis of B we have actually revealed that an infinite number of choices for f reduce to the same thing apart from a change of basis in B.

Up to change of basis in A and B we find there are only three possibilities:
(1) u and v are distinct, non-zero, and not multiples of each other.
(2) u is non-zero but v is zero
(3) both u and v are zero

We started with an infinite number of possibilities for our particular choice of dimensions and ended up with just 3. We'll say that two VAs are equivalent if we can get one from the other by changing basis like this.

On the other hand, consider this NDA:

Let's choose A, B and C to be 1-dimensional vector spaces and define f, g and h as:

f(x) = x
g(x) = x
h(x) = λx

where λ is any real number. Then h(g(f(x)))=λx. We can choose any λ we like so we have an infinity of possibilities. No amount of basis change is going to change this fact. This is different from the case above because now we're comparing x and h(g(f(x))) which lie in the same vector space. So when our NDAs have loops in them, the space of possible VAs, for most choices of dimension, is infinite.

The sum of two VAs


If we have two VAs corresponding to the same NDA we can combine them together to make a single machine. The state is given by a state in the shared underlying NDA, but we now have a pair of vectors. Each time there is a transition we apply the pair of transforms to transform the two vector simultaneously. But we can encode a pair of vectors as a single vector simply by concatenating together the vector components in some basis. So this machine is just another VA. The dimension of the vector space for each state is simply the sum of the dimensions of the vector spaces in the original VAs. So, given two VAs we can sum them to get a third.

Given a VA for an NDA it may or may not be equivalent to the sum of two simpler VAs. If it isn't, it's said to be irreducible. If it is, then we can ask the same question about the simpler VAs. In this way, every VA is the sum of irreducible VAs.

Main Theorem


Now comes the main result I want to give:

If the graph underlying the NDA is one of the following list then (and only then) there is a finite number of inequivalent irreducible VAs for the NDA. All other VAs for that NDA are simply finite sums of machines from this finite set. Here's the list:

That's it! Weird huh? (Note that diagram Xn has n nodes.)

Strangely, those same diagrams (and a few more) appear (in a quite different way) when classifying the possible symmetries of fundamental particles. The symmetries are given the same names as these diagrams. And in just the same way we get those 'sporadic' symmetries leading up to E8. Those same diagrams also arise in Catastrophe Theory.

I'd like to sketch the proof, at least in one direction, but I seem to have run out of time. Another day perhaps. But note that none of these diagrams have loops for the reasons I gave above, and I've already shown that A2 gives only a finite number of choices for a certain choice of dimension.

Meanwhile, I should give the proper mathematician names for the things above. The diagrams listed above are examples of Dynkin diagrams. A non-deterministic automaton as described above is known as a quiver. A vector automaton is normally known as a representation of a quiver. And the theorem above is half of Gabriel's theorem.

Sunday, June 07, 2009

Hashing Molecules

Twenty or so years ago I worked for a pharmaceutical company that had a large database of compounds. That got me thinking about the problem of how to perform lookups based on molecular structures. If you can find a bunch of numbers that encapsulate the molecular structure then you can use them as database keys. But you need to ensure that the same molecule entered in two different ways gets mapped to the same numbers, and you'd like different molecules, such as stereoisomers, or even enantiomers to get mapped to different values.

That got me thinking and around then I came up with an idea for hashing molecules inspired by some of the mathematics I'd been doing not long before. I never got around to coding it up but twenty years later it dawned on me that I could easily do it using a similar technique to what I used for untangling tangles and translating trace diagrams. Anyway, as I've already given examples of how to translate diagrams to monadic expressions I'm going to skimp on the details in this post and just talk about things specific to molecular structures. For this post to make sense you have to understand those earlier posts.

So first comes the usual Haskell preamble:


> {-# LANGUAGE MultiParamTypeClasses,FlexibleInstances,FunctionalDependencies,GeneralizedNewtypeDeriving #-}

> module Main where

> import Data.HashTable
> import Data.List as L
> import Data.Int
> import Data.Array
> import GHC.Arr
> import qualified Data.Map as M
> import Control.Monad
> infixl 5 .+
> infixl 6 .*


I'm going to be using the vector space monad with a 4-dimensional vector space. This type labels the dimensions:


> data I = A | B | C | D deriving (Eq,Ord,Show,Ix,Enum)


n-valent atoms will be represented by functions that consume n-tuples. We'll start with a simple hash:


> c' (a,b,c,d) = hashString ("C" ++ show (a,b,c,d))


My prime motivation for using Haskell for this problem is that the code was super-easy to write and investigate. But it's inefficient. I'll talk about how to remedy this properly at the end. But for now I'm going to memoise many functions as arrays using a Memoisable type class with a memo method. So I'll be using c instead of c':

> c = memo $ \x -> symmetrise a4 c' x .* return ()

Note the use of the symmetrise function. The idea is that the 4 bonds coming out from (singly-bonded) carbon can be thought of as lying at the corners of a tetrahedron.

They have tetrahedral symmetry. So I'd like my hash to also have this symmetry so that, for example, c (i,j,k,l) == c (j,i,l,k). We can enforce this by summing over all 24 permutations of the arguments compatible with this symmetry, also known as A4. So a4 lists all of these permutations:


> a4 (i,j,k,l) = [
> (i,j,k,l),
> (j,i,l,k),
> (k,l,i,j),
> (l,k,j,i),

> (i,k,l,j),
> (i,l,j,k),

> (k,j,l,i),
> (l,j,i,k),

> (j,l,k,i),
> (l,i,k,j),

> (j,k,i,l),
> (k,i,j,l)
> ]


And symmetrise performs the summation:


> symmetrise group f x = sum (map f (group x))


We can define other molecules too. Hydrogen is easy:


> h = memo $ \a -> hashString ("H" ++ show a) .* return ()


Oxygen has S2 symmetry:


> s2 (i,j) = [ (i,j), (j,i) ]
> o' (a,b) = hashString ("O" ++ show (a,b))
> o = memo $ \x -> symmetrise s2 o' x .* return ()


You can think of a carbon atom and a hydrogen atom, say, as a pair of arrays cijkl and hi, and bonding them together as summation over a shared index. So methane would be the sum over i,j,k,l = A..D of cijklhihjhkhl. To this end, define a bond as:


> bond :: V Int32 I
> bond = return A .+ return B .+ return C .+ return D


We can make H2 like so:


> h2 = simp $ do
> i <- bond
> h ! i
> h ! i


i <- bond makes i a bond which we then attach to two hydrogen atoms. Evaluating h2 will give us the hash for hydrogen gas.

Rather than dive straight into CH4 we can construct some useful building blocks. Hydrogen with a bond already attached:


> h_ :: V Int32 I
> h_ = do
> m <- bond
> h ! m
> return m


I'm using a trailing underscore _ to indicate a free bond.

ch2_ accepts one bond and returns another. Once memoised it is, in effect, just a 4x4 matrix and can be used to rapidly build chains.


> ch2_ = memo $ \i -> simp $ do
> k <- bond
> l <- bond
> m <- bond
> h ! l
> h ! m
> c ! (i,k,l,m)
> return k


So here's code to make an alkyl chain with a free bind at the end.


> alkyl_ 0 = h_

> alkyl_ n = simp $ do
> i <- alkyl_ (n-1)
> ch2_ ! i


We can now make alkanes by attaching a hydrogen atom at the end and make methane as a special case:


> alkane n = simp $ alkyl_ n >>= (h ! )
> methane = alkane 1


Now a hydroxyl group:


> oh = memo $ \i -> simp $ do
> j <- bond
> h ! j
> o ! (i,j)


And you can have a drink on me:


> ethanol = simp $ do
> i <- alkyl_ 2
> oh ! i


Carbon double bonds turn out to be straightforward. We can simply use a pair of bonds:


> doubleBond :: V Int32 (I,I)
> doubleBond = simp $ do
> i <- bond
> j <- bond
> return (i,j)


Carbon-carbon double bonds tend not to twist and this is reflected in the hash. We'd have to apply symmetrise if we wanted twistable bonds.

We can make a pre-canned doubly bonded carbon atom pair:



> c_c = memo $ \(i,j,k,l) -> simp $ do
> (m,n) <- doubleBond
> c ! (i,j,m,n)
> c ! (m,n,k,l)


So we can build ethene like this


> ethene = simp $ do
> i <- h_
> j <- h_
> k <- h_
> l <- h_
> c_c ! (i,j,k,l)


Here's a methyl group with a free bond


> ch3_ = simp $ do
> l <- h_
> ch2_ ! l


So we can build a bunch more compounds


> propene = simp $ do
> j <- ch3_
> i <- h_
> k <- h_
> l <- h_
> c_c ! (i,j,k,l)

> cisbut2ene = simp $ do
> i <- ch3_
> j <- h_
> k <- ch3_
> l <- h_
> c_c ! (i,j,k,l)

> transbut2ene = simp $ do
> i <- h_
> j <- ch3_
> k <- ch3_
> l <- h_
> c_c ! (i,j,k,l)

> cisbut2ene' = simp $ do
> i <- h_
> j <- ch3_
> k <- h_
> l <- ch3_
> c_c ! (i,j,k,l)

> _2methylpropene = simp $ do
> i <- ch3_
> j <- ch3_
> k <- h_
> l <- h_
> c_c ! (i,j,k,l)

> _2methylpropene' = simp $ do
> i <- h_
> j <- h_
> k <- ch3_
> l <- ch3_
> c_c ! (i,j,k,l)


An interesting problem is building a benzene ring. Here's a first attempt with six free bonds:


> ring1 (p,q,r,s,t,u) = simp $ do
> i <- bond
> j <- bond
> k <- bond
> c_c ! (j,q,i,p)
> c_c ! (i,u,k,t)
> c_c ! (k,s,j,r)


The problem with that is that benzene rings are special. The electrons are 'delocalised' so that the ring has rotational symmetry. We need to sum over the two consistent ways to assign single and double bonds around the ring. For the more general case of interlocking benzene rings we must still sum over all consistent assignments.




> ring = memo $ \(p,q,r,s,t,u) -> simp $ ring1 (p,q,r,s,t,u) .+ ring1 (q,r,s,t,u,p)


And some more compunds:


> phenyl = memo $ \p -> simp $ do
> q <- h_
> r <- h_
> s <- h_
> t <- h_
> u <- h_
> ring ! (p,q,r,s,t,u)

> benzene :: V Int32 ()
> benzene = simp $ do
> i <- bond
> phenyl ! i
> h ! i

> phenol :: V Int32 ()
> phenol = simp $ do
> i <- bond
> phenyl ! i
> oh ! i

> toluene :: V Int32 ()
> toluene = simp $ do
> i <- ch3_
> phenyl ! i

> toluene' :: V Int32 ()
> toluene' = simp $ do
> p <- h_
> q <- h_
> r <- h_
> s <- ch3_
> t <- h_
> u <- h_
> ring ! (p,q,r,s,t,u)

> toluene'' :: V Int32 ()
> toluene'' = simp $ do
> p <- h_
> q <- h_
> r <- ch3_
> s <- h_
> t <- h_
> u <- h_
> ring ! (p,q,r,s,t,u)

> _1_2_dimethylbenzene = simp $ do
> p <- ch3_
> q <- ch3_
> r <- h_
> s <- h_
> t <- h_
> u <- h_
> ring ! (p,q,r,s,t,u)

> _1_3_dimethylbenzene = simp $ do
> p <- ch3_
> q <- h_
> r <- ch3_
> s <- h_
> t <- h_
> u <- h_
> ring ! (p,q,r,s,t,u)

> _1_4_dimethylbenzene = simp $ do
> p <- ch3_
> q <- h_
> r <- h_
> s <- ch3_
> t <- h_
> u <- h_
> ring ! (p,q,r,s,t,u)

> _1_5_dimethylbenzene = simp $ do
> p <- ch3_
> q <- h_
> r <- h_
> s <- h_
> t <- ch3_
> u <- h_
> ring ! (p,q,r,s,t,u)


As we might hope, the three different ways to define toluene give the same result. We also discover that 1,3- and 1,5-dimethylbenzene are the same compound (or at least probably are, this isn't a perfect hash).


> main = do
> print $ "toluene = " ++ show toluene
> print $ "toluene' = " ++ show toluene'
> print $ "toluene'' = " ++ show toluene''
> print $ "_1_2_dimethylbenzene = " ++ show _1_2_dimethylbenzene
> print $ "_1_3_dimethylbenzene = " ++ show _1_3_dimethylbenzene
> print $ "_1_4_dimethylbenzene = " ++ show _1_4_dimethylbenzene
> print $ "_1_5_dimethylbenzene = " ++ show _1_5_dimethylbenzene


Now I need to say something about performance. The above code is naive and performs many unnecessary summations. For example, hashing a long chain should only take time linear in its length. But using the above code indiscriminately could give you exponential time. A good implementation might take a divide and conquer approach: slice the molecule in half through a bunch of bonds, compute partial hashes for each half and then sew the halves together in time exponential in the number of bonds you sliced through. For the types of molecules I've seen in real pharmaceutical databases (say) this is actually pretty cheap if you're smart about the slicing. The hashes in the above code could probably be computed many thousands of times faster. As it is, you'll probably need to compile the above with optimisation.

I'm willing to bet that with small changes, and with suitable choice of matrices over the reals, we can get invariants of molecules that predict physical properties. These calulations are reminiscent of algorithms for various types of counting algorithm so at the very least they probably compute things that are meaningful from a statistical mechanics perspective.

Incidentally, this approach to stitching together atoms was inspired by an old paper by R. C. Penner on fatgraphs - nothing to do with chemistry. A few days ago he put a paper online about an application to modelling proteins.

Update: Since writing this I've found a name for what I'm doing. I'm converting a chemical structure diagram into a tensor network. There seems to be lots of literature on how to evaluate these efficiently. In effect, everything I've been doing in this blog in terms of converting diagrams to code is an example of evaluating a tensor network.




Now comes the memoisation class:


> class Memoisable ix where
> memo :: (ix -> a) -> Array ix a

> instance Memoisable I where
> memo f = array (A,D) [(i,f i) | i <- [A .. D]]

> instance Memoisable (I,I) where
> memo f = array ((A,A),(D,D)) [(i,f i) |
> p <- [A .. D],
> q <- [A .. D],
> let i = (p,q) ]

> instance Memoisable (I,I,I,I) where
> memo f = array ((A,A,A,A),(D,D,D,D)) [(i,f i) |
> p <- [A .. D],
> q <- [A .. D],
> r <- [A .. D],
> s <- [A .. D],
> let i = (p,q,r,s) ]

> instance Memoisable (I,I,I,I,I,I) where
> memo f = array ((A,A,A,A,A,A),(D,D,D,D,D,D)) [(i,f i) |
> p <- [A .. D],
> q <- [A .. D],
> r <- [A .. D],
> s <- [A .. D],
> t <- [A .. D],
> u <- [A .. D],
> let i = (p,q,r,s,t,u) ]


Missing instances from Data.Array:


> instance (Ix a1, Ix a2, Ix a3, Ix a4, Ix a5, Ix a6) => Ix (a1,a2,a3,a4,a5,a6) where
> range ((l1,l2,l3,l4,l5,l6),(u1,u2,u3,u4,u5,u6)) =
> [(i1,i2,i3,i4,i5,i6) | i1 <- range (l1,u1),
> i2 <- range (l2,u2),
> i3 <- range (l3,u3),
> i4 <- range (l4,u4),
> i5 <- range (l5,u5),
> i6 <- range (l6,u6)]

> unsafeIndex ((l1,l2,l3,l4,l5,l6),(u1,u2,u3,u4,u5,u6)) (i1,i2,i3,i4,i5,i6) =
> unsafeIndex (l6,u6) i6 + unsafeRangeSize (l6,u6) * (
> unsafeIndex (l5,u5) i5 + unsafeRangeSize (l5,u5) * (
> unsafeIndex (l4,u4) i4 + unsafeRangeSize (l4,u4) * (
> unsafeIndex (l3,u3) i3 + unsafeRangeSize (l3,u3) * (
> unsafeIndex (l2,u2) i2 + unsafeRangeSize (l2,u2) * (
> unsafeIndex (l1,u1) i1)))))

> inRange ((l1,l2,l3,l4,l5,l6),(u1,u2,u3,u4,u5,u6)) (i1,i2,i3,i4,i5,i6) =
> inRange (l1,u1) i1 && inRange (l2,u2) i2 &&
> inRange (l3,u3) i3 && inRange (l4,u4) i4 &&
> inRange (l5,u5) i5 && inRange (l6,u6) i6


And the same vector space monad I've used many times before. Strictly speaking, it's more like a semiring module monad as Int32 isn't a field.


> swap (x,y) = (y,x)

> class Num k => VectorSpace k v | v -> k where
> zero :: v
> (.+) :: v -> v -> v
> (.*) :: k -> v -> v
> (.-) :: v -> v -> v
> v1 .- v2 = v1 .+ ((-1).*v2)

> data V k a = V { unV :: [(k,a)] } deriving (Show)

> reduce x = filter ((/=0) . fst) $ fmap swap $ M.toList $ M.fromListWith (+) $ fmap swap $ x
> simp (V x) = V (reduce x)

> instance (Ord a,Num k) => Eq (V k a) where
> V x==V y = reduce x==reduce y

> instance (Ord a,Num k,Ord k) => Ord (V k a) where
> compare (V x) (V y) = compare (reduce x) (reduce y)

> instance Num k => Functor (V k) where
> fmap f (V as) = V $ map (\(k,a) -> (k,f a)) as

> instance Num k => Monad (V k) where
> return a = V [(1,a)]
> x >>= f = join (fmap f x)
> where join x = V $ concat $ fmap (uncurry scale) $ unV $ fmap unV x
> scale k1 as = map (\(k2,a) -> (k1*k2,a)) as

> instance Num r => MonadPlus (V r) where
> mzero = V []
> mplus (V x) (V y) = V (x++y)

> instance (Num k,Ord a) => VectorSpace k (V k a) where
> zero = V []
> V x .+ V y = V (x ++ y)
> (.*) k = (>>= (\a -> V [(k,a)]))

> e = return :: Num k => a -> V k a
> coefficient b (V bs) = maybe 0 id (L.lookup b (map swap (reduce bs)))

Blog Archive