Saturday, June 20, 2009

Automata and the A-D-E classification.

Introduction


The A-D-E classification is a strange ubiquitous pattern that appears in many branches of mathematics. Typically it appears when you try to classify certain types of mathematical construction. If the A-D-E classification applies then you end up with two infinite sequences of cryptically named objects (A1,A2,A3,...) and (D1,D2,D3,...) as well as three leftover objects called E6, E7 and E8. Unfortunately, most of these objects and their classifications are tricky to define using only elementary mathematics. However, there is one type of object that is classified in this way that can be given a relatively straightforward computational description involving a little linear algebra and assuming you know a tiny bit about automata.

But first: why care about the A-D-E classification? Well, I tried to say a little bit about how symmetries relate to nature a while back. Certain types of possible symmetry in particle physics can be classified in the A-D-E way. The symmetry group corresponding to E8 is the now famous exceptional group E8. I won't be able to get to an explanation of how groups are involved, but at least I'll be able to give a hint of the bigger picture that E8 is part of.

Non-deterministic Finite State Automata


Here's a diagram representing a very simple non-deterministic finite state automaton (NDA):

It can be in one of two states. When in state A it can transition to state B and in state B it can only transition back to state B, but it can do so in two different ways.

Vector Automata


Now I'll introduce a more general kind of automaton: a vector automaton (VA). (I made that term up; it's not meant to correspond to anyone else's terminology.) Every vector automaton is built from an NDA, but each state corresponds to a finite dimensional vector space and each transition corresponds to a linear function mapping from the vector space of the source state to the vector space of the destination state. We could turn the above example into a VA by assigning a 1D vector space V_A to A and a 2D vector space V_B to B, and defining linear functions:

f : (x) -> (x,0)
g : (x,y) -> (-y,x)
h : (x,y) -> (y,-x)

A VA is just like an NDA in that it transitions from state to state according to the given transitions. But additionally it keeps track of a vector in the vector space corresponding to the current state. Each time it makes a transition the linear function corresponding to that transition is applied to the vector. So in the example above, the VA might start in state A with a scalar value x (i.e. a 1D vector). When it makes its first transition its vector becomes the 2D vector (x,0), and after that each transition rotates the vector through 90 degrees clockwise or anticlockwise.
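This tiny example is easy to simulate directly. Here's a minimal standalone sketch (my own illustration, not part of any library) of the example VA, with a state-tagged vector and the three transitions:

```haskell
-- A state of the example VA: state A carries a 1D vector (a scalar),
-- state B carries a 2D vector.
data StateVec = SA Double | SB (Double, Double) deriving (Eq, Show)

-- The three transitions as functions on states. Applying a transition
-- whose source doesn't match the current state leaves it unchanged.
f, g, h :: StateVec -> StateVec
f (SA x)      = SB (x, 0)   -- A -> B: embed the scalar
f s           = s
g (SB (x, y)) = SB (-y, x)  -- B -> B: rotate 90 degrees anticlockwise
g s           = s
h (SB (x, y)) = SB (y, -x)  -- B -> B: rotate 90 degrees clockwise
h s           = s

-- Start in state A with scalar 2, then run f, g, h in turn.
run :: StateVec
run = h (g (f (SA 2)))  -- the two opposite rotations cancel, leaving (2, 0)
```

Running f, then g, then h from state A with scalar 2 lands in state B with vector (2, 0): the anticlockwise and clockwise rotations cancel.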

There's a lot of freedom in defining a VA given its underlying NDA. For each node you can pick any vector space you like of any finite dimension, and for each transition you can pick any linear function you like mapping between the source and target vector spaces.

Let's make this a little more formal. An NDA is a finite set of states combined with a finite set of transitions. Each transition has a source and destination, each of which is a state. That's it. You're allowed any finite number of transitions between states and a transition can have the same state as source and destination.

A VA is an NDA combined with a finite dimensional vector space attached to each state and a linear function for each transition, such that the function maps from the vector space of the source to the vector space of the target.

Now consider this really simple NDA:

When we build a VA from this NDA, for convenience I'll reuse the names A and B for the vector spaces attached to states A and B, and I'll call the linear function from A to B f. How many VAs can be made from this NDA? Clearly an infinite number. But a lot of them are very similar.

Suppose we assume A and B are 2-dimensional and write their elements as pairs (x,y). Suppose f(1,0)=u and f(0,1)=v and that u is not a multiple of v and both are non-zero. Then we can use u and v as a basis for B. If we write f as f' in this new basis we get f'(1,0)=(1,0) and f'(0,1)=(0,1). So by relabelling the basis of B we have actually revealed that an infinite number of choices for f reduce to the same thing apart from a change of basis in B.

Up to change of basis in A and B we find there are only three possibilities, one for each possible rank of f:
(1) u and v are non-zero and not multiples of each other (rank 2)
(2) u is non-zero but v is zero (rank 1; any other rank 1 case, such as v a non-zero multiple of u, reduces to this by a change of basis in A)
(3) both u and v are zero (rank 0)

We started with an infinite number of possibilities for our particular choice of dimensions and ended up with just 3. We'll say that two VAs are equivalent if we can get one from the other by changing basis like this.
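In other words, the equivalence class is completely determined by the rank of f. Here's a quick sketch (my own illustration) that classifies a choice of u and v into the three cases:

```haskell
-- Classify f : A -> B (both 2D) up to change of basis by the rank of f,
-- where u = f(1,0) and v = f(0,1).
rankOfF :: (Double, Double) -> (Double, Double) -> Int
rankOfF u@(u1, u2) v@(v1, v2)
  | u1 * v2 - u2 * v1 /= 0     = 2  -- u, v independent: case (1)
  | u /= (0, 0) || v /= (0, 0) = 1  -- case (2), up to basis change
  | otherwise                  = 0  -- case (3)
```

For example, rankOfF (1,2) (2,4) is 1: the two images are proportional, so this f is equivalent to case (2).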

On the other hand, consider this NDA:

Let's choose A, B and C to be 1-dimensional vector spaces and define f, g and h as:

f(x) = x
g(x) = x
h(x) = λx

where λ is any real number. Then h(g(f(x)))=λx. We can choose any λ we like so we have an infinity of possibilities. No amount of basis change is going to change this fact. This is different from the case above because now we're comparing x and h(g(f(x))) which lie in the same vector space. So when our NDAs have loops in them, the space of possible VAs, for most choices of dimension, is infinite.
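To see why basis changes can't help here, note that a change of basis in a 1D space is just multiplication by a non-zero scalar, and rescaling f, g and h this way leaves the composite around the loop unchanged. A quick sketch (my own illustration, assuming the loop is f : A -> B, g : B -> C, h : C -> A):

```haskell
-- 1D linear maps are just scalars; the loop composite h . g . f is their product.
loopValue :: (Double, Double, Double) -> Double
loopValue (f, g, h) = h * g * f

-- Change basis at states A, B, C by non-zero scalars a, b, c:
-- f becomes b*f/a, g becomes c*g/b, h becomes a*h/c.
rebase :: (Double, Double, Double) -> (Double, Double, Double) -> (Double, Double, Double)
rebase (a, b, c) (f, g, h) = (b * f / a, c * g / b, a * h / c)
```

For instance, loopValue (rebase (2, 4, 8) (1, 1, 4)) is still 4: λ is an invariant, so each distinct λ gives an inequivalent VA.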

The sum of two VAs


If we have two VAs corresponding to the same NDA we can combine them into a single machine. The state is given by a state in the shared underlying NDA, but we now have a pair of vectors. Each time there is a transition we apply the pair of linear functions to transform the two vectors simultaneously. But we can encode a pair of vectors as a single vector simply by concatenating their components in some basis. So this machine is just another VA. The dimension of the vector space for each state is simply the sum of the dimensions of the vector spaces in the original VAs. So, given two VAs we can sum them to get a third.
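Concretely, the sum acts on concatenated vectors by applying each original map to its own half. As a sketch (my own illustration, using plain lists as vectors):

```haskell
-- Direct sum of two linear maps, with vectors represented as lists.
-- The first map acts on the first n components, the second on the rest.
directSum :: Int -> ([Double] -> [Double]) -> ([Double] -> [Double])
          -> [Double] -> [Double]
directSum n f g v = f (take n v) ++ g (drop n v)
```

For example, directSum 2 (map (*2)) (map (*3)) applied to [1,1,1] doubles the first two components and triples the third, giving [2,2,3].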

Given a VA for an NDA it may or may not be equivalent to the sum of two simpler VAs. If it isn't, it's said to be irreducible. If it is, then we can ask the same question about the simpler VAs. In this way, every VA is the sum of irreducible VAs.

Main Theorem


Now comes the main result I want to give:

If the graph underlying the NDA is one of the following list then (and only then) there is a finite number of inequivalent irreducible VAs for the NDA. All other VAs for that NDA are simply finite sums of machines from this finite set. Here's the list:

That's it! Weird huh? (Note that diagram Xn has n nodes: An is a simple chain of n nodes, Dn is a chain with a two-node fork at one end, and E6, E7 and E8 are chains with one extra node attached to the third node from one end.)

Strangely, those same diagrams (and a few more) appear (in a quite different way) when classifying the possible symmetries of fundamental particles. The symmetries are given the same names as these diagrams. And in just the same way we get those 'sporadic' symmetries leading up to E8. Those same diagrams also arise in Catastrophe Theory.

I'd like to sketch the proof, at least in one direction, but I seem to have run out of time. Another day perhaps. But note that none of these diagrams have loops for the reasons I gave above, and I've already shown that A2 gives only a finite number of choices for a certain choice of dimension.

Meanwhile, I should give the proper mathematician names for the things above. The diagrams listed above are examples of Dynkin diagrams. A non-deterministic automaton as described above is known as a quiver. A vector automaton is normally known as a representation of a quiver. And the theorem above is half of Gabriel's theorem.

Sunday, June 07, 2009

Hashing Molecules

Twenty or so years ago I worked for a pharmaceutical company that had a large database of compounds. That got me thinking about the problem of how to perform lookups based on molecular structures. If you can find a bunch of numbers that encapsulate the molecular structure then you can use them as database keys. But you need to ensure that the same molecule entered in two different ways gets mapped to the same numbers, and you'd like different molecules, including stereoisomers and even enantiomers, to map to different values.

That got me thinking and around then I came up with an idea for hashing molecules inspired by some of the mathematics I'd been doing not long before. I never got around to coding it up but twenty years later it dawned on me that I could easily do it using a similar technique to what I used for untangling tangles and translating trace diagrams. Anyway, as I've already given examples of how to translate diagrams to monadic expressions I'm going to skimp on the details in this post and just talk about things specific to molecular structures. For this post to make sense you have to understand those earlier posts.

So first comes the usual Haskell preamble:


> {-# LANGUAGE MultiParamTypeClasses,FlexibleInstances,FunctionalDependencies,GeneralizedNewtypeDeriving #-}

> module Main where

> import Data.HashTable
> import Data.List as L
> import Data.Int
> import Data.Array
> import GHC.Arr
> import qualified Data.Map as M
> import Control.Monad
> infixl 5 .+
> infixl 6 .*


I'm going to be using the vector space monad with a 4-dimensional vector space. This type labels the dimensions:


> data I = A | B | C | D deriving (Eq,Ord,Show,Ix,Enum)


n-valent atoms will be represented by functions that consume n-tuples. We'll start with a simple hash:


> c' (a,b,c,d) = hashString ("C" ++ show (a,b,c,d))


My prime motivation for using Haskell for this problem is that the code was super-easy to write and investigate. But it's inefficient. I'll talk about how to remedy this properly at the end. But for now I'm going to memoise many functions as arrays using a Memoisable type class with a memo method. So I'll be using c instead of c':

> c = memo $ \x -> symmetrise a4 c' x .* return ()

Note the use of the symmetrise function. The idea is that the 4 bonds coming out from (singly-bonded) carbon can be thought of as lying at the corners of a tetrahedron.

They have tetrahedral symmetry. So I'd like my hash to also have this symmetry so that, for example, c (i,j,k,l) == c (j,i,l,k). We can enforce this by summing over all 12 permutations of the arguments compatible with this symmetry; these form the alternating group A4, the rotation group of the tetrahedron. So a4 lists all of these permutations:


> a4 (i,j,k,l) = [
> (i,j,k,l),
> (j,i,l,k),
> (k,l,i,j),
> (l,k,j,i),

> (i,k,l,j),
> (i,l,j,k),

> (k,j,l,i),
> (l,j,i,k),

> (j,l,k,i),
> (l,i,k,j),

> (j,k,i,l),
> (k,i,j,l)
> ]


And symmetrise performs the summation:


> symmetrise group f x = sum (map f (group x))


We can define other molecules too. Hydrogen is easy:


> h = memo $ \a -> hashString ("H" ++ show a) .* return ()


Oxygen has S2 symmetry:


> s2 (i,j) = [ (i,j), (j,i) ]
> o' (a,b) = hashString ("O" ++ show (a,b))
> o = memo $ \x -> symmetrise s2 o' x .* return ()


You can think of a carbon atom and a hydrogen atom, say, as a pair of arrays c_ijkl and h_i, and bonding them together as summation over a shared index. So methane would be the sum over i,j,k,l = A..D of c_ijkl h_i h_j h_k h_l. To this end, define a bond as:


> bond :: V Int32 I
> bond = return A .+ return B .+ return C .+ return D


We can make H2 like so:


> h2 = simp $ do
> i <- bond
> h ! i
> h ! i


i <- bond makes i a bond which we then attach to two hydrogen atoms. Evaluating h2 will give us the hash for hydrogen gas.

Rather than dive straight into CH4 we can construct some useful building blocks. Hydrogen with a bond already attached:


> h_ :: V Int32 I
> h_ = do
> m <- bond
> h ! m
> return m


I'm using a trailing underscore _ to indicate a free bond.

ch2_ accepts one bond and returns another. Once memoised it is, in effect, just a 4x4 matrix and can be used to rapidly build chains.


> ch2_ = memo $ \i -> simp $ do
> k <- bond
> l <- bond
> m <- bond
> h ! l
> h ! m
> c ! (i,k,l,m)
> return k


So here's code to make an alkyl chain with a free bond at the end.


> alkyl_ 0 = h_

> alkyl_ n = simp $ do
> i <- alkyl_ (n-1)
> ch2_ ! i


We can now make alkanes by attaching a hydrogen atom at the end and make methane as a special case:


> alkane n = simp $ alkyl_ n >>= (h ! )
> methane = alkane 1


Now a hydroxyl group:


> oh = memo $ \i -> simp $ do
> j <- bond
> h ! j
> o ! (i,j)


And you can have a drink on me:


> ethanol = simp $ do
> i <- alkyl_ 2
> oh ! i


Carbon double bonds turn out to be straightforward. We can simply use a pair of bonds:


> doubleBond :: V Int32 (I,I)
> doubleBond = simp $ do
> i <- bond
> j <- bond
> return (i,j)


Carbon-carbon double bonds tend not to twist and this is reflected in the hash. We'd have to apply symmetrise if we wanted twistable bonds.

We can make a pre-canned doubly bonded carbon atom pair:



> c_c = memo $ \(i,j,k,l) -> simp $ do
> (m,n) <- doubleBond
> c ! (i,j,m,n)
> c ! (m,n,k,l)


So we can build ethene like this


> ethene = simp $ do
> i <- h_
> j <- h_
> k <- h_
> l <- h_
> c_c ! (i,j,k,l)


Here's a methyl group with a free bond


> ch3_ = simp $ do
> l <- h_
> ch2_ ! l


So we can build a bunch more compounds


> propene = simp $ do
> j <- ch3_
> i <- h_
> k <- h_
> l <- h_
> c_c ! (i,j,k,l)

> cisbut2ene = simp $ do
> i <- ch3_
> j <- h_
> k <- ch3_
> l <- h_
> c_c ! (i,j,k,l)

> transbut2ene = simp $ do
> i <- h_
> j <- ch3_
> k <- ch3_
> l <- h_
> c_c ! (i,j,k,l)

> cisbut2ene' = simp $ do
> i <- h_
> j <- ch3_
> k <- h_
> l <- ch3_
> c_c ! (i,j,k,l)

> _2methylpropene = simp $ do
> i <- ch3_
> j <- ch3_
> k <- h_
> l <- h_
> c_c ! (i,j,k,l)

> _2methylpropene' = simp $ do
> i <- h_
> j <- h_
> k <- ch3_
> l <- ch3_
> c_c ! (i,j,k,l)


An interesting problem is building a benzene ring. Here's a first attempt with six free bonds:


> ring1 (p,q,r,s,t,u) = simp $ do
> i <- bond
> j <- bond
> k <- bond
> c_c ! (j,q,i,p)
> c_c ! (i,u,k,t)
> c_c ! (k,s,j,r)


The problem with that is that benzene rings are special. The electrons are 'delocalised' so that the ring has rotational symmetry. We need to sum over the two consistent ways to assign single and double bonds around the ring. For the more general case of interlocking benzene rings we must still sum over all consistent assignments.




> ring = memo $ \(p,q,r,s,t,u) -> simp $ ring1 (p,q,r,s,t,u) .+ ring1 (q,r,s,t,u,p)


And some more compounds:


> phenyl = memo $ \p -> simp $ do
> q <- h_
> r <- h_
> s <- h_
> t <- h_
> u <- h_
> ring ! (p,q,r,s,t,u)

> benzene :: V Int32 ()
> benzene = simp $ do
> i <- bond
> phenyl ! i
> h ! i

> phenol :: V Int32 ()
> phenol = simp $ do
> i <- bond
> phenyl ! i
> oh ! i

> toluene :: V Int32 ()
> toluene = simp $ do
> i <- ch3_
> phenyl ! i

> toluene' :: V Int32 ()
> toluene' = simp $ do
> p <- h_
> q <- h_
> r <- h_
> s <- ch3_
> t <- h_
> u <- h_
> ring ! (p,q,r,s,t,u)

> toluene'' :: V Int32 ()
> toluene'' = simp $ do
> p <- h_
> q <- h_
> r <- ch3_
> s <- h_
> t <- h_
> u <- h_
> ring ! (p,q,r,s,t,u)

> _1_2_dimethylbenzene = simp $ do
> p <- ch3_
> q <- ch3_
> r <- h_
> s <- h_
> t <- h_
> u <- h_
> ring ! (p,q,r,s,t,u)

> _1_3_dimethylbenzene = simp $ do
> p <- ch3_
> q <- h_
> r <- ch3_
> s <- h_
> t <- h_
> u <- h_
> ring ! (p,q,r,s,t,u)

> _1_4_dimethylbenzene = simp $ do
> p <- ch3_
> q <- h_
> r <- h_
> s <- ch3_
> t <- h_
> u <- h_
> ring ! (p,q,r,s,t,u)

> _1_5_dimethylbenzene = simp $ do
> p <- ch3_
> q <- h_
> r <- h_
> s <- h_
> t <- ch3_
> u <- h_
> ring ! (p,q,r,s,t,u)


As we might hope, the three different ways to define toluene give the same result. We also discover that 1,3- and 1,5-dimethylbenzene are the same compound (or at least probably are, this isn't a perfect hash).


> main = do
> print $ "toluene = " ++ show toluene
> print $ "toluene' = " ++ show toluene'
> print $ "toluene'' = " ++ show toluene''
> print $ "_1_2_dimethylbenzene = " ++ show _1_2_dimethylbenzene
> print $ "_1_3_dimethylbenzene = " ++ show _1_3_dimethylbenzene
> print $ "_1_4_dimethylbenzene = " ++ show _1_4_dimethylbenzene
> print $ "_1_5_dimethylbenzene = " ++ show _1_5_dimethylbenzene


Now I need to say something about performance. The above code is naive and performs many unnecessary summations. For example, hashing a long chain should only take time linear in its length. But using the above code indiscriminately could give you exponential time. A good implementation might take a divide and conquer approach: slice the molecule in half through a bunch of bonds, compute partial hashes for each half and then sew the halves together in time exponential in the number of bonds you sliced through. For the types of molecules I've seen in real pharmaceutical databases (say) this is actually pretty cheap if you're smart about the slicing. The hashes in the above code could probably be computed many thousands of times faster. As it is, you'll probably need to compile the above with optimisation.
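For instance, once ch2_ has been memoised into, in effect, a 4x4 matrix, hashing an n-carbon chain is just n matrix-vector products, which is manifestly linear time. Here's a sketch of that idea (my own illustration, with a hypothetical matrix and end-group vector standing in for real hash values):

```haskell
type Vec = [Integer]
type Mat = [[Integer]]

-- matrix-vector product, contracting over the bond index
mulMV :: Mat -> Vec -> Vec
mulMV m v = [sum (zipWith (*) row v) | row <- m]

-- hash an alkyl chain: start from the end-group vector and apply the
-- (memoised) CH2 matrix once per carbon -- n products for n carbons
chainHash :: Mat -> Vec -> Int -> Vec
chainHash ch2 end n = iterate (mulMV ch2) end !! n
```

This is exactly what alkyl_ computes, except that the naive code rebuilds each intermediate sum rather than reusing the memoised matrix.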

I'm willing to bet that with small changes, and with a suitable choice of matrices over the reals, we can get invariants of molecules that predict physical properties. These calculations are reminiscent of various counting algorithms, so at the very least they probably compute things that are meaningful from a statistical mechanics perspective.

Incidentally, this approach to stitching together atoms was inspired by an old paper by R. C. Penner on fatgraphs - nothing to do with chemistry. A few days ago he put a paper online about an application to modelling proteins.

Update: Since writing this I've found a name for what I'm doing. I'm converting a chemical structure diagram into a tensor network. There seems to be lots of literature on how to evaluate these efficiently. In effect, everything I've been doing in this blog in terms of converting diagrams to code is an example of evaluating a tensor network.




Now comes the memoisation class:


> class Memoisable ix where
> memo :: (ix -> a) -> Array ix a

> instance Memoisable I where
> memo f = array (A,D) [(i,f i) | i <- [A .. D]]

> instance Memoisable (I,I) where
> memo f = array ((A,A),(D,D)) [(i,f i) |
> p <- [A .. D],
> q <- [A .. D],
> let i = (p,q) ]

> instance Memoisable (I,I,I,I) where
> memo f = array ((A,A,A,A),(D,D,D,D)) [(i,f i) |
> p <- [A .. D],
> q <- [A .. D],
> r <- [A .. D],
> s <- [A .. D],
> let i = (p,q,r,s) ]

> instance Memoisable (I,I,I,I,I,I) where
> memo f = array ((A,A,A,A,A,A),(D,D,D,D,D,D)) [(i,f i) |
> p <- [A .. D],
> q <- [A .. D],
> r <- [A .. D],
> s <- [A .. D],
> t <- [A .. D],
> u <- [A .. D],
> let i = (p,q,r,s,t,u) ]


Missing instances from Data.Array:


> instance (Ix a1, Ix a2, Ix a3, Ix a4, Ix a5, Ix a6) => Ix (a1,a2,a3,a4,a5,a6) where
> range ((l1,l2,l3,l4,l5,l6),(u1,u2,u3,u4,u5,u6)) =
> [(i1,i2,i3,i4,i5,i6) | i1 <- range (l1,u1),
> i2 <- range (l2,u2),
> i3 <- range (l3,u3),
> i4 <- range (l4,u4),
> i5 <- range (l5,u5),
> i6 <- range (l6,u6)]

> unsafeIndex ((l1,l2,l3,l4,l5,l6),(u1,u2,u3,u4,u5,u6)) (i1,i2,i3,i4,i5,i6) =
> unsafeIndex (l6,u6) i6 + unsafeRangeSize (l6,u6) * (
> unsafeIndex (l5,u5) i5 + unsafeRangeSize (l5,u5) * (
> unsafeIndex (l4,u4) i4 + unsafeRangeSize (l4,u4) * (
> unsafeIndex (l3,u3) i3 + unsafeRangeSize (l3,u3) * (
> unsafeIndex (l2,u2) i2 + unsafeRangeSize (l2,u2) * (
> unsafeIndex (l1,u1) i1)))))

> inRange ((l1,l2,l3,l4,l5,l6),(u1,u2,u3,u4,u5,u6)) (i1,i2,i3,i4,i5,i6) =
> inRange (l1,u1) i1 && inRange (l2,u2) i2 &&
> inRange (l3,u3) i3 && inRange (l4,u4) i4 &&
> inRange (l5,u5) i5 && inRange (l6,u6) i6


And the same vector space monad I've used many times before. Strictly speaking, it's more like a semiring module monad as Int32 isn't a field.


> swap (x,y) = (y,x)

> class Num k => VectorSpace k v | v -> k where
> zero :: v
> (.+) :: v -> v -> v
> (.*) :: k -> v -> v
> (.-) :: v -> v -> v
> v1 .- v2 = v1 .+ ((-1).*v2)

> data V k a = V { unV :: [(k,a)] } deriving (Show)

> reduce x = filter ((/=0) . fst) $ fmap swap $ M.toList $ M.fromListWith (+) $ fmap swap $ x
> simp (V x) = V (reduce x)

> instance (Ord a,Num k) => Eq (V k a) where
> V x==V y = reduce x==reduce y

> instance (Ord a,Num k,Ord k) => Ord (V k a) where
> compare (V x) (V y) = compare (reduce x) (reduce y)

> instance Num k => Functor (V k) where
> fmap f (V as) = V $ map (\(k,a) -> (k,f a)) as

> instance Num k => Monad (V k) where
> return a = V [(1,a)]
> x >>= f = join (fmap f x)
> where join x = V $ concat $ fmap (uncurry scale) $ unV $ fmap unV x
> scale k1 as = map (\(k2,a) -> (k1*k2,a)) as

> instance Num r => MonadPlus (V r) where
> mzero = V []
> mplus (V x) (V y) = V (x++y)

> instance (Num k,Ord a) => VectorSpace k (V k a) where
> zero = V []
> V x .+ V y = V (x ++ y)
> (.*) k = (>>= (\a -> V [(k,a)]))

> e = return :: Num k => a -> V k a
> coefficient b (V bs) = maybe 0 id (L.lookup b (map swap (reduce bs)))