Saturday, December 05, 2009

Where do monads come from?

Introduction
Monads play an important part in implementing computational "effects" in Haskell, and yet monads come out of pure mathematics where they are used to describe operations in algebraic structures. What do these two applications have to do with each other? In this post I hope to answer that question and maybe give another view on monads.


> {-# LANGUAGE GADTs #-}


Algebra for Flowcharts
You might not have seen one of these in a long time. A flowchart describing a subroutine:



For now I'm considering flowcharts only with nodes to set or reset just one flag, test the flag, and return values. I'm not allowing any paths to reconverge. So these flowcharts are trees describing the possible paths of control flow.

We can describe such flowcharts with a more compact notation. Use xn to mean:



Think of xn as a variable of unknown value.
Use the + operator to mean tests of the flag. So x+y represents:



Note that this '+' isn't commutative because the left branch will always mean 'no' and the right branch will mean 'yes'.

And we'll use the symbols '0' and '1', applied on the left, to mean these two nodes respectively:



The flowchart above can now be written compactly as 1x3+0(x4+x5). (Note that I'm overloading the notation for addition and multiplication, this isn't intended to be ordinary addition and multiplication of numbers.)

It's clear that the 'return 5' never gets executed. So the above flowchart is equivalent to this one:



The rule describing this simplification can be expressed in our compact notation as:

0(a+b) = 0a
1(a+b) = 1b


So we already have a strong clue that there is a connection between algebra and computation. Here are some more rules that you can check for yourself:

abc = bc (here a and b are each either 0 or 1)

a+(b+c) = a+c = (a+b)+c

a+a = a = 0a+1a
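These rules can be spot-checked by giving flowcharts a concrete meaning. Here's a standalone sketch (the names var, zero, one and plus are mine, not part of the literate code in this post) that interprets a flowchart as a state transformer Bool -> (Int, Bool) and confirms the rule 0(a+b) = 0a:

```haskell
-- Standalone sketch: interpret flowcharts as state transformers
-- Bool -> (Int, Bool) and check the rule 0(a+b) = 0a.
type Flow = Bool -> (Int, Bool)

var :: Int -> Flow            -- the "return n" box, written xn
var n s = (n, s)

zero, one :: Flow -> Flow     -- 0 and 1 applied on the left
zero f _ = f False
one  f _ = f True

plus :: Flow -> Flow -> Flow  -- a+b: left branch 'no', right branch 'yes'
plus no yes s = if s then yes s else no s

lhs, rhs :: Flow
lhs = zero (var 4 `plus` var 5)   -- 0(x4+x5)
rhs = zero (var 4)                -- 0x4
-- lhs and rhs agree whatever the initial flag value is
```

The other rules can be checked the same way by comparing both sides as functions on Bool.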


Substitution
Now suppose we wish to extend this subroutine so it does a bit more work. Instead of simply returning n we might want it to set a flag according to whether the previous result is even or odd, and subtract 1 if it is odd. It's a little awkward because I've insisted that these trees don't rejoin so we need to patch the tree in three places using the rule:



Note that the descriptions of the two cases and the n here aren't part of the flowchart, they're part of the language I'm using to talk about flowcharts.

We can do something similar for our algebraic expressions. We can define a mathematical function taking integers to algebraic expressions such that

f(n)
= 0xn if n is even
= 1xn-1 if n is odd (read the subscript as n-1: the variable here is xn-1)

and now define an operation * by

f*u = u with xn replaced by f(n) throughout


Our updated flowchart can now be represented by f*(1x3+0(x4+x5)). Now you might see why I used the weird xn notation. The * operation is ordinary substitution of an expression for a variable that we learn in high school, so I used xn notation to make it clear that xn is a variable. You can think of f as a dictionary saying how the values of the variables are to be expanded.

Monads
Let T(N) be the set of expressions in our algebra where the subscripts of the x's come from the set N. Then we have the functions:

η:N→T(N) defined by η(n)=xn
and
μ:T(T(N))→T(N) defined by μ(a)=i*a, where i is the identity.

We have a monad. (Exercise: check the monad laws.) Note that this definition isn't specific to our particular algebra; it applies to a very broad class of algebraic expressions. Monads in mathematics are used to describe substitution of subexpressions into expressions. But I hope you can now see why they have a role to play in computing: the act of substituting an expression for a variable is essentially the same thing as grafting more operations on to the end of our tree of computations.

But if you look at the definitions of typical monads in the Haskell libraries, they don't look much like trees, and the implementation of >>= doesn't look much like grafting trees. The rest of this post will be about working through a specific example to show that it really is about tree grafting after all.

A Type for Expression Trees
We'll need to define a type to represent our algebraic expressions in Haskell. The n is going to be used to label our variables, and s will be explained shortly:


> data Expr s n where


Now we need constructors to represent 0 and 1. We can simplify this by combining them into one constructor and using the type Bool to signify whether it is 0 or 1 that we want. But there's no need to restrict ourselves to Bool so I'll make the code more general and use s instead:


>   Put :: s -> Expr s n -> Expr s n


So 0 is False and 1 is True, and Put is our multiplication operator. (Maybe you've already guessed why I called it that.)

The next constructor will be for addition. I'm going to give it the weird name Get. The obvious thing to write would be:

Get :: Expr s n -> Expr s n -> Expr s n

We could uncurry that instead and use

Get :: (Expr s n, Expr s n) -> Expr s n

But we know that a pair (a, a) is the same as a function Bool -> a. So I'm going to actually use:


>   Get :: (s -> Expr s n) -> Expr s n


So instead of giving us a binary node in a tree, Get gives us a possibly infinite s-ary node. But we'll be working with s = Bool for now.

Now we need variables. We'll just use:


>   Var :: n -> Expr s n


This is a recursive type so we can write a fold for it. If we'd been using a generics library we'd get the fold for free, but I'm giving it explicitly:


> fold :: (s -> w -> w) -> ((s -> w) -> w) -> (n -> w) -> Expr s n -> w
> fold put get var (Put s x) = put s (fold put get var x)
> fold put get var (Get c) = get (\s -> fold put get var (c s))
> fold put get var (Var n) = var n


We can implement the monad for substitution:


> instance Monad (Expr s) where
>   return n = Var n
>   x >>= f = fold Put Get f x


Note how the fold recursively replaces Get with Get, Put with Put and Var n with f n.

By now the game's up. If you read my post from last week then you know that I'm building something like the State monad. But although this monad has a similar interface, internally it's nothing like State. For one thing, Expr is a tree type and internally it could end up with an arbitrarily large number of nodes. How can we extract the familiar State monad from this?

Canonical Form
Expressions in various algebraic languages can often be reduced to a canonical form. For example, we have disjunctive normal form for (a fragment of) logic. Expressions in our language above can be reduced to a canonical form. In the language of flowcharts, every one of our flowcharts can be reduced to something of the form:



Algebraically the claim is that any expression reduces to something of the form au+bv. In the general case the items a and b are of type s and u and v can be represented by elements of n. So this is something of type ((s, n), (s, n)). Again we replace the pair by a function to get the type s -> (n, s). So we want a function to convert to canonical form:


> canonical :: Expr s n -> (s -> (n, s))
> canonical = fold put get var where


We can get a return box into canonical form like this:



Algebraically this is xn→0xn+1xn and we can implement it as


> var n = \s -> (n, s)


Now we need to simplify the application of 0 or 1 to something already in canonical form. 0(au+bv) = 0au = au and 1(au+bv) = 1bv = bv. Essentially it just picks the first or second term in the sum according to whether we're multiplying on the left by 0 or 1. But that's exactly what the function type in s -> (n, s) does. So we can just write:


> put s c = \t -> c s


Now we need to consider sums of two terms in canonical form. We can derive this just by looking at the sum of two sums: (p+q)+(r+s). This is just p+s. We get this by picking the left branch twice or the right branch twice, so we're just applying our function twice to the same thing:


> get c s = c s s


Comparing with my last post you should see that canonical is what I called an interpreter. So the act of interpreting these tree expressions is one and the same thing as reducing them to a canonical form. This shouldn't be so surprising - executing a functional program is itself a process of reduction to a normal form. Here's an example of using the above reduction to get our flowchart executed:


> test1 =
>   Get $ \c -> if c
>     then Put False $
>          Get $ \d -> if d
>            then Var 5
>            else Var 4
>     else Put True $
>          Var 3
> go1 = canonical test1 False


canonical returns something of a different type from the original tree. But we can easily map back to trees again:


> inject :: (s -> (n,s)) -> Expr s n
> inject f = Get $ (\(n, s) -> Put s (Var n)) . f


So the composition inject . canonical takes a tree and reduces it to canonical form keeping it as a tree.

We can view things like this: the State s n type is just the Expr s n type modulo an equivalence relation. The usual State type is a tree but this fact is obscured because we only work with a representative of each equivalence class, not entire trees. But the Monad instance is still simply tree grafting. We can prove this as follows:


> data State s n = State { runState :: s -> (n, s) }

> instance Monad (State s) where
>   return n = State $ canonical $ Var n
>   State f >>= g = State $ canonical $ inject f >>= (inject . runState . g)


Note how the implementations consist solely of isomorphisms or wrapping/unwrapping. The type State is simply inheriting the Monad instance from the tree type. We can now put Put and Get into a form usable in the monad. Again, we're just wrapping and "canonicalising" near-trivial operations.


> get = State $ canonical (Get (\s -> Var s))
> put n = State $ canonical $ Put n (Var ())

> test2 = do
>   c <- get
>   if c
>     then do
>       put False
>       d <- get
>       if d
>         then return 5
>         else return 4
>     else do
>       put True
>       return 3

> go2 = runState test2 False
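As a quick sanity check on what go2 should produce, here's a hand-rolled state-passing version of test2 (a standalone sketch, independent of the machinery above). Starting from the flag False we take the else branch, so the result should be (3, True):

```haskell
-- Standalone sketch: test2 written as an explicit state-passing function.
test2' :: Bool -> (Int, Bool)
test2' c =
  if c
    then let s1 = False          -- put False
             d  = s1             -- d <- get
         in if d then (5, s1) else (4, s1)
    else (3, True)               -- put True; return 3
-- test2' False == (3, True), matching go2
```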


Summary
As in the previous post, we start with signatures for functions. From them we can form an expression tree type. With the substitution operation this naturally becomes a monad.

We also specified equations for our functions. This leads to the idea of a reduced form for our expression trees. The state monad (and, in fact, most of the standard monads) are examples of reduced forms. If we implement canonical for an algebra, the monad follows for free. More precisely: given a signature and equations, the tree type and monad can be derived automatically. It seems tricky to derive the rules for reducing expressions to canonical form, but once you've done that the monad for the reduced form can be derived automatically.

The act of evaluating a monadic expression is the same as reducing an expression tree to its canonical form.

Various Thoughts
Given two algebraic theories with their own signatures and equations they can be combined together. One obvious way is to simply "freely" throw all of the signatures and equations together. This gives the "sum" of the two theories. If we additionally impose some new equations, a certain type of compatibility condition between the theories, then we get the product. So starting with two theories we get the monads for the two theories and the monad for the sum or product. This might provide an alternative approach to monad transformers for combining effects. (Actually, I'm doubtful of this plan for automatically combining effects because surprising things can happen, but I still think there's something useful to be extracted from the theory.)

Could this form the basis for more monad tutorials? Programmers from imperative, functional and other backgrounds are familiar with the ideas of trees and substitutions.

By the way, it's sometimes useful to implement a monad by building a tree first and then interpreting it separately. That's what I did a few posts ago and I found it much easier than deriving a complete monad in one go.

These trees are related to the diagrams I drew here.

You can repeat much of what I said above with all kinds of algebraic structures. My ICFP talk was on what you can do with the monad that arises from vector spaces and I spent a slide or two on the rules for reduction to canonical form.

An earlier post of mine on substitution.

You can interpret the canonical form of the expressions above as saying that when you have a bunch of switches (in certain types of code) you can replace them with just one. This is reminiscent of the old folk theorem about loops. In fact, the algebra I describe above is a lot like Kleene Algebra with Tests which can be used to prove that result.

This was all motivated by this post on Lambda the Ultimate. Some comments by Derek Elkins (that he's probably long forgotten) and pieces of this by Conor McBride helped me write the code.

I originally planned to say more about forming sums and products of theories but the code is going to need more work...

Update: Following one of the links I gave above I just found Unimo which is closely related.



Thursday, November 26, 2009

Programming with impossible functions, or how to get along without monads.

A recent paper by Hyland and Power speculates about an alternative universe in which languages like Haskell develop along a path in which monads don't play a starring role. This post will be about making tentative steps in that direction to see what developer-friendly alternatives might exist to solve problems like writing code using I/O in an elegant and readable manner. I think all of what I'll say was worked out 20 years or more ago but there may be some good reasons to think about this stuff again.

We'll need some Haskell language extensions


> {-# LANGUAGE GADTs, RankNTypes, TypeOperators #-}

> import Prelude hiding (read, IO)
> import Control.Monad.Reader hiding (Reader)


Introduction
Let's start with this problem: We'd like to write pure functional code to read input values of type i sequentially from some source. In an imperative language we could define a function taking no arguments and returning type i. It might look like read :: () -> i, or even just read :: i. But in Haskell such a thing is impossible. Unlike in imperative languages, the value of something of type read :: i is a constant so it can't give a different value each time it is used. How could we implement something like the impossible function read in Haskell?

Here's a general sketch of how I plan to proceed: In some sense we want to add a new language feature to Haskell. So we need a new language that is Haskell plus read. Rather than write a whole new language, we can just define an embedded DSL using Haskell as our host language. When we want to perform read operations we build an abstract syntax tree (AST) for our code in the new language and pass it to an interpreter. Sounds complicated, but the last step will be to show how we can eliminate all of these steps and be left with simple and efficient Haskell code that does exactly what we want. I'll show you what the final code will look like for our read example:


> read_example1 =
>   read $ \x ->
>   read $ \y ->
>   let z = x+y in
>   read $ \w ->
>   rvalue (z*w)


The same methodology will also work for other non-functional features.

A Reader
We start by defining a type to represent our AST in our read-extended language:


> data Reader i x where


I'm using GADTs because it's useful to make explicit the type of the constructors. The reason for i and x will be explained by the signatures of the constructors. Firstly, we need to represent the value of computations of any type that don't read anything. So we need a constructor like:


>   RValue :: x -> Reader i x


Now how should we represent a computation that reads something? Well suppose we're reading something of type i (for input) from some source. We'd like the AST to tell us how such a value would be used once it is read. We express such an idea by writing a function that takes an input value and maps it to an AST. In other words:


>   Read :: (i -> Reader i x) -> Reader i x


The argument to Read is an example of a continuation. It describes how to continue after an input is received.

That's it. Now we can write code like


> read_example2 = Read $ \x ->
>   Read $ \y ->
>   RValue (x+y)


There's just one thing wrong. It doesn't do anything. We have the syntax but no semantics. So we need to write an interpreter. We have some choice here, but I'll do something easy. I'll make our imaginary read function pick off values one by one from a list of inputs. Here is the implementation. We interpret an RValue by simply unwrapping it and ignoring the list of inputs:


> read_interpreter (RValue x) is = x


We handle Read by picking off the first input, applying our continuation to it, and then carrying on by interpreting the result:


> read_interpreter (Read cont) (i:is) = read_interpreter (cont i) is


Now we're ready to go:


> go1 = read_interpreter read_example2 [1,2,3,4,5]
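If you want to run this outside the literate program, here's the same type and interpreter repeated as a self-contained sketch (with the interpreter renamed readInterpreter, and an extra case for running out of inputs):

```haskell
-- Standalone sketch: the Reader AST and its interpreter, repeated
-- so this block runs on its own.
data Reader i x = RValue x | Read (i -> Reader i x)

readInterpreter :: Reader i x -> [i] -> x
readInterpreter (RValue x)  _      = x
readInterpreter (Read _)    []     = error "ran out of inputs"
readInterpreter (Read cont) (i:is) = readInterpreter (cont i) is

example :: Reader Int Int
example = Read $ \x -> Read $ \y -> RValue (x + y)
-- readInterpreter example [1,2,3,4,5] picks off 1 and 2 and returns 3
```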


One obvious advantage of leaving things as an AST is that we can apply different interpreters to read from different sources. But if we know what our source is to be we can optimise a bit by completely eliminating the interpreter. We can do this using a completely generic method that I described here. The Reader type is a pretty standard recursive type. In our examples we build a type, but when we interpret we immediately tear it apart again in a reduction operation. The reduction is an example of a generalised fold. Following what I described in that earlier post, the reduction operation looks like this:


> rfold :: (x -> w) -> ((c -> w) -> w) -> Reader c x -> w
> rfold rvalue read (RValue x) = rvalue x
> rfold rvalue read (Read cont) = read (rfold rvalue read . cont)


If you look at read_interpreter you can see there are two parts. There's the part that does the work of picking off inputs, and there's the part that recursively applies read_interpreter. rfold provides the latter so we only have to implement the former. So we now get:


> read_interpreter' = rfold rvalue read where
>   rvalue x is = x
>   read cont (i:is) = cont i is

> go2 = read_interpreter' read_example2 [1,2,3,4,5]


Now comes the magic step described in that earlier post. Rather than separately build a tree, and then fold it, we can fuse the two operations together by simply replacing the constructors with the arguments to rfold:


> read cont (i:is) = cont i is
> rvalue x is = x


And now we can test the first example:


> go3 = read_example1 [1,2,3,4,5]


All that remains is to work out the details of some more examples.

Writer
An obvious one is a writer. In an imperative language we'd expect a signature like write :: o -> (). Our AST will need a node to represent a write operation. It'll need to contain both the written value, and the continuation. In this case the continuation doesn't actually depend on a hypothetical input, it's just another AST. So we get


> data Writer o x where
>   Write :: o -> Writer o x -> Writer o x
>   WValue :: x -> Writer o x


Again we need to interpret. We can do this a bit like the Writer monad. The final value will be a pair consisting of a value and a list of outputs.


> write_interpreter (Write o x) = let (v, os) = write_interpreter x in (v, o:os)
> write_interpreter (WValue x) = (x, [])


And now we can implement things like:


> write_example1 from to =
>   if from<=to
>     then Write from $ write_example1 (from+1) to
>     else WValue ()

> go4 = write_interpreter (write_example1 10 20)


Again we can construct a fold for this type:


> wfold :: (x -> w) -> (o -> w -> w) -> Writer o x -> w
> wfold value write (WValue x) = value x
> wfold value write (Write o x) = write o (wfold value write x)

> write_interpreter' = wfold value write where
>   value x = (x, [])
>   write o (x, os) = (x, o:os)


And eliminating the fold gives us:


> wvalue x = (x, [])
> write o (x, os) = (x, o:os)


Maybe you might have guessed an implementation like that without going through the AST. But it's good to step back and look at the pattern emerging here. For an imperative function of type v1 -> v2 -> ... -> vn -> w we want an AST constructor of type v1 -> v2 -> ... -> vn -> (w -> AST) -> AST, where AST is the type of our tree. Figuring that out gives us a half-way point, making the job of writing our interpreter slightly easier. You can sort of read the signature as saying "there's no function of type v1 -> v2 -> ... -> vn -> w doing what I want, but imagine I had one anyway; what would I do with the result it gave me?".

But, if you look closely at the definition of Writer you might notice that it is, in fact, a type of list structure. It's basically a list of elements of type o terminating in an element of type x. All the interpreter is doing is converting it to a conventional list structure. And if you go back to the definition of Reader you'll see that it too is exactly the right type for the job. It's a sort of dual to a list, one that consumes elements as you work along its length rather than yielding them. We only need an interpreter because we want to apply this object to a specific datatype, a list of inputs.
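That observation can be made precise. Here's a standalone sketch (the helper names toPair and fromPair are mine) of the isomorphism between Writer o x and a pair of a value and a list of outputs:

```haskell
-- Standalone sketch: Writer o x is a list of o's terminated by an x.
data Writer o x = Write o (Writer o x) | WValue x deriving (Eq, Show)

toPair :: Writer o x -> (x, [o])
toPair (WValue x)  = (x, [])
toPair (Write o w) = let (x, os) = toPair w in (x, o : os)

fromPair :: (x, [o]) -> Writer o x
fromPair (x, [])     = WValue x
fromPair (x, o : os) = Write o (fromPair (x, os))
-- the two functions are mutually inverse
```

toPair is exactly what the interpreter above computes.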

The fact that we can derive these types automatically from signatures is a really wonderful and deep point that I'm going to have to return to in another post. Usually, when you implement a monad, you have to come up with a type that can express the semantics you want. But that type can (in some sense) be derived automatically from the signatures of the operations you want, and some equations specifying how these operations interact.

IO
We can now try to combine the writer and reader to handle simultaneous input and output:


> data IO i o x where
>   IValue :: x -> IO i o x
>   IRead :: (i -> IO i o x) -> IO i o x
>   IWrite :: o -> IO i o x -> IO i o x


The interpreter is very similar to each of the individual interpreters. It applies our AST to a list of inputs and gives us back a pair consisting of a return value and a list of outputs:


> io_interpreter (IValue x) is = (x, [])
> io_interpreter (IRead cont) (i:is) = io_interpreter (cont i) is
> io_interpreter (IWrite o cont) is = let (x, os) = io_interpreter cont is in (x, o:os)

> io_example1 =
>   IRead $ \x ->
>   IRead $ \y ->
>   IWrite (x+y) $
>   IValue ()

> go5 = io_interpreter io_example1 [1..10]
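Here's the same interpreter as a self-contained sketch you can run on its own; I've renamed the type IO' so it doesn't clash with the Prelude's IO:

```haskell
-- Standalone sketch of the combined reader/writer interpreter.
data IO' i o x
  = IValue x
  | IRead (i -> IO' i o x)
  | IWrite o (IO' i o x)

interp :: IO' i o x -> [i] -> (x, [o])
interp (IValue x)   _      = (x, [])
interp (IRead cont) (i:is) = interp (cont i) is
interp (IRead _)    []     = error "ran out of inputs"
interp (IWrite o k) is     = let (x, os) = interp k is in (x, o : os)

example :: IO' Int Int ()
example = IRead $ \x -> IRead $ \y -> IWrite (x + y) (IValue ())
-- interp example [1..10] reads 1 and 2 and writes 3, giving ((), [3])
```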


The corresponding fold function is an amalgam of the previous ones:


> ifold :: (x -> w) -> ((i -> w) -> w) -> (o -> w -> w) -> IO i o x -> w
> ifold value read write (IRead cont) = read (ifold value read write . cont)
> ifold value read write (IWrite o cont) = write o (ifold value read write cont)
> ifold value read write (IValue x) = value x

> io_interpreter' = ifold value read write where
>   value x is = (x, [])
>   read cont (i:is) = cont i is
>   write o cont is = let (x, os) = cont is in (x, o:os)

> go6 = io_interpreter' io_example1 [1..10]

> ivalue x is = (x, [])
> iread cont (i:is) = cont i is
> iwrite o cont is = let (x, os) = cont is in (x, o:os)

> io_example2 =
>   iread $ \x ->
>   iread $ \y ->
>   iwrite (x+y) $
>   ivalue ()

> go7 = io_example2 [1..10]


Playing with Control Flow
But we're not just limited to these kinds of side-effects. We can also play with control flow. For example, consider an imperative function of type throw :: e -> () that throws a value out of our computation to the top level. If we followed the description above we'd be tempted to replace that with something of type e -> Exception e x -> Exception e x, just like with write. But that type uses a continuation to say how to proceed after writing. We don't want there to be any continuation, we just want to bomb out with the thrown value. So the type we need is actually e -> Exception e x.


> data Exception e x where
>   Throw :: e -> Exception e x
>   EValue :: x -> Exception e x


You might recognise that. It's just Either. But we can go through the motions anyway:


> exception_interpret (Throw e) = Left e
> exception_interpret (EValue x) = Right x

> safediv x y = if y==0 then Throw "SIGFPE" else EValue (x `div` y)
> throw_example1 = safediv 1 0
> go8 = exception_interpret throw_example1

> efold :: (e -> w) -> (x -> w) -> Exception e x -> w
> efold throw evalue (Throw e) = throw e
> efold throw evalue (EValue x) = evalue x

> exception_interpret' = efold Left Right

> throw = Left
> evalue = Right


On its own that's not a very useful example. Still, it gives a warmup for a much more interesting control-flow affecting example. Suppose we want to implement the imperative function fork :: [u] -> u that forks one process for each element of the list with each fork getting that element as the return value. So let x = fork [1,2,3] would result in three threads, one with x=1, one with x=2 and one with x=3.


> data ManyWorlds x where
>   MValue :: x -> ManyWorlds x


The following line follows directly by applying the principle I described earlier to fork :: [u] -> u:


>   Fork :: [u] -> (u -> ManyWorlds x) -> ManyWorlds x


Now we need an interpreter. The state of the system at any time is a list of values, one for each thread. So we can use


> manyworlds_interpret (MValue x) = [x]


When we fork, we want to apply the continuation to each element of the list. We'll then have a list of lists of threads which we collect together into a single list using concat.


> manyworlds_interpret (Fork as cont) = concat (map manyworlds_interpret (map cont as))

> fork_example1 =
>   Fork [1,2,3] $ \x ->
>   Fork [10,20,30] $ \y ->
>   MValue (x+y)

> go9 = manyworlds_interpret fork_example1


Again we can write this as a fold:


> mfold :: (x -> w) -> (forall u. [u] -> (u -> w) -> w) -> ManyWorlds x -> w
> mfold value fork (MValue x) = value x
> mfold value fork (Fork xs u) = fork xs (mfold value fork . u)

> manyworlds_interpret' = mfold value fork where
>   value u = [u]
>   fork us cont = concatMap cont us


And the interpreter-free version is:


> fork xs cont = concatMap cont xs
> mvalue u = [u]


I hope that looks familiar!

Here's a recursive example to generate the cartesian product of a bunch of lists:


> prod [] = [[]]
> prod (x:xs) = fork x $ \u ->
>   fork (prod xs) $ \v ->
>   mvalue (u:v)

> fork_example2 = prod [[1,2],[3,4],[10,20]]
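Here's the interpreter-free version repeated as a self-contained sketch, together with what prod produces on a small input:

```haskell
-- Standalone sketch: the interpreter-free fork and mvalue, and prod.
fork :: [u] -> (u -> [x]) -> [x]
fork xs cont = concatMap cont xs

mvalue :: u -> [u]
mvalue u = [u]

-- cartesian product of a list of lists
prod :: [[a]] -> [[a]]
prod [] = [[]]
prod (x:xs) = fork x $ \u ->
  fork (prod xs) $ \v ->
  mvalue (u:v)
-- prod [[1,2],[3,4]] gives [[1,3],[1,4],[2,3],[2,4]]
```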


Abstract Nonsense
This section is just to give a taste of some deeper stuff going on.

The problem to solve is this: given two side-effectful "extensions" like those above, how to you combine them and still keep functional purity?

As above, given just the signatures of the "functions" in our extensions to Haskell we can build a datatype. These signatures form part of the description of an algebraic theory. The other part is a set of equations, something I skipped above. For example, if I had talked about State then we might have an equation stating that if we do a put, followed by two gets, then the gets would both "get" the same value. With a bunch of signatures and equations, the corresponding type, and even a monad, follow automatically. (You can probably see the monads lurking behind all of my examples above.)
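For instance, the put-then-get equation can be checked directly in a minimal hand-rolled state-passing encoding (a standalone sketch; ret, bind, put' and get' are my own names, not from this post):

```haskell
-- Standalone sketch: just enough of a state monad to state the law
-- "put, then two gets: both gets see the same value".
type State s a = s -> (a, s)

ret :: a -> State s a
ret a s = (a, s)

bind :: State s a -> (a -> State s b) -> State s b
bind m f s = let (a, s') = m s in f a s'

put' :: s -> State s ()
put' s _ = ((), s)

get' :: State s s
get' s = (s, s)

-- put 7 then get twice ...
lhs :: State Int (Int, Int)
lhs = put' 7 `bind` \_ -> get' `bind` \x -> get' `bind` \y -> ret (x, y)

-- ... equals put 7 then get once, using the value twice
rhs :: State Int (Int, Int)
rhs = put' 7 `bind` \_ -> get' `bind` \x -> ret (x, x)
-- both give ((7,7),7) from any initial state
```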

Now, when we write two monads, there is no automatic way to combine them. We have to instead write monad transformers and there is no theory for how to automatically construct them.

On the other hand, there are ways to automatically combine algebraic theories (and hence the monads derived from them). So the latter approach seems like it could be much more powerful. The theory is the primary thing, the monad is just something derived from it.

But the catch is that when I said "automatically" I meant in the sense that there is a precise mathematical description. It's not clear to me yet how to turn the entire process into something that a compiler could perform for you.

And the goal isn't really to eliminate monads. In fact, do-notation is a great way to work and gives nicer looking code than my examples. It's just that maybe we need to be thinking about the theories first and monads second.


You are a protocol droid, are you not?
Anyway, on a tangent, here's a bit of code I spun off from the above that runs as a self-contained example in its own .hs file. I tweaked read and write for the IO case by using the constructor (,) to construct heterogeneous lists instead of using (:) to construct homogeneous lists. I then feed the outputs of a pair of IO expressions into each other as inputs. The result is a type-safe way to implement simple imperative looking coroutines in a couple of lines of code. Good for playing with protocols as opposed to algorithms:


import Prelude hiding (read)

read f (c,cs) = f c cs
write o r is = let (v, os) = r is in (v, (o,os))
value x is = (x, ())

-- Diffie-Hellman key exchange

p = 23
g = 5

alice =
  write "Hello" $
  let a = 6 in
  write (g^a `mod` p) $
  read $ \b ->
  let s = b^a `mod` p in
  value s

bob =
  read $ \message ->
  let b = 15 in
  write (g^b `mod` p) $
  read $ \a ->
  let s = a^b `mod` p in
  value s

(alice_value, alice_output) = alice bob_output
(bob_value, bob_output) = bob alice_output

go = print (bob_value,alice_value)




Wednesday, November 04, 2009

Memoizing Polymorphic Functions with High School Algebra and Quantifiers

A little while back Conal Elliott asked about the memoization of polymorphic types. I thought it'd be fun to describe how to memoize such functions in the same spirit as Ralph Hinze's use of tries to memoize non-polymorphic functions. Along the way I'll try to give a brief introduction to quantified types in Haskell as well as showing some applications of the Yoneda lemma at work.

You can think of a generalized trie for a function type T as a type that's isomorphic to T but doesn't have an arrow '->' anywhere in its definition. It's something that contains all the same information as a function, but as a data structure rather than as a function. Hinze showed how to construct these by using the high school algebra axioms on non-polymorphic types. Polymorphic types are types involving quantification. So to make suitable tries for these we need to add some rules for handling quantifiers to high school algebra.

At first it seems unlikely that we could memoize polymorphic types. When Hinze demonstrated how to construct generalized tries he showed how to make a tree structure that was tailored to the specific types in hand. With polymorphic functions we don't know what types we'll be dealing with, so we need a one-size-fits-all type. That sounds impossible, but it's not.

The first thing we need to look at is universal quantification. Suppose F(a) is a type expression involving the symbol a. Then the type ∀a.F(a) can be thought of as being a bit like the product of F(a) for all possible values of a. So ∀a.F(a) is a bit like the imaginary infinite tuple

data Forall f a = (f Bool, f Int, f Char, f String, f [Bool], f (IO (Int -> Char)), ...)

One reason you can think of it this way is that all of the projections exist. So for any type you choose, say B, there is a function (∀a.F(a)) -> F(B). In Haskell the ∀ is written as forall and probably the best known example is the Haskell id function of type forall a. a -> a. For any concrete type B, id gives us a function of type B -> B. Note that we usually write the signature simply as a -> a. Haskell implicitly prepends a forall for every free variable in a type. We have to use the following line of code if we want to be able to use forall explicitly (among other things):


> {-# LANGUAGE RankNTypes, ExistentialQuantification, EmptyDataDecls #-}


I'll approach the tries through a series of propositions. So here's our first one:

Proposition 1
∀a. a = 0

0 is the type with no elements. ∀a. a is a type that can give us an object of type B for any B. There is no way to do this. How could such a function manufacture an element of B for any B with nothing to work from? It would have to work even for types that haven't been defined yet. (By the way, do you notice a similarity with the axiom of choice?) So ∀a. a is the type with no elements. Here's the usual way to write the type with no elements:


> data Void


We also have:
Proposition 2
∀a. a^a = 1

If we have a function of type forall a. a -> a then for any element of type a you give it, it can give you back an element of type a. There's only one way to do this - it must give you back what you gave it. It can't transform that element in any way because there is no uniform transformation you could write that works for all values of a. So ∀a. a^a has one element, id.

A slightly more interesting proposition is this:
Proposition 3
∀a. a^(a.a) = 2

A function of type (a,a) -> a gives you an a when you give it a pair of a's. As we don't know in advance what type a will be we can't write code that examines a in any way. So a function of this type must return one of the pair, and which one it returns can't depend on the value of the argument. So there are only two functions of this type, fst and snd.
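To make that concrete, here are the two inhabitants written out explicitly (a tiny sketch of my own; the names pick1 and pick2 are invented):

```haskell
-- The only two total functions of type forall a. (a, a) -> a.
pick1, pick2 :: (a, a) -> a
pick1 = fst
pick2 = snd

main :: IO ()
main = print (pick1 (10, 20), pick2 (10, 20))  -- prints (10,20)
```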

We can rewrite the last proposition as ∀a. a^(a^2) = 2. That suggests that maybe ∀a. a^(a^n) = n for any type n. We can go one better. Here's another proposition:

Proposition 4
For any functor F and type n, ∀a. F(a)^(a^n) = F(n)

I've already talked about that result. Here's an implementation of the isomorphisms:


> yoneda :: (forall b . (a -> b) -> f b) -> f a
> yoneda t = t id

> yoneda' :: Functor f => f a -> (forall b . (a -> b) -> f b)
> yoneda' a f = fmap f a


Throughout this article I'll use the convention that if f is an isomorphism, f' is its inverse.
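As a quick sanity check, here is the round trip at a couple of concrete functors (repeating the definitions above so the snippet stands alone):

```haskell
{-# LANGUAGE RankNTypes #-}

yoneda :: (forall b . (a -> b) -> f b) -> f a
yoneda t = t id

yoneda' :: Functor f => f a -> (forall b . (a -> b) -> f b)
yoneda' a f = fmap f a

main :: IO ()
main = do
  print (yoneda (yoneda' (Just 3)) :: Maybe Int)  -- Just 3
  print (yoneda' [1, 2, 3 :: Int] show)           -- ["1","2","3"]
```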

Now it's time to look at a kind of dual of the above propositions. Instead of universal quantification we'll consider existential quantification. The type ∃a.F(a) is a kind of infinite sum of all types of the form F(a). We can imagine it being a bit like the following definition:

data Exist f a = ABool (f Bool) | AnInt (f Int) | AChar (f Char) | AString (f String) | AListOfBool (f [Bool]) ...

The important point is that given any element of any type we can turn it into an element of ∃a.F(a). You'd think that we could write this in Haskell as exists a. F(a) but unfortunately Haskell does things differently. The idea behind the notation is this: we can put anything of type F(b) into it. So if X = ∃a.F(a) then we have a function F(a) -> X for any a. In other words, we have a function of type ∀a. F(a) -> X. So although this type is existentially quantified, its constructor is universally quantified. We tell Haskell to make a type existentially quantified by telling it the constructor is universally quantified:


> data Exist f a = forall b. Exist (f b)


You can think of Exist as not being a single constructor, but an infinite family of constructors, like ABool, AnInt, etc. above.

If you have an element of an ordinary non-polymorphic algebraic sum type then the only way you can do stuff to it is to apply case analysis. To do something with an existential type means you have to perform a kind of infinite case analysis. So to do something with an element of ∃a. F(a) you need to provide an infinite family of functions, one for each possible type. In other words, you need to apply a function of type ∀a. F(a) → B to it.

Time for another proposition:

Proposition 5
∃a. a = 1

It seems weird at first that the sum of all types is 1. But once you put something into this type, you can no longer get any information about it back out again. If you try doing case analysis you have to provide a polymorphic function that accepts an argument of type ∀a. a, which is as good as saying you can't do any case analysis. Proposition 5 is actually a special case of the following:

Proposition 6
For any functor f, ∃a. (n^a, f(a)) = f(n)

Briefly, the reason this is true is that the only thing you can do with a matching pair of n^a and f(a) is apply the former to the latter using fmap. This is a kind of dual to the Yoneda lemma and I say more about it here.

We already know from high school algebra that this is true:

Proposition 7
x^(y+z) = x^y.x^z.

We can write the isomorphisms explicitly:


> prop7 :: (Either a b -> c) -> (a -> c, b -> c)
> prop7 f = (f . Left, f . Right)

> prop7' :: (a -> c, b -> c) -> Either a b -> c
> prop7' = uncurry either


It should be no surprise that the following 'infinite' version is true as well:

Proposition 8
x^(∃a. f(a)) = ∀a. x^f(a).

We can write the isomorphism directly:


> prop8 :: (Exist f a -> x) -> forall b. f b -> x
> prop8 g x = g (Exist x)
> prop8' :: (forall b. f b -> x) -> Exist f a -> x
> prop8' g (Exist x) = g x


We're now equipped to start constructing generalized tries for polymorphic functions. So let's consider memoizing the type forall a. [a] -> f a, for f a functor. At first this looks hard. We have to memoize a function that can take as argument a list of any type. How can we build a trie if we don't know anything in advance about the type of a? The solution is straightforward. We follow Hinze in applying a bit of high school algebra along with some of the propositions from above. By definition, L(a) = [a] is a solution to the equation L(a) = 1+a.L(a). So we want to simplify ∀a. f(a)^L(a). We have

f(a)^L(a) = f(a)^(1+a.L(a)) = f(a).f(a)^(a.L(a)) = f(a).f(a)^(a+a^2.L(a)) = f(a).f(a)^a.f(a)^(a^2.L(a))

I hope you can see a bit of a pattern forming. Let's define T(n) = ∀a. f(a)^(a^n.L(a)). Then

T(n) = ∀a. f(a)^(a^n.(1+a.L(a))) = ∀a. f(a)^(a^n).f(a)^(a^(n+1).L(a)) = f(n).T(n+1)

using Proposition 4 (and the fact that ∀ distributes over products) for the last step.

That's it! We can translate this definition directly into Haskell.


> data T f n = T (f n) (T f (Maybe n))


I'm using the fact that Maybe n is standard Haskell for the type n+1. (But note that this equality is only valid when we think of the list type as data, not codata. So like with Hinze's original tries, values at infinite lists don't get memoized.)

To build the isomorphism we need to trace through the steps in the derivation. At one point we used a^n+a^(1+n).L(a) = a^n.L(a) which we can implement as the pair of isomorphisms:


> lemma :: Either (n -> a) (Maybe n -> a, [a]) -> (n -> a, [a])
> lemma (Left f) = (f, [])
> lemma (Right (f, xs)) = (\n -> f (Just n),f Nothing : xs)

> lemma' :: (n -> a, [a]) -> Either (n -> a) (Maybe n -> a, [a])
> lemma' (f, []) = Left f
> lemma' (f, x:xs) = Right (maybe x f, xs)
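A quick round trip through lemma and lemma' (my own check, with n = Bool, repeating the definitions so it runs standalone):

```haskell
lemma :: Either (n -> a) (Maybe n -> a, [a]) -> (n -> a, [a])
lemma (Left f) = (f, [])
lemma (Right (f, xs)) = (\n -> f (Just n), f Nothing : xs)

lemma' :: (n -> a, [a]) -> Either (n -> a) (Maybe n -> a, [a])
lemma' (f, []) = Left f
lemma' (f, x:xs) = Right (maybe x f, xs)

main :: IO ()
main = do
  -- Round-tripping (fromEnum, [10,20]) should give back an equivalent pair.
  let (g, ys) = lemma (lemma' (fromEnum, [10, 20] :: [Int]))
  print (map g [False, True], ys)  -- ([0,1],[10,20])
```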


We can put the other steps together with this to give:


> memoize :: Functor f => (forall a. (n -> a, [a]) -> f a) -> T f n
> memoize f = let x = prop7 (f . lemma)
>             in T (yoneda (fst x)) (memoize (snd x))

> memoize' :: Functor f => T f n -> forall a. (n -> a, [a]) -> f a
> memoize' (T a b) = let y = (yoneda' a, memoize' b)
>                    in prop7' y . lemma'


Let's try a few examples. I'll use the identity functor for the first example.


> data I a = I a deriving Show
> instance Functor I where
>   fmap f (I a) = I (f a)


Here's our first test function and some data to try it on:


> test1 (f, xs) = I $ if length xs>0 then head xs else f ()
> data1 = (const 1,[2,3,4])


In data1 we're using a function to represent a kind of 'head' before the main list. For the next example we're leaving the first element of the pair undefined so that data2 is effectively of list type:


> test2 (f, xs) = reverse xs
> data2 = (undefined,[1..10])


We can test them by building the memoized versions of these functions.


> memo1 = memoize test1 :: T I ()
> memo2 = memoize test2 :: T [] Void


and then apply them


> ex1 = memoize' memo1 data1
> ex2 = memoize' memo2 data2


It appears to work!
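To see the output for yourself, here is everything gathered into one self-contained module (the code above repeated, apart from an extra Eq instance and type signatures on the test functions, which I've added for testability). On data1, test1 takes the head of the list, so we expect I 2; on data2 we expect the reversed list:

```haskell
{-# LANGUAGE RankNTypes, EmptyDataDecls #-}

data Void

yoneda :: (forall b. (a -> b) -> f b) -> f a
yoneda t = t id

yoneda' :: Functor f => f a -> (forall b. (a -> b) -> f b)
yoneda' a f = fmap f a

prop7 :: (Either a b -> c) -> (a -> c, b -> c)
prop7 f = (f . Left, f . Right)

prop7' :: (a -> c, b -> c) -> Either a b -> c
prop7' = uncurry either

data T f n = T (f n) (T f (Maybe n))

lemma :: Either (n -> a) (Maybe n -> a, [a]) -> (n -> a, [a])
lemma (Left f) = (f, [])
lemma (Right (f, xs)) = (\n -> f (Just n), f Nothing : xs)

lemma' :: (n -> a, [a]) -> Either (n -> a) (Maybe n -> a, [a])
lemma' (f, []) = Left f
lemma' (f, x:xs) = Right (maybe x f, xs)

memoize :: Functor f => (forall a. (n -> a, [a]) -> f a) -> T f n
memoize f = let x = prop7 (f . lemma)
            in T (yoneda (fst x)) (memoize (snd x))

memoize' :: Functor f => T f n -> forall a. (n -> a, [a]) -> f a
memoize' (T a b) = let y = (yoneda' a, memoize' b)
                   in prop7' y . lemma'

data I a = I a deriving (Show, Eq)
instance Functor I where
  fmap f (I a) = I (f a)

test1 :: (() -> a, [a]) -> I a
test1 (f, xs) = I $ if length xs > 0 then head xs else f ()

test2 :: (Void -> a, [a]) -> [a]
test2 (_, xs) = reverse xs

main :: IO ()
main = do
  print (memoize' (memoize test1 :: T I ()) (const 1, [2, 3, 4 :: Int]))
  print (memoize' (memoize test2 :: T [] Void) (undefined, [1 .. 10 :: Int]))
```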

So what's actually going on? We have

T(0) = f(0).T(1) = f(0).f(1).T(2) = ... = f(0).f(1).f(2).f(3)...


Now consider a function g :: [a] -> f a applied to a list. If the list isn't an infinite stream then it must have a certain length, say n. From these elements it builds something of type f a. However this f a is constructed, each of the elements of type a in it must be constructed from one of the n elements in the list. So if we apply g to the list [0,1,2,...,n-1] it will construct an element of f a where each a in it contains a label saying which position in the list it came from. (Compare with Neel's comment here). If we use integers we don't get a perfect trie because there are more elements of type f Integer than there are ways to indicate source positions. What we need is that for each length of list, n, we have a type with precisely n elements. And that's what the type n gives us.

We can memoize many different functions this way, though if the functor f is a function type you'll need to use some of Hinze's techniques to eliminate them. And you'll notice I haven't used all of the propositions above. I've tried to give some extra tools to allow people to memoize more types than just my two examples.

One last thing: I wouldn't use the type above to memoize in a real world application. But the methods above could be used to derive approximate tries that are efficient. One obvious example of an approximation would be to use Int instead of the finite types.


Update: I forgot to provide one very important link. This post was inspired by Thorsten Altenkirch's post here.

Saturday, October 31, 2009

Buffon's Needle, the Easy Way

Buffon's needle is a popular probability problem. Rule lines on the floor a distance d apart and toss a needle of length l<d onto it. What is the probability that the needle crosses a line? A solution is described at wikipedia but it involves a double integral and some trigonometry. Nowhere does it mention that there is a less familiar but much simpler proof, though if you follow the links you'll find it. In addition, the usual solution involves π but gives little intuition as to why π appears. The simpler proof reveals that it appears naturally as a ratio of the circumference of a circle to its diameter. I've known this problem since I was a kid and yet I hadn't seen the simpler proof until a friend sold me his copy of Introduction to Geometric Probability for $5 a few days ago.

So instead of solving Buffon's needle problem we'll solve what appears to be a harder problem: when thrown, what is the expectation of the number of times a rigid wire, curved in a plane, of length l (no restriction on l) crosses one of our ruled lines d apart? Here's an example of one of these 'noodles'. It crosses the ruled lines three times:

Expectation is linear in the sense that E(A+B) = E(A)+E(B). So if we imagine the wire divided up into N very short segments of length l/N the expectation for the whole wire must be the sum of the expectations for all of the little pieces. If the wire is well behaved, for N large enough the segments are close to identical straight line segments. Here's a zoomed up view of a piece of our noodle:

For a small straight line segment the expectation must simply be a function of the length of the segment. The expectation for the whole wire is the expectation for one segment multiplied by the number of segments. In other words, the expectation is proportional to the length of the wire and we can write E(l)=kl for some constant k.

Now we know it's proportional to the length, we need to find the constant of proportionality, k. We need to 'calibrate' by thinking of a noodle shape where we know in advance exactly how many times it will cross the ruled lines. The following picture gives the solution:

A circle of diameter d will almost always cross the lines in two places. The length of this wire is πd so E(πd)=2 and k=2/(πd).

The expected number of crossings for a wire of length l is 2l/(πd). A needle of length l<d can intersect only zero or one times. So the expected value is in fact the probability of intersecting a line. The solution is 2l/(πd).

No integrals needed.
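The 2l/(πd) formula is easy to spot-check numerically. Here's a small Monte Carlo simulation (my own code, not part of the argument above; the hand-rolled linear congruential generator is just to keep the snippet dependency-free):

```haskell
-- Estimate the probability that a needle of length l crosses a line on a
-- floor ruled at spacing d, and compare with 2*l/(pi*d).
lcg :: Integer -> Integer
lcg s = (6364136223846793005 * s + 1442695040888963407) `mod` 2 ^ 64

uniforms :: Integer -> [Double]  -- pseudo-random values in [0,1)
uniforms s = map (\x -> fromIntegral x / 2 ^ 64) (tail (iterate lcg s))

-- y: distance from the needle's centre to the line below it, in [0,d).
-- theta: angle with the lines, in [0,pi).  The needle crosses a line
-- when its vertical half-extent reaches past a line.
crosses :: Double -> Double -> (Double, Double) -> Bool
crosses l d (y, theta) = let e = (l / 2) * abs (sin theta)
                         in y < e || d - y < e

estimate :: Double -> Double -> Int -> Double
estimate l d n = fromIntegral (length (filter (crosses l d) trials))
                 / fromIntegral n
  where pairUp (a : b : rest) = (d * a, pi * b) : pairUp rest
        pairUp _ = []
        trials = take n (pairUp (uniforms 2009))

main :: IO ()
main = do
  print (estimate 1 2 100000)  -- should be close to the next line
  print (2 / (2 * pi))         -- 2l/(pi*d) with l=1, d=2, about 0.318
```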

The expected number of crossings is an example of an invariant measure, something I've talked about before. There are only a certain number of functions of a noodle that are additive and invariant under rotations and just knowing these facts is almost enough to pin down the solution.

Puzzle


Now I can leave you with a puzzle to solve. In the UK, a 50p coin is a 7 sided curvilinear polygon of constant width. Being constant width means a vending machine can consistently measure its width no matter how the coin is oriented in its plane. Can you use a variation of the argument above to compute the circumference of a 50p coin as a function of its width?


Tuesday, October 13, 2009

"What Category do Haskell Types and Functions Live In?"

The question in my title is one that is often raised by Haskell programmers and it's a difficult one to answer rigorously and satisfyingly. But you may notice that I've put the question in quotes. This is because I'm not asking the question myself. Instead I want to argue that often there's a better question to ask.

Superficially Haskell looks a lot like category theory. We have types that look like objects and functions that look like arrows. Given two functions we can compose them just how arrows compose in a category. We also have things that look like products, coproducts, other kinds of limit including infinite ones, natural transformations, Kan extensions, monads and quite a bit of 2-categorical structure.

So what goes wrong? (Besides the obvious problem that on a real computer, composing two working functions might result in a non-working function because you run out of memory.)

Among other things, Haskell functions can fail to terminate because of things like infinite loops. Computer scientists often use the notation ⊥ to represent a non-terminating computation. So when we talk of the Haskell integers, say, we don't just mean the values 0, 1, 2, ... but we also have to include ⊥. Unfortunately, when we do this we break a few things. For one thing we no longer have coproducts. But people find it useful to talk about algebraic datatypes as constructing types using products and coproducts and that would be a hard thing to give up.

So we could restrict ourselves to considering only the category theory of computable functions. But that's not a trivial thing to do either, and it doesn't reflect what real Haskell programs do.

But even if we did manage to tweak this and that to get a bona fide category out of Haskell, all we'd get is a custom tailored category that serves just one purpose. One theme running through much of my blog is that Haskell can be used to gain an understanding of a nice chunk of elementary category theory in general. Showing that Haskell simply gives us one example of a category really isn't that interesting. When I talked about the Yoneda Lemma I felt like I was talking about more than just one property of some obscure category that I can't actually define and that most category theorists have never even heard of.

So what's going on? Why does it feel like Haskell is so naturally category theoretical while the details are so messy?

Going back to my Yoneda lemma code, consider my definition of check

> check a f = fmap f a

It's straightforward to translate this into standard category theoretical notation that applies to any category. Even though the code is implemented in a specific programming language there's nothing about it that prevents it being translated for use in any category. So it doesn't matter what category Haskell corresponds to. What matters is that this bit of code is written in language suitable for any category. And the proof I give can be similarly translated.

Consider this standard problem given to category theory students: prove that (A×B)×C is isomorphic to A×(B×C). In Haskell we could construct the isomorphism as:

> iso :: ((a,b),c) -> (a,(b,c))
> iso ((x,y),z) = (x,(y,z))

But now we hit a problem. We can straightforwardly translate this into mathematical notation and it will give a valid isomorphism in the category of sets, Set. But iso is written to accept arguments which are elements of some type. Not all objects in categories have elements, and arrows might not correspond to functions. And even if they did, if we were working with (certain types of) topological spaces we'd be giving a construction for the isomorphism, and our proof would show the underlying function had an inverse, but we'd be failing to show it's continuous. It looks like writing Haskell code like this only tells us about a particularly limited type of category.

But not so. Type cabal install pointfree to install pointfree and then run pointfree 'iso ((x,y),z) = (x,(y,z))' and it responds with

> iso = uncurry (uncurry ((. (,)) . (.) . (,)))

pointfree rewrites a function in point-free style. There are no x's, y's or z's in the written version, only uncurry, composition (.), and the product function (,). These exist in all Cartesian closed categories (CCC). So our original function definition, despite apparently referring to elements, can be mechanically turned into a definition valid for any CCC. We can now reinterpret the meaning of x, y and z in the original definition as not referring to elements at all, but as labels indicating how a bunch of fairly general categorically defined primitives are to be assembled together.
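It's easy to convince yourself the two forms agree (a quick check of my own):

```haskell
iso :: ((a, b), c) -> (a, (b, c))
iso ((x, y), z) = (x, (y, z))

-- The pointfree tool's output, verbatim:
iso2 :: ((a, b), c) -> (a, (b, c))
iso2 = uncurry (uncurry ((. (,)) . (.) . (,)))

main :: IO ()
main = print (iso ((1 :: Int, 'a'), True) == iso2 ((1, 'a'), True))  -- True
```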

(Incidentally, my first foray into pure functional programming was to write a SASL compiler. It was little more than a bunch of rewrite rules to convert SASL code into point-free compositions of S, K and I, among other combinators.)

What we have here is an example of an internal language at work. I'm not sure what a precise definition of "internal language" is, but it's something like this: take a formal system and find a way to translate it to talk about categories in such a way that true propositions in one are turned into true propositions in the other. The formal system now becomes an internal language for those categories.

The best known example is topos theory. A topos is a category that has a bunch of properties that make it a bit like Set. We take a subset of the language of set theory that makes use of just these properties. Our propositions that look like set theory can now be mechanically translated into statements valid of all toposes. This means we can happily write lots of arguments referring to elements of objects in a topos and get correct results.

In their book Introduction to Higher-Order Categorical Logic, Lambek and Scott showed that "pure typed λ-calculus" is the internal language of CCCs. Even though expressions in the λ-calculus contain named variables these can always be eliminated and replaced by point-free forms. Theorems about typed λ-calculus are actually theorems about CCCs. When we write Haskell code with 'points' in it, we don't need to interpret these literally.

So despite not knowing which category Haskell lives in, much of the code I've written in these web pages talks about a wide variety of categories because Haskell is essentially a programming language based on an internal language (or a bunch of them). Despite the fact that even a function like iso might have quite complex semantics when run on a real computer, the uninterpreted programs themselves often represent completely rigorous, and quite general pieces of category theory.

So the question to ask isn't "what category does Haskell live in?" but "what class of category corresponds to the internal language in which I wrote this bit of code?". I partly answer this question for do-notation (a little internal language of its own) in an earlier post. Haskell (and various subsets and extensions) is essentially a way to give semantics to internal languages for various classes of category. However complicated and messy those semantics might get on a real world computer, the language itself is a thing of beauty and more general than might appear at first.

BTW This trick of reinterpreting what look like variables as something else was used by Roger Penrose in his abstract index notation. Just as we can sanitise variables by reinterpreting them as specification for plumbing in some category, Penrose reinterpreted what were originally indices into arrays of numbers as plumbing in another category. Actually, this isn't just an analogy. With a little reformatting abstract index notation is very close to the way I've been using monads to work with vector spaces so that abstract index notation can be viewed as a special case of an internal language for categories with monads.



Saturday, October 10, 2009

Vectors, Invariance, and Math APIs

Many software libraries, especially those for physics or 3D graphics, are equipped with tools for working with vectors. I'd like to point out how in these libraries the functions for manipulating vectors sometimes have special and useful properties that make it worthwhile to separate them out into their own interface.

Two types of Function


Suppose an object of mass m is moving with velocity v and we apply force f to it for time t. What is the final velocity? This is given by g:

g(m,t,f,v) = v+(t/m)*f

Now suppose that R is a rotation operation, typically represented by a matrix. What happens if we apply it to both of the vector arguments of g?

g(m,t,Rf,Rv) = Rv+(t/m)*Rf = Rg(m,t,f,v)

In other words, rotating the vector arguments is the same as rotating the vector result.

Another example: Consider the function that gives the force on an electric charge as a function of its velocity and the magnetic field:

f(e,v,B) = ev×B

It's essentially just the cross product. If you rotate both of the arguments to the cross product then the result is rotated too. The result is that

f(e,Rv,RB) = Rf(e,v,B)


On the other hand, many 3D APIs come with a function to perform componentwise multiplication of vectors. Writing vectors x as triples (x0,x1,x2), and so on, we can write such a function as:

f(x,y) = (x0y0,x1y1,x2y2)

You can show that this doesn't have a similar property.
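Concretely, you can watch it fail for a single 90 degree rotation about the z axis (my own illustration; rotZ and mul are ad hoc names):

```haskell
type V3 = (Double, Double, Double)

-- Rotation by 90 degrees about the z-axis.
rotZ :: V3 -> V3
rotZ (x, y, z) = (-y, x, z)

-- Componentwise multiplication.
mul :: V3 -> V3 -> V3
mul (x0, x1, x2) (y0, y1, y2) = (x0 * y0, x1 * y1, x2 * y2)

main :: IO ()
main = do
  let x = (1, 2, 3) :: V3
      y = (4, 5, 6) :: V3
  print (mul (rotZ x) (rotZ y))  -- (10.0,4.0,18.0)
  print (rotZ (mul x y))         -- (-10.0,4.0,18.0): not the same
```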

Rotational Invariance


To make things easy, let's restrict ourselves to functions of scalars and vectors. And when I say vector, I'm talking strictly about vectors representing magnitude and direction, but not positions. Examples of such vectors are velocities, accelerations, angular velocities, magnetic fields, and the difference between two positions. A function is said to be rotationally invariant if applying a rotation R to all of its vector arguments results in the same thing as applying R to all of the vectors in its value. This allows you to have a function that returns multiple vectors, like a tuple or array.

The first two functions I described above were rotationally invariant but the third wasn't. Notice how the first two examples also described physical processes. This is the important point: as far as we know, all of the laws of physics are rotationally invariant. If you write down an equation describing a physical process then replacing all of the vectors in it by their rotated counterparts must also result in a valid equation. So if you're describing a physical process with a computer program, and you end up with a function that isn't rotationally invariant, you've made a mistake somewhere.

Vector APIs


Vector APIs frequently come with all manner of functions. Some have the invariance property and some don't. If you write code that you'd like to be rotationally invariant, but it turns out that it isn't, you usually have to examine the code to find the bug. But if you separate the invariant functions into their own interface, and then write code using just that interface, the code is guaranteed to be invariant. If your programming language has reasonably strict types then you may even be able to arrange things so that the type signature of the function alone is enough to tell you that the function is invariant. In effect you are able to make the compiler prove that your function is invariant.

(As an aside, this is an example of why a good type system does much more than you might at first have guessed. They don't just stop you making typos, they can do things like prove that your programs satisfy certain geometrical properties.)

So what functions would you have in such an API? Among the essential rotationally invariant functions are:

1. Multiplication of a vector by a scalar
2. Addition of vectors
3. Dot product
4. Cross product

In terms of these you can build functions such as

1. Vector length
2. Vector normalization
3. Rotation of one vector around an axis specified by another vector
4. Linear interpolation between vectors

What kinds of functions would be excluded?

1. Constructing a vector from three scalars, ie. f(x,y,z) = (x,y,z).
2. Constructing a vector from a single scalar, ie. f(x) = (x,x,x).
3. Extracting the ith component of a vector, ie. f(i,(x0,x1,x2)) = xi.
4. Pointwise multiplication of vectors.
5. Computing the elementwise cosine of a vector.

On seeing the first excluded example above you might ask "how am I supposed to construct vectors?" The point is that you don't program exclusively with an invariant API, you simply use it whenever you need to prove invariance.

Coordinate System Invariance


One purpose of writing to a particular interface is that it allows the API to hide implementation details from the user. Using a rotationally invariant API has a role to serve here. For example, many 3D renderers allow you to write shaders. These are essentially functions that compute the colour of a piece of geometry that needs rendering. You write a shader and the renderer then calls your shader as needed when a fragment of geometry passes through its pipeline. Frequently these are used for lighting calculations but there are all kinds of other things that may be computed in shaders.

In a 3D renderer different parts of the computation are often performed in different coordinate systems. For example it may be convenient to perform lighting calculations in a coordinate system oriented with the direction of the light. But the author of a renderer doesn't want to be committed to a particular choice. In order to do this, it is essential to be able to write shaders that are agnostic about which coordinate system is being used. If we work with rotationally invariant functions, our shaders are guaranteed to be agnostic in this way (assuming that the only kind of coordinate change that takes place is a rotation).

Lots More Types


I've concentrated on just one type of invariance, rotational invariance. If we consider more types of invariance then more types of interface naturally emerge. It would take too long to cover all of the details here so I'm just going to briefly sketch the beginnings of the more general case. So just read this section as a list of pointers to further reading.

For example, some functions are invariant under translations. These can be thought of as functions of points in space. If we allow more general linear transformations then we find that some common functions transform 'oppositely' to vectors. In particular, normals to surfaces transform in this way. In fact, Pixar's Renderman has three distinct types, vectors, points and normals, that capture these different invariances.

If we go back to rotations again but now extend these by allowing reflections then we find an interesting new phenomenon. For example, consider the result of reflecting in the x-y-plane, followed by reflecting in the y-z-plane followed by reflecting in the x-z-plane. This simply multiplies vectors by -1. Dot product is invariant under this: (-x)·(-y)=x·y. But cross product isn't because (-x)×(-y)=x×y. Even though the cross product is apparently a vector, it doesn't get multiplied by -1. When we start considering invariance under reflection we find that some vectors behave differently. These are the pseudovectors and in effect they have their own separate type and interface. Interestingly, nature likes to keep pseudovectors and vectors separate except in parity violating phenomena. There are even pseudoscalars.

Incidentally, if you consider invariance under scaling you're led to the idea of encoding dimensions in your types.

Conclusion


If you're writing a vector API think about the invariance properties that your functions may have. If any are useful then it may be worth placing those in a separate interface. The more distinct types you have, the more properties you can make your compiler prove. Obviously this needs to be balanced against practicality, complexity for users and what you actually need. To some extent, many existing APIs make some of these distinctions with varying degrees of strictness. The main point I want to make clear is that these distinctions are based on invariance properties, something that not all developers of such APIs are aware of.

At some point I hope to return to this topic and enumerate all of the common vector-like types in a single framework. Unfortunately it's a big topic and I've only been able to scratch the surface here. In particular there are some subtle interplays between dimensions and types.

On a deeper level, I think there must be some type theoretical framework in which these invariance properties are free theorems.

Update: I believe some of this material is covered in Jim Blinn's Vectors and Geometry and Objects, Oh My!, but I don't have access to that. I suspect that there is one big difference in my presentation: I'm not so interested here in vectors (or normals or whatever) in themselves but as defining interfaces to functions with invariance properties. Like the way category theorists shift the focus from objects to arrows. It makes a difference because it immediately gives theorems that our code is guaranteed to satisfy. It's the invariance property of the cross product (say) that is useful here, not the fact that the components of a vector transform a certain way when we change coordinates (because I might not even want to refer to coordinates in my code).

Example Code



To show that I'm talking about something very simple, but still powerful, here's some Haskell code:


> data Triple = T Float Float Float deriving Show

> class Vector v where
>   (.+) :: v -> v -> v
>   (.*) :: Float -> v -> v
>   dot :: v -> v -> Float
>   cross :: v -> v -> v

> instance Vector Triple where
>   T x y z .+ T u v w = T (x+u) (y+v) (z+w)
>   a .* T x y z = T (a*x) (a*y) (a*z)
>   dot (T x y z) (T u v w) = x*u+y*v+z*w
>   cross (T x y z) (T u v w) = T
>     (y*w-v*z)
>     (z*u-x*w)
>     (x*v-y*u)


You can freely apply the four primitive functions to elements of type Triple but if you have a function of, say, signature


> f :: Vector v => (v,v,Float) -> [(v,v)]


you are guaranteed it is invariant.
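For example, here are a few of the derived functions from the earlier list, written against the interface alone so they are invariant by construction (this repeats the class and instance above, with an extra Eq instance; vlength, normalize and lerp are my own names):

```haskell
data Triple = T Float Float Float deriving (Show, Eq)

class Vector v where
  (.+) :: v -> v -> v
  (.*) :: Float -> v -> v
  dot :: v -> v -> Float
  cross :: v -> v -> v

instance Vector Triple where
  T x y z .+ T u v w = T (x + u) (y + v) (z + w)
  a .* T x y z = T (a * x) (a * y) (a * z)
  dot (T x y z) (T u v w) = x * u + y * v + z * w
  cross (T x y z) (T u v w) = T (y * w - v * z) (z * u - x * w) (x * v - y * u)

-- Length, normalization and interpolation use only the primitives,
-- so rotating their arguments rotates their results.
vlength :: Vector v => v -> Float
vlength v = sqrt (v `dot` v)

normalize :: Vector v => v -> v
normalize v = (1 / vlength v) .* v

lerp :: Vector v => Float -> v -> v -> v
lerp t a b = ((1 - t) .* a) .+ (t .* b)

main :: IO ()
main = print (vlength (T 3 4 0), normalize (T 3 4 0))
```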

Tuesday, September 29, 2009

test, ignore


\int_{0}^{1}\frac{x^{4}\left(1-x\right)^{4}}{1+x^{2}}dx
=\frac{22}{7}-\pi


You know, I think I won't delete this. It's a celebration of my having embedded some TeX using the advice at Bot Cyborg.

Saturday, September 26, 2009

Finite Differences of Types

Finite Differences of Real-Valued Functions


Conor McBride's discovery that you can differentiate container types to get useful constructions like zippers has to be one of the most amazing things I've seen in computer science. But seeing the success of differentiation suggests the idea of taking a step back and looking at finite differences.

Forget about types for the moment and consider functions on a field R. Given a function f:R→R we can define Δf:R×R→R by


Δf(x,y) = (f(x)-f(y))/(x-y)


Δ is the finite difference operator. But does it make any kind of sense for types? At first it seems not because we can't define subtraction and division of types. Can we massage this definition into a form that uses only addition and multiplication?

First consider Δc where c is a constant function. Then Δc(x,y)=0.

Now consider the identity function i(x)=x. Then Δi(x,y)=1.

Δ is linear in the sense that if f and g are functions, Δ(f+g) = Δf+Δg.

Now consider the product of two functions, f and g.


Δ(fg)(x,y) = (f(x)g(x)-f(y)g(y))/(x-y)

= (f(x)g(x)-f(x)g(y)+f(x)g(y)-f(y)g(y))/(x-y)

= f(x)Δg(x,y)+g(y)Δf(x,y)


So now we have a Leibniz-like rule. We can compute finite differences of polynomials without using subtraction or division! What's more, we can use these formulae to difference algebraic functions defined implicitly by polynomials. For example consider f(x)=1/(1-x). We can rewrite this implicitly, using only addition and multiplication, as

f(x) = 1+x f(x)

Differencing both sides we get

Δf(x,y) = x Δf(x,y)+f(y)

That tells us that Δf(x,y) = f(x)f(y).
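We can spot-check that identity numerically over the reals (a quick check of my own):

```haskell
-- For f(x) = 1/(1-x), the finite difference (f(x)-f(y))/(x-y)
-- should equal f(x)*f(y).
f :: Double -> Double
f x = 1 / (1 - x)

delta :: (Double -> Double) -> Double -> Double -> Double
delta g x y = (g x - g y) / (x - y)

main :: IO ()
main = do
  print (delta f 0.25 0.5)  -- both values should be 8/3, about 2.6667
  print (f 0.25 * f 0.5)
```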

Finite Differences of Types


We're now ready to apply our operator to types. Instead of functions on the reals we work with functors on the set of types. A good first example container is the functor F(X)=X^N for an integer N. This is basically just an array of N elements of type X. We could apply the Leibniz rule repeatedly, but we expect to get the same result as if we'd worked over the reals. So setting f(x)=x^N we get


Δf(x,y) = (x^N-y^N)/(x-y) = x^(N-1)+x^(N-2)y+x^(N-3)y^2+...+y^(N-1)

So we know that on types, ΔF(X,Y) = X^(N-1)+X^(N-2)Y+...+Y^(N-1).

There's a straightforward interpretation we can give this. Differentiating a type makes a hole in it. Finite differencing also makes a hole in it, but everything to the left of the hole is of one type and everything to the right is of another. For example, for F(X)=X^3, ΔF(X,Y)=X^2+XY+Y^2 can be drawn as:
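For this concrete case we can transcribe the dissection type directly into Haskell. This is a standalone sketch, not part of the literate source; the type and constructor names are made up for illustration, with one constructor per position the hole can occupy:

```haskell
-- ΔF(X,Y) = X^2 + XY + Y^2 for F(X) = X^3: one constructor for each
-- position of the hole in a 3-tuple. Elements left of the hole have
-- type x, elements right of the hole have type y.
data Dissect3 x y
  = Hole0 y y   -- hole at position 0: both remaining elements lie to the right
  | Hole1 x y   -- hole at position 1: one element on each side
  | Hole2 x x   -- hole at position 2: both remaining elements lie to the left
  deriving Show

main :: IO ()
main = print (Hole1 'a' 'b' :: Dissect3 Char Char)
```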

If you've been reading the right papers then at this point it should all become familiar. Finite differencing is none other than dissection, as described by Conor in his Jokers and Clowns paper. I don't know if he was aware that he was talking about finite differences - the paper itself talks about this being a type of derivative. It's sort of implicit when he writes the isomorphism:

right :: p j + (Δp c j , c) → (j , Δp c j ) + p c

With a little rearrangement this becomes the definition of finite difference.

Now that we've recognised dissection as finite difference we can reason informally about dissection using high school algebra. For example, we already know that lists, defined by L(X) = 1+X L(X) can be informally thought of as L(X)=1/(1-X). So using the example I gave above we see that ΔL(X,Y)=1/((1-X)(1-Y)) = L(X)L(Y). So the dissection of a list is a pair of lists, one for the left elements, and one for the right elements. Just what we'd expect.
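A standalone illustration of that claim (not part of the literate source): dissecting a list at each element in turn leaves a pair of lists, one on each side of the hole, matching ΔL(X,Y)=L(X)L(Y):

```haskell
-- All dissections of a list: removing the element at each position
-- leaves (elements to the left of the hole, elements to the right).
dissections :: [a] -> [([a], [a])]
dissections xs = [ (take i xs, drop (i + 1) xs) | i <- [0 .. length xs - 1] ]

main :: IO ()
main = print (dissections "abc")
```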

Another example. Consider the trees defined by T(X)=X+T(X)^2. Informally we can interpret this as T(X)=(1-√(1-4X))/2. A little algebraic manipulation, using (√x-√y)(√x+√y) = x-y, shows that

ΔT(X,Y) = 1/(1-(T(X)+T(Y)))


In other words, a dissection of a tree is a list of trees, each of which is a tree of X or a tree of Y. This corresponds to the fact that if you dissect a tree at some element, and then follow the path from the root to the hole left behind, then all of the left branches (in blue) are trees of type X and all of the right branches (in red) are trees of type Y.

If you're geometrically inclined then you can think of types with holes in them as being a kind of tangent to the space of types. Along those lines, dissections become secants. I think this geometric analogy can be taken a lot further and that in fact a non-trivial piece of differential geometry can be made to work with types. But that's for another day.

Oh, I almost forgot. Derivatives are what you get when you compute finite differences for points really close to each other. So I hope you can see that Δf(x,x)=df/dx giving us holes in terms of dissections. Conor mentions this in his paper.

We should also be able to use this approach to compute finite differences in other algebraic structures that don't have subtraction or division.

I can leave you with some exercises:

1. What does finite differencing mean when applied to both ordinary and exponential generating functions?

2. Can you derive the "chain rule" for finite differences? This can be useful when you compute dissections of types defined by sets of mutually recursive definitions.

3. Why is right, defined above, a massaged version of the definition of finite difference? (Hint: define d=(f(x)-f(y))/(x-y). In this equation, eliminate the division by a suitable multiplication and eliminate the subtraction by a suitable addition. And remember that (,) is Haskell notation for the product of types.)

Sunday, September 13, 2009

More Parsing With Best First Search


> {-# LANGUAGE NoMonomorphismRestriction,GeneralizedNewtypeDeriving #-}


I have three goals in this post:

1. Refactoring the technique in my previous post so that building the search tree is entirely separate from searching the tree.
2. Making it work with real-valued weights, not just integers
3. Applying it to an ambiguous parsing problem, making use of a type class to define an abstract grammar.


> import Control.Arrow
> import Control.Monad
> import Control.Monad.Instances
> import Control.Monad.State
> import Data.Either
> import Data.Function
> import Random
> import qualified Data.List as L
> import qualified Data.Map as M


Search Trees


The idea is that I want to search a tree of possibilities where each edge of the tree is marked with a weight. The goal will be to search for leaves that minimise the sum of the weights of the edges down to the leaf.

Here's an example tree:

The minimum weight leaf is at C. If we're working with probabilities then we'll use minus the log of the probability of a branch as the weight. That way multiplication of probabilities becomes additions of weights, and the likeliest leaf has the minimum weight path.
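Here's a tiny standalone check of that weight convention, using the branch probabilities from the example tree:

```haskell
-- With weight = -log probability, adding weights along a path
-- multiplies probabilities, so the likeliest leaf has minimum weight.
main :: IO ()
main = do
  let w p = negate (log p) :: Double
      pathC = w 0.2 + w 0.6   -- path to C, probability 0.12
      pathA = w 0.1 + w 0.5   -- path to A, probability 0.05
  print (pathC < pathA)       -- the likelier path has the smaller weight
```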

So here's the definition of a search tree. I've given both leaves and edges weights:


> data Search c a = Leaf { lb::c, leaf::a }
>                 | Choice { lb::c, choices::[Search c a] } deriving Show


(Compare with this.) lb is short for 'lower bound'. It provides a lower bound for the total weight of any option in this subtree (assuming non-negative weights). The tree in the diagram would look like:


> ex1 = Choice 0 [
>         Choice (-log 0.1) [
>           Leaf (-log 0.5) 'A',
>           Leaf (-log 0.5) 'B'],
>         Choice (-log 0.2) [
>           Leaf (-log 0.6) 'C',
>           Leaf (-log 0.4) 'D']]


This tree is a container in a straightforward way and so we can make it an instance of Functor:


> instance Functor (Search c) where
>     fmap f (Leaf c a)    = Leaf c $ f a
>     fmap f (Choice c as) = Choice c $ map (fmap f) as


But it's also a monad. >>= maps all of the elements of a tree to trees in their own right, and then grafts those trees into the parent tree:


> instance Num c => Monad (Search c) where
>     return = Leaf 0
>     a >>= f = join $ fmap f a where
>         join (Leaf c a)    = Choice c [a]
>         join (Choice c as) = Choice c $ map join as


It's easy to make trees into a MonadPlus by simply grafting trees into a new root. MonadPlus is meant to be a monoid, but this operation, as written below, isn't precisely associative. But it's 'morally' associative in that two terms that are meant to be equal describe equivalent search trees. So I'm not going to lose any sleep over it:


> instance Num c => MonadPlus (Search c) where
>     mzero = Choice 0 []
>     a `mplus` b = Choice 0 [a,b]


For our searching we'll need a priority queue. I'll use a skew tree with code I lifted from somewhere I've forgotten:


> data Ord a => Tree a = Null | Fork a (Tree a) (Tree a) deriving Show

> isEmpty :: Ord a => Tree a -> Bool
> isEmpty Null = True
> isEmpty (Fork x a b) = False

> minElem :: Ord a => Tree a -> a
> minElem (Fork x a b) = x

> deleteMin :: Ord a => Tree a -> Tree a
> deleteMin (Fork x a b) = merge a b

> insert :: Ord a => a -> Tree a -> Tree a
> insert x a = merge (Fork x Null Null) a

> merge :: Ord a => Tree a -> Tree a -> Tree a
> merge a Null = a
> merge Null b = b
> merge a b
>     | minElem a <= minElem b = connect a b
>     | otherwise              = connect b a

> connect (Fork x a b) c = Fork x b (merge a c)
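As a standalone check that the skew heap behaves as a priority queue (the definitions above are repeated here so the snippet compiles on its own, outside the literate source), draining the heap should yield its elements in ascending order:

```haskell
-- The skew heap from the text, plus a 'drain' that pops minima until empty.
data Tree a = Null | Fork a (Tree a) (Tree a)

minElem :: Ord a => Tree a -> a
minElem (Fork x _ _) = x

deleteMin :: Ord a => Tree a -> Tree a
deleteMin (Fork _ a b) = merge a b

insert :: Ord a => a -> Tree a -> Tree a
insert x = merge (Fork x Null Null)

merge :: Ord a => Tree a -> Tree a -> Tree a
merge a Null = a
merge Null b = b
merge a b
  | minElem a <= minElem b = connect a b
  | otherwise              = connect b a
  where connect (Fork x l r) c = Fork x r (merge l c)

-- Repeatedly remove the minimum: a heap sort in miniature.
drain :: Ord a => Tree a -> [a]
drain Null = []
drain t    = minElem t : drain (deleteMin t)

main :: IO ()
main = print (drain (foldr insert Null [5, 1, 4, 2, 3 :: Int]))
```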


At each stage in the search we'll pick the unexplored branch with the lowest total weight so far. So when we compare trees we'll compare on their lower bounds. So we need an ordering on the trees as follows:


> instance (Num c) => Eq (Search c a) where
>     (==) = (==) `on` lb

> instance (Num c,Ord c) => Ord (Search c a) where
>     compare = compare `on` lb


The real cost of a choice isn't just the weight immediately visible in the tree but the cost of the journey you took to get there. We use the bumpUp function to put that extra cost into the part of the tree we're currently looking at:


> bumpUp delta (Leaf c a) = Leaf (delta+c) a
> bumpUp delta (Choice c as) = Choice (delta+c) as


The only tricky aspect to this code is that we need to be able to handle infinite trees. We can't have our code simply go off and return when it's found the next match because it might not be possible to do so in a finite time. Instead, the code needs to perform one operation at a time and report what it found at each point, even if that report is just stalling for time. We do this by returning a (possibly infinite) list containing elements that are either (1) the next item found or (2) a new update giving more information about the lower bound of the cost of any item that might be yet to come. This allows the caller to bail out of the search once the cost has passed a certain threshold.

(Returning a useless looking constructor to stall for time is a common design pattern in Haskell. It's an example of how programs that work with codata need to keep being productive, and you get something similar with Skip in Stream Fusion. The first time I wrote the code I failed to do this and kept wondering why my infinite searches would just hang, despite my great efforts to make everything as lazy as possible.)
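Here's a small standalone sketch of how a caller might consume such a stream, bailing out once a lower-bound report exceeds a cost threshold. The results list is a hand-written stand-in for runSearch output, not generated by the code above:

```haskell
-- A stand-in for a runSearch result stream: Left = lower-bound update,
-- Right = a found item.
results :: [Either Double Char]
results = [Left 0.0, Left 1.2, Right 'A', Left 2.5, Right 'B', Left 9.9]

-- Collect items until the reported lower bound passes a threshold.
-- foldr with a lazy cut-off works even on infinite streams.
belowCost :: Double -> [Either Double a] -> [a]
belowCost limit = foldr step [] where
  step (Left c)  rest = if c > limit then [] else rest
  step (Right a) rest = a : rest

main :: IO ()
main = putStrLn (belowCost 3.0 results)
```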


> runSearch :: (Num c,Ord c) => Tree (Search c a) -> [Either c a]
> runSearch Null = []
> runSearch queue = let
>         m      = minElem queue
>         queue' = deleteMin queue
>     in case m of
>         Leaf c a    -> Left c : Right a : runSearch queue'
>         Choice c as -> Left c : (runSearch $ foldl (flip insert) queue' $ map (bumpUp c) as)


A quick test of an infinite search: finding Pythagorean triples by brute force. We give each integer a cost one greater than the previous one's:

I guess this is actually Dijkstra's algorithm, but on a tree rather than a general graph.


> integers m = Choice 1 [Leaf 0 m,integers (m+1)]

> test = do
>     a <- integers 1
>     b <- integers 1
>     c <- integers 1
>     guard $ a*a+b*b==c*c
>     return (a,b,c)

> test1 = runSearch (insert test Null)


If you run test1 you'll notice how the output is noisy because of all those Left w terms. If you're not worried about non-termination you can just throw out redundant output like so:


> reduce [] = []
> reduce (Left a : Left b : bs) = reduce (Left b : bs)


Might as well convert weights to probabilities while we're at it:


> reduce (Left a : bs) = Left (exp (-a)) : reduce bs
> reduce (Right a : bs) = Right a : reduce bs


This version should be a lot less chatty:


> test2 = reduce test1


Grammar


Now that searching works I can turn to an application - a more sophisticated example of what I briefly looked at previously: parsing with ambiguous grammars. So let me first build types to represent parsed sentences in a toy grammar:


> data Noun = Noun String deriving (Show,Eq,Ord)
> data Verb = Verb String deriving (Show,Eq,Ord)
> data Adj = Adj String deriving (Show,Eq,Ord)
> data Prep = Prep String deriving (Show,Eq,Ord)


The following two are noun phrase and prepositional phrase:


> data NP = NP [Adj] Noun deriving (Show,Eq,Ord)
> data PP = PP Prep Noun deriving (Show,Eq,Ord)


And entire sentences:


> data Sentence = Sentence [NP] Verb [NP] [PP] deriving (Show,Eq,Ord)


We want to be able to print parsed sentences so here's a quick 'unparse' type class to recover the underlying string:


> class UnParse a where
>     unParse :: a -> String

> instance UnParse Noun where
>     unParse (Noun a) = a

> instance UnParse Verb where
>     unParse (Verb a) = a

> instance UnParse Adj where
>     unParse (Adj a) = a

> instance UnParse Prep where
>     unParse (Prep a) = a

> instance UnParse NP where
>     unParse (NP a b) = concatMap unParse a ++ unParse b

> instance UnParse PP where
>     unParse (PP a b) = unParse a ++ unParse b

> instance UnParse Sentence where
>     unParse (Sentence a b c d) = concatMap unParse a ++ unParse b ++
>                                  concatMap unParse c ++ concatMap unParse d


Now I'm going to approach the problem of parsing ambiguous sentences in two ways. One will be efficient, and one will be inefficient but represent the 'ground truth' against which we'll compare. (This reflects standard practice in graphics publications where authors compare their fancy new algorithm with an ultra-slow but reliable Monte Carlo ray-tracer.)

I'm going to assume that sentences in my language are described by a "context free" probability distribution so that a noun phrase, say, has a fixed probability of being made up of each possible combination of constituents regardless of the context in which it appears.

I need an English word for something that takes a grammar and does something with it but I'm at a loss to think of an example. I'll use 'transducer', even though I don't think that's right.

So a transducer is built either from terminal nodes accepting a single character, or from a choice among transducers, each taken with a given probability:


> class Transducer t where
>     char   :: Char -> t Char
>     choose :: [(Float,t a)] -> t a


And here's our toy grammar. It's nothing like an actual natural language because real grammars take a long time to get right. Note I'm just giving the first couple of type signatures to show that the grammar uses only the Monad and Transducer interfaces:


> string :: (Monad t, Transducer t) => [Char] -> t [Char]
> string "" = return ""
> string (c:cs) = do {char c; string cs; return (c:cs)}


So, for example, a noun has a 50% chance of being the string ab and a 50% chance of being the string ba:


> noun :: (Monad t, Transducer t) => t Noun
> noun = do
>     a <- choose [(0.5,string "ab"),(0.5,string "ba")]
>     return $ Noun a

> verb :: (Monad t, Transducer t) => t Verb
> verb = do
>     a <- choose [(0.5,string "aa"),(0.5,string "b")]
>     return $ Verb a

> adjective :: (Monad t, Transducer t) => t Adj
> adjective = do
>     a <- choose [(0.5,string "ab"),(0.5,string "aa")]
>     return $ Adj a

> parsePrep = do
>     a <- choose [(0.5,string "a"),(0.5,string "b")]
>     return $ Prep a


Some of our "parts of speech" allow sequences of terms. We need some kind of probabilistic model of how many such terms we can expect. I'm going to assume the probability falls off exponentially with the number of items:


> many :: (Monad t, Transducer t) => Float -> t a -> t [a]
> many p t = choose [
>     (p, return []),
>     (1-p, do
>         a  <- t
>         as <- many p t
>         return $ a:as)]
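A quick standalone check of that claim: under many p, a sequence of length n occurs with probability (1-p)^n·p, a geometric distribution, so the probability mass really does fall off exponentially with length:

```haskell
-- Length distribution of 'many p': P(length = n) = (1-p)^n * p.
main :: IO ()
main = do
  let p = 0.5 :: Double
      probs = [ (1 - p) ^ n * p | n <- [0 .. 9 :: Int] ]
  -- the first ten lengths already carry almost all the probability mass
  print (sum probs > 0.999)
```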


I also have a convenience function for sequences of length at least 1:


> many1 p t = do
>     a  <- t
>     as <- many p t
>     return (a:as)


And now the rest of the grammar:


> parseNP = do
>     a <- many 0.5 adjective
>     b <- noun
>     return $ NP a b

> parsePP = do
>     a <- parsePrep
>     b <- noun
>     return $ PP a b

> sentence = do
>     a <- many 0.5 parseNP
>     b <- verb
>     c <- many 0.5 parseNP
>     d <- many 0.5 parsePP
>     return $ Sentence a b c d


We're going to use this grammar with two instances of the Transducer class. The first will use the rules of the grammar as production rules to generate random sentences. The second will parse strings using the grammar. So we get two uses from one 'transducer'. This is pretty powerful: we have described the grammar in an abstract way that doesn't assume any particular use for it.


> newtype Generator a = Generator { unGen :: State StdGen a } deriving Monad
> newtype Parser a = Parser { runParse :: (String -> Search Float (a,String)) }


Let's implement the generation first:


> instance Transducer Generator where
>     char a = return a
>     choose p = do
>         r <- Generator (State random)
>         case (L.find ((>=r) . fst) $ zip (scanl1 (+) (map fst p)) (map snd p)) of
>             Just opt -> snd opt


We can test it by generating a bunch of random sentences:


> gen = mkStdGen 12343210
> generate n partOfSpeech = (unGen $ sequence (replicate n partOfSpeech)) `evalState` gen

> test3 = mapM_ print $ generate 10 sentence


We can now use generate-and-test to estimate what proportion of randomly generated sentences match a given sentence:


> generateAndTest n partOfSpeech chars = do
>     a <- generate n partOfSpeech
>     guard $ unParse a == chars
>     return a

> collectResults n partOfSpeech chars = M.fromListWith (+) $ map (flip (,) 1) $
>                                       generateAndTest n partOfSpeech chars
> countResults n partOfSpeech chars = mapM_ print $ L.sortBy (flip compare `on` snd) $
>                                     M.toList $ collectResults n partOfSpeech chars

> test4 = countResults 100000 (noun :: Generator Noun) "abab"


On the other hand we can build a parser, based on Hutton's, just like in my previous post except using this new tree search monad:


> instance Monad Parser where
>     return a = Parser (\cs -> return (a,cs))
>     p >>= f  = Parser (\cs -> do
>         (a,cs') <- runParse p cs
>         runParse (f a) cs')

> instance MonadPlus Parser where
>     mzero = Parser (\cs -> mzero)
>     p `mplus` q = Parser (\cs -> runParse p cs `mplus` runParse q cs)

> instance Transducer Parser where
>     char c = Parser char' where
>         char' ""     = mzero
>         char' (a:as) = if a==c then return (a,as) else mzero
>     choose xs = foldl1 mplus $ map (\(q,x) -> prob q >> x) xs where
>         prob q = Parser (\cs -> Leaf (-log q) ((),cs))

> goParse (Parser f) x = runSearch $ insert (f x) Null

> end = Parser (\cs -> if cs=="" then return ((),"") else mzero)

> withEnd g = do
>     a <- g
>     end
>     return a

> normalise results = let total = last (lefts results)
>                     in map (\x -> case x of
>                                Left a  -> Left (a / total)
>                                Right b -> Right b) results

> findParse chars = mapM_ print $ reduce $ runSearch $
>                   insert (runParse (withEnd sentence) chars) Null


Results


And now we can try running both methods on the same string:


> main = do
>     let string = "ababbbab"
>     findParse string
>     print "-------------------"
>     countResults 1000000 (sentence :: Generator Sentence) string


You should see the parsings from countResults in roughly the same proportion as the relative probabilities given by findParse. Remember that the relative probability of a given parsing is the last Left p term before that parsing. Try playing with string, the number of Monte Carlo runs and the seed. Remember that there is going to be some variation in the randomised algorithm, especially with hard to parse strings, but raising the number of runs will eventually give reasonable numbers. Of course ultimately we don't care about the Monte Carlo method so it's allowed to be slow.

Anyway, none of this is a new algorithm. You can find similar things in papers such as Probabilistic tree transducers and A Generalization of Dijkstra's Algorithm. But what is cool is how easily Haskell allows us to decouple the tree building part from the searching part. (And of course the tree is never fully built, it's built and destroyed lazily as needed.) All of the published algorithms have the parsing and searching hopelessly interleaved so it's hard to see what exactly is going on. Here the search algorithm doesn't need to know anything about grammars, or even that it is searching for parsings.

Semiring Parsing is also easy to implement this way.

BTW If you think my "ab" language is a bit too contrived, check out the last picture here for an example of some natural language that is in a similar spirit :-)

Blog Archive