Sunday, December 30, 2012

Shuffles, Bayes' theorem and continuations.

Introduction
Back in the 80s when I was a kid I came across a program for the BBC Micro that could tell what card you had picked from a deck of cards even though you'd buried your card within the deck wherever you wanted and had cut and shuffled the deck. I thought I'd try to implement a slightly more sophisticated version of the same trick that could handle multiple shuffles, and multiple types of shuffle.
The idea is that we prepare a deck of cards in some known sequence and have someone pick out a card and place it at the top of the deck. They perform some kind of randomisation procedure on the cards, eg. cut and shuffle it a couple of times, and then you get to look at the final sequence of cards. Can we tell which card was picked out?
Some probability theory
Let's formalise this a bit. Our decks will have n cards. There is a small number of initial states our deck can be in, corresponding to the known sequence with a single card moved to the top. Let's label these initial states s_1, ..., s_n. There is a (usually large) number of permutations that could be applied through shuffling. We'll label these π_1, ..., π_m. We'll try to arrange that this isn't simply the set of all permutations (though it's not necessarily a disaster if it is).
We want to figure out the initial state s_i given some final state f. In other words we want P(s_i|f).
We can use Bayes' theorem to get: P(s_i|f) = P(f|s_i)P(s_i)/P(f).
Now P(f|s_i) is the sum over all ways of starting with s_i and ending up with f. So P(f|s_i) = Σ_j P(π_j),
where the sum is over all j such that π_j(s_i) = f. I'm assuming that the shuffles are independent of the initial sequence of cards. This gives us an algorithm. We do a brute force simulation of every possible shuffle that we're considering applied to each possible initial state. After each shuffle we sum the corresponding probability for those shuffles that give our known final state f.
Each shuffle is going to be built up as the product of a sequence of building blocks with each block randomly selected based on what happened before. Let's call the blocks names like b_1, b_2, .... So if π = b_k...b_2b_1 then applying π means applying b_1 first, then b_2, and so on. As we work through the shuffle we will accumulate the probability. After the first block we have a probability of P(b_1). The probability after the second is P(b_2|b_1)P(b_1) and so on. At any point we'll call the probability accumulated so far the importance. I've borrowed that name from the world of rendering because this algorithm has a remarkable similarity to recursive ray-tracing.
Some computer science
I'd like to be able to chain a sequence of shuffles. But wait! There's a catch! Today's the day I finally want to get around to checking out the lambda expression support in C++. I've been putting this off for years. (I'm using gcc 4.7.) So I'm not going to have a Haskell non-determinism monad to make life easy.
Suppose I have two types of shuffle, type A and type B. I could easily write a loop to iterate over all shuffles of type A, and in the innermost part of the loop I could call another loop over all shuffles of type B. But then if I want to replace B with some type C I have to change the code to replace the inner part with code for C. That's no good. I'd like to be able to replace the innermost part of the outer loop with any code I want without actually editing that part of the code. It's easy with lambda expressions. I write the type A loop code so that it takes as argument a lambda function representing what I want done inside the loop.
There's another way of looking at this. You can skip this paragraph if you don't care about the connection to Haskell. But in Haskell you might do something like this by using a non-determinism monad, or even a probability monad. But as I pointed out a while back, you can fake every monad using the continuation monad. One way to implement continuations in C++ is to use continuation passing style. And that's what I'll do. The continuations are just the lambdas that I mentioned in the previous paragraph.
Some C++ code


> #include <iostream>
> #include <cstdlib>
> using namespace std;
You can bump up the deck size if you have the CPU power:
> const int deck_size = 13;
A deck of cards is represented by a simple array of integers with each card being assigned a unique integer.
> struct Deck {
>     int card[deck_size];
>     bool operator==(const Deck &other) const {
>         for (int i = 0; i < deck_size; ++i) {
>             if (card[i] != other.card[i]) {
>                 return false;
>             }
>         }
>         return true;
>     }
> };
The riffle shuffle works by splitting a deck into two piles and interleaving the parts onto a new destination deck. Here's a schematic diagram with the two piles coloured orange and blue:
The function riffle_helper helps loop through all possible riffles. I could assume that each card arriving at the destination is equally likely to come from the left pile or the right pile. But I observe that whenever I do a real riffle shuffle the cards seem to come in 'runs'. So if a card falls from the left pile then the next one is more likely to as well. That's just an empirical observation based on a small number of trials, you can tweak the probabilities yourself to fit reality better. (Oh, and I got this code upside-down compared to what people really do. I need to fix it when I have a moment...)

> enum Side {
>     LEFT,
>     RIGHT,
>     NO_SIDE
> };
This function shuffles together cards from the locations given by left_ptr and right_ptr in src_deck into dest_deck, eventually calling cont on each result. I use a template because I don't know the type of the lambda expression I'm passing in. (If I want to know its type I think I have to mess with decltype. It's all a bit weird.)
> template<class Cont>
> void riffle_helper(double importance, int split,
>                    int left_ptr, int right_ptr, int dest_ptr, Side oldside,
>                    const Deck &src_deck, Deck dest_deck, Cont cont) {
>     if (dest_ptr == deck_size) {
>         cont(importance, dest_deck);
>         return;
>     }
First I deal with the cases where one or other of the piles is empty so there's no choice about where the next card is coming from:
>     if (left_ptr >= split) {
>         dest_deck.card[dest_ptr] = src_deck.card[right_ptr];
>         riffle_helper(importance, split, left_ptr, right_ptr+1, dest_ptr+1, RIGHT, src_deck, dest_deck, cont);
>         return;
>     }
>     if (right_ptr >= deck_size) {
>         dest_deck.card[dest_ptr] = src_deck.card[left_ptr];
>         riffle_helper(importance, split, left_ptr+1, right_ptr, dest_ptr+1, LEFT, src_deck, dest_deck, cont);
>         return;
>     }
>     double p;
>     if (oldside == NO_SIDE) {
>         p = 0.5;
>     } else {
>         p = LEFT == oldside ? 0.75 : 0.25;
>     }
>     double new_importance = importance*p;
>     dest_deck.card[dest_ptr] = src_deck.card[left_ptr];
>     riffle_helper(new_importance, split, left_ptr+1, right_ptr, dest_ptr+1, LEFT, src_deck, dest_deck, cont);

>     if (oldside == NO_SIDE) {
>         p = 0.5;
>     } else {
>         p = RIGHT == oldside ? 0.75 : 0.25;
>     }
>     new_importance = importance*p;
>     dest_deck.card[dest_ptr] = src_deck.card[right_ptr];
>     riffle_helper(new_importance, split, left_ptr, right_ptr+1, dest_ptr+1, RIGHT, src_deck, dest_deck, cont);
> }
The function riffle iterates over all possible riffle shuffles of src_deck calling cont on each one. Note that I assume that when the deck is split into two before shuffling together, each pile has at least 3 cards. You may want to change that assumption.
> template<class Cont>
> void riffle(double importance, const Deck &src_deck, Cont cont) {
>     double new_importance = importance/(deck_size-5);
>     for (int split = 3; split < deck_size-2; ++split) {
>         riffle_helper(new_importance, split, 0, split, 0, NO_SIDE, src_deck, Deck(), cont);
>     }
> }
Iterate over all possible cuts of src_deck calling cont on each result. I assume the cut leaves at least 3 cards in each pile.
> template<class Cont>
> void cut(double importance, const Deck &src_deck, Cont cont) {
>     double new_importance = importance/(deck_size-5);
>     for (int split = 3; split < deck_size-2; ++split) {
>         Deck new_deck;
>         for (int i = 0; i < deck_size; ++i) {
>             if (i < deck_size-split) {
>                 new_deck.card[i] = src_deck.card[i+split];
>             } else {
>                 new_deck.card[i] = src_deck.card[i-(deck_size-split)];
>             }
>         }
>         cont(new_importance, new_deck);
>     }
> }
The function overhand_helper overhand-shuffles the remaining cards in src_deck to dest_deck. Here's an attempt to represent what an overhand shuffle does. It reverses the order of a deck that has been split into segments. The order within each segment is left unchanged.

> template<class Cont>
> void overhand_helper(double importance, const Deck &src_deck,
>                      int cards_left, Deck dest_deck, Cont cont) {
>     if (cards_left <= 0) {
>         cont(importance, dest_deck);
>     } else {
>         double new_importance = importance/cards_left;
>         for (int ncards = 1; ncards <= cards_left; ++ncards) {
>             //
>             // Take i cards from the source and place them at the bottom of the
>             // destination.
>             //
>             for (int j = 0; j < ncards; ++j) {
>                 dest_deck.card[cards_left-ncards+j] = src_deck.card[deck_size-cards_left+j];
>             }
>             overhand_helper(new_importance, src_deck, cards_left-ncards, dest_deck, cont);
>         }
>     }
> }
Iterate over all possible overhand shuffles of cards in src_deck calling cont on each result. In practice I often find overhand shuffles result in cards mysteriously jumping segments and messing up the algorithm, whereas poorly executed riffle shuffles still work fine. I'm also assuming that each time a pile of cards is transferred the size of the pile is chosen uniformly from the set of all possible segments at that stage.
> template<class Cont>
> void overhand(double importance, const Deck &src_deck, Cont cont) {
>     overhand_helper(importance, src_deck, deck_size, Deck(), cont);
> }
The final code doesn't bother computing the denominator from Bayes' theorem. The most likely initial state is given by the one that results in the highest score. If you normalise the scores to sum to one you'll get actual probabilities.
> int main() {
This is the array representation of the cards in the following picture:
>     Deck target = {{ 10, 11, 6, 12, 1, 13, 8, 2, 9, 3, 5, 4, 7 }};

>     Deck deck;
Our known starting sequence is just 1, 2, 3, ..., J, Q, K. We iterate over all ways to pick a card out from this sequence and place it at the top.
>     for (int k = 0; k < deck_size; ++k) {
>         deck.card[0] = k+1;
>         for (int i = 1; i < deck_size; ++i) {
>             deck.card[i] = (i > k ? i : i-1)+1;
>         }
>         double likelihood = 0.0;
Here is where I use the lambdas. For this example I'm doing an overhand shuffle followed by a riffle shuffle. (The syntax is pretty bizarre and it's also weird that I kinda sorta specify the type of my lambda but that's not really what the type of the expression is. But having manually faked and lifted lambdas many times in C++ I can see why it's the way it is.) Note how I've captured likelihood by reference to give these lambda expressions write access to it.
>         overhand(1.0, deck, [&likelihood, target](double importance, Deck &deck) -> void {
>            riffle(importance, deck, [&likelihood, target](double importance, Deck &deck) -> void {
>                if (deck == target) {
We sum the probabilities for all ways of generating the target deck:
>                    likelihood += importance;
>                }
>         }); });
>         cout << "If top card = " << deck.card[0] << endl;
>         cout << "then unnormalised probability = " << likelihood << endl;
>         cout << endl;
>     }

> }
Run the above code and you get unnormalised probabilities. Here are a few of them:
If top card = 4
then unnormalised probability = 5.7568e-12
If top card = 6
then unnormalised probability = 5.37301e-11
If top card = 7
then unnormalised probability = 1.791e-11
In fact, I had chosen 6.
Some discussion
Don't expect it to work perfectly! It can only give probabilities but it's often surprisingly good. But there is a lot of room for improvement. Some work looking at how people actually shuffle could give a better probabilistic model.
Some exercises.
1. The code can be made orders of magnitude faster. The final shuffle is performed and then the result is compared to the target sequence. But you can start comparing cards with the target before the shuffle is finished. Most times you'll only need to look at the first card of the result of a shuffle before you know you haven't matched the target. Fixing this will give a big speed up.
2. The continuation passing style makes it easy to incorporate other sources of knowledge. For example if you 'accidentally' peek at the bottom card after the first shuffle you can incorporate that knowledge into the algorithm. Figure out how.
3. Write lots more kinds of shuffles and experiment. I'm hoping someone good with magic will come up with a sequence of operations that looks hopelessly random but allows a good probability of recovering the chosen card. You could also combine this with other techniques such as designing shuffles that maintain various kinds of invariant.
4. The code can be rewritten to work backwards from the final state to the initial states. Work out how to do this. (This parallels ray-tracing where you can work from eye to light or from light to eye.)
5. We're doing the same work over and over again. We don't need to compute all of the shuffles for each initial state. We can compute each shuffle once and reuse it on each initial state. Try implementing it.
Sunday, November 18, 2012

A pictorial proof of the hairy ball theorem

The hairy-ball theorem says that there is no continuous non-zero vector field on the surface of a sphere. There are lots of popular accounts that tell you what this means, giving great examples. Here's a Youtube video for example:



My goal is to show why it's always true.

A simply connected domain in the plane is one with the property that any loop in it can be shrunk down to a point. Here's an example of a domain D with an example loop L being shrunk down to a point P:

Here's an example of a domain that's not simply connected. It has a hole in the middle. I've drawn a loop L around the hole. You can't shrink that loop to a point because the hole gets in the way:
Here's a simply connected domain with a vector field on it:
Think of the vectors as being drawn literally in the surface so that if we were to pick up the surface and stretch it like a piece of rubber the vectors would get stretched with it. Remember that a vector field is defined everywhere in the domain so the arrows are just a random sprinkling of examples to show what's going on. For this to be an accurate picture you want to imagine an infinity of arrows, one at every single point of the domain.

Let's put a loop, starting and ending at P, in our simply-connected domain:
Now imagine travelling along the loop, starting at P and ending at P. As you move along there's an arrow at each point in your journey. Here's what the arrows look like as you travel from P to P anti-clockwise, plotted as a kind of graph:
The vectors start off pointing to the right. They swing anti-clockwise by about 45º and then swing back to where they started. As the journey is a loop they clearly must end where they started. A different, really swirly vector field might have resulted in arrows that rotated around hundreds of times along your journey. But by the time you reach the end of the journey they must swing back to where they started. What's slightly less obvious is that they'd also have to rotate back to cancel out the hundreds of swings. You might think "the vectors could rotate round a hundred times but as long as they make exactly 100 turns they'll return to where they started and there's no need for them to unwind". But actually, every bit of rotation in the journey must be unwound. The total amount of rotation, adding all the positive rotations, and subtracting off the negative rotations, is called the winding number for the loop. We count anti-clockwise rotation as positive and clockwise as negative. So I'm claiming that the winding number for a closed loop in a simply-connected domain is always zero.

(Note: in most books the winding number normally refers to how many times the loop itself winds around a point. I'm using it to refer to how many times the vector winds around as you follow the loop. To help with your intuition: the hour hand of a working clock normally accumulates a winding number of -2 in one day. If it ran forward for a day, but then ran backwards for half a day, the winding number would be -1.)

Here's why the winding number for simply connected domains must be zero: firstly - it's pretty clear that the winding number for any loop must be an integer. If the winding number was a half, say, the arrow wouldn't end up pointing 180º from where it started which makes no sense for a closed loop. Now the domain is simply connected, so the loop can be shrunk to a point. Now imagine doing the shrinking really slowly and keeping track of the winding number as the loop shrinks. As the loop shrinks, the graph of the vectors along the loop must vary slowly. The total winding number depends continuously on the vectors in the graph so the winding number must vary slowly as the loop shrinks. But the winding number is an integer. It can't change really slowly, it can only change by amounts of a whole integer. So the winding number can't change at all. Every loop in a simply-connected domain must have a winding number that's the same as the winding number of a loop that is just one point ie. zero.

On to the sphere. Here's a sphere with a vector field where all of the vectors point along lines of longitude to the north pole:
(Sorry about my poor quality drawing but I'm sure you know what vectors pointing north look like.)

At this point you may be tempted to say "aha! That's a continuous vector field on the sphere that's non-zero everywhere!" Alas, it's not defined everywhere. It's a vector field everywhere except at the north and south poles. If you're at the north pole, no non-zero vector can point north. And at the south pole every non-zero vector points north with no continuous way to pick just one.

Given any vector field on the Earth we can imagine slicing the Earth through the equator and flattening out the surfaces of the northern and southern hemispheres as two separate disks. Here's what you get if you do this with the north vector field (ignoring the problems at the poles for now):

You reconstruct the Earth again by gluing the two disks together according to the orange arrows, and then inflating. Any vector field on the surface of the Earth gives rise to a pair of vector fields on disks like this. But there will be a constraint. The vectors around the boundary of the two disks will match. In fact, vectors at the opposite ends of the orange arrow have to match. But they won't necessarily be equal as drawn in this diagram because the disk for the southern hemisphere corresponds to a view from below.

Suppose we start at the point P and follow a loop eastwards along the equator. That's an anti-clockwise loop round the upper disk and simultaneously a clockwise loop round the lower disk. Here are the graphs:



In the upper map the loop gives rise to winding number one. But in the lower map we get winding number minus one. So here's an important lesson: the winding number makes perfect sense for a flat domain in the plane. But on the surface of 3D objects it depends on how you flatten out your map. In this case, the winding number on the upper map is 2 more than the winding number for the lower map. (Remember, these fields aren't defined at the poles so we haven't contradicted the original theorem that the winding number is zero for any vector field defined in a simply-connected domain.)

But here's the most important thing in this proof: the winding number for the upper hemisphere loop will be two more than the winding number for the lower hemisphere loop, no matter what vector field you have. This is because if you've travelled an angle θ around the equator, the vectors at opposite ends of the orange arrows will differ by an angle of 2θ. For example, once you're 90º around the earth, the north arrow is drawn as a down-arrow in the upper graph and as an up-arrow in the lower graph. They're already 180º apart. You can see this is true for north pointing vectors literally by tracing with your fingers around the loops. It's also true for vectors pointing east. I'll leave that as an exercise for you, but here's a picture of some eastward vectors to get you started:
Along the equator, every vector on the surface is a linear combination of north and east vectors. So if it's true for both the north and east vectors then it must be true for all vectors. But if the graph for one picture of the equatorial loop has vectors that are 2θ more than the vectors for another graph, the first one must complete two revolutions more than the second one. So the first has a winding number two more than the second.

If you had a continuous vector field that really was non-zero over the entire sphere, cutting the sphere in half would give a pair of continuous vector fields defined on disks. As disks are simply-connected, the theorem we started with tells us they must both have winding number zero as you loop around them. But we've also just shown that looping round one has winding number two more than looping around the other. This is a contradiction. So there is no continuous vector field that is non-zero everywhere. ∎

If you get stuck above I strongly recommend trying to draw some continuous non-zero vector fields on the sphere, transferring them to disks, and counting winding numbers.

Notice how we've done more than prove the theorem. We now know that if we have a continuous vector field on a sphere we can find out whether to look for its zeros in the northern or southern hemisphere by computing the winding numbers as above. At least one of the two winding numbers must be non-zero and that tells us which hemisphere we can be sure contains a zero. The fact that the two winding numbers differ by two, and not by just one, also tells us a bit about the nature of the zeros. But that's another story. That two is also related to the fact that the Euler characteristic of the sphere is two. It's also related to the Lefschetz index.

This proof is based on proofs I studied years ago relating to Chern classes. I recently became interested in Chern classes again because they play an important role in understanding phenomena in solid state physics such as the quantum Hall effect. That argument about slowly shrinking a loop leaving its winding number unchanged tells you a lot about slowly changing certain types of quantum system.

It's possible I completely messed up. Here's an "elementary" proof. It looks much harder than what I did. But I feel like I did faithfully capture, in pictures, an argument that's buried in Lectures on Riemann surfaces. And it seems to correctly reproduce the Lefschetz number of 2.

Saturday, April 07, 2012

Generalised entropy

Introduction
The entropy of a probability distribution can be seen as a measure of its uncertainty or a measure of the diversity of samples taken from it. Over the years I've talked lots about how probability theory gives rise to a monad. This suggests the possibility that maybe the notion of entropy can be generalised to monads other than probability. So here goes...

> {-# LANGUAGE MultiParamTypeClasses, FlexibleInstances, GeneralizedNewtypeDeriving #-}
> {-# LANGUAGE FunctionalDependencies, TypeSynonymInstances #-}

> import Control.Monad
> import Control.Monad.Writer hiding (lift)


Shannon entropy
I've talked in the past about how there is some trickiness with defining the probability monad in Haskell because a good implementation requires use of the Eq typeclass, and hence restricted monads. Restricted monads are possible through a bunch of methods, but this time I don't want them.

It's common to represent probability distributions on finite sets as lists of pairs where each pair (p, x) means x has a probability p. But I'm going to allow lists without the restriction that each x appears once and make my code work with these generalised distributions. When I compute the entropy, say, it will only be the usual entropy in the case that each x in the list is unique.

So here's our type and some instances for it:

> data P a = P [(a, Float)] deriving Show

> instance Functor P where
>      fmap f (P xs) = P [(f a, p) | (a, p) <- xs]

> instance Monad P where
>      return x = P [(x, 1)]
>      P xss >>= f = P [(y, p*q) | (pxs, p) <- xss, let P ys = f pxs, (y, q) <- ys]

We can easily compute the expected value of a distribution, and its entropy, like this:

> expectation0 (P xs) = sum [x*p | (x, p) <- xs]
> entropy0 (P xs) = -sum [if p==0 then 0 else p*log p/log 2.0 | (_, p) <- xs]

An important property of entropy is known as the grouping property which can be illustrated through an example tree like this:


The entropy for the probability distribution of the final leaves is the sum of two components: (1) the entropy of the branch at the root of the tree and (2) the expected entropy of the subtrees. Here's some corresponding code. First simple bernoulli trials:

> bernoulli p a b = P [(a, p), (b, 1-p)]

Now the branch at the root of the tree:

> root = bernoulli 0.3 False True

We can compute the entropy for the distribution on the leaves:

> test1 = entropy0 $ do
>   x <- root
>   if x
>     then bernoulli 0.2 3 4
>     else bernoulli 0.4 5 6

Or the sum of the root entropy and the expected subtree entropy:

> test2 = entropy0 root + (expectation0 $ do
>   x <- root
>   if x
>     then return $ entropy0 (bernoulli 0.2 3 4)
>     else return $ entropy0 (bernoulli 0.4 5 6))

You can confirm for yourself that test1 == test2.

We can rewrite that a little. We're drawing True or False from root only to decide which distribution to use at the next stage. But we may as well pick the distribution itself at random. So define:

> dist = bernoulli 0.3 (bernoulli 0.4 5 6) (bernoulli 0.2 3 4)

And now we expect the equality of test3 and test4:

> test3 = entropy0 $ do
>   x <- dist
>   x

> test4 = entropy0 dist + (expectation0 $ do
>   x <- dist
>   return $ entropy0 x)

There's a more elegant way of writing this. Define:

> left0 dist = entropy0 (join dist)
> right0 dist = entropy0 dist+expectation0 (fmap entropy0 dist)

Now we expect left0 dist and right0 dist to always be equal. We've almost generalised to something that makes sense in the context of monads other than probability.

The algebra of a monad
Here are a couple of important properties of expectation0:

1. expectation0 (return d) = d
2. expectation0 (join d) = expectation0 (fmap expectation0 d)

In English: the expectation of certainty is just the certain value, and the expectation of an expectation is just the expectation. But these rules are precisely the conditions that define an m-algebra, where m is a monad.

So let's define a type class:

> class Algebra m a | m -> a where
>      expectation :: m a -> a

We'll assume that when m is a monad, any instance satisfies the two laws above. Here's the instance for probability:

> instance Algebra P Float where
>      expectation (P xs) = sum [x*p | (x, p) <- xs]

In keeping with the notion that entropy measures diversity let's also define:

> class Diverse m r | m -> r where
>      entropy :: m x -> r

with the instance:

> instance Diverse P Float where
>      entropy (P xs) = -sum [if p==0 then 0 else p*log p/log 2.0 | (_, p) <- xs]

It's not clear what laws we need but for now we'll assume a generalised entropy satisfies left dist == right dist:

> left dist = entropy (join dist)
> right dist = entropy dist+expectation (fmap entropy dist)

We'll call that the generalised grouping law.

Binary trees
It's not hard to find other structures that satisfy these laws if we cheat and use alternative structures to represent probabilities. For example, we can make Tree an instance by assuming Fork represents a 50/50 chance of going one way or the other:

> data Tree a = Leaf a | Fork (Tree a) (Tree a) deriving Show

> instance Functor Tree where
>      fmap f (Leaf a) = Leaf (f a)
>      fmap f (Fork l r) = Fork (fmap f l) (fmap f r)

> instance Monad Tree where
>      return x = Leaf x
>      Leaf a >>= f = f a
>      Fork l r >>= f = Fork (l >>= f) (r >>= f)

> instance Algebra Tree Float where
>      expectation (Leaf a) = a
>      expectation (Fork l r) = 0.5*expectation l+0.5*expectation r

> instance Diverse Tree Float where
>      entropy (Leaf a) = 0
>      entropy (Fork l r) = 1+0.5*entropy l+0.5*entropy r

Lists
We could make non-empty lists into an instance by assuming a uniform distribution on the list. But another way to measure the diversity is simply to count the elements. We subtract one so that [x] corresponds to diversity zero. This subtraction gives us a non-trivial instance:

> newtype L a = L [a] deriving (Show, Monad, Functor)

> instance Algebra L Int where
>      expectation (L xs) = sum xs

> instance Diverse L Int where
>      entropy (L xs) = length xs-1

Tsallis entropy
There are measures of diversity for probability distributions that are distinct from Shannon entropy. An example is Tsallis entropy. At this point I'd like a family of types parametrised by reals but Haskell doesn't support dependent types. So I'll just fix a real number q and we can define:

> q = 2.5

> data T a = T [(a, Float)] deriving Show

> instance Functor T where
>      fmap f (T xs) = T [(f a, p) | (a, p) <- xs]

> instance Monad T where
>      return x = T [(x, 1)]
>      T xss >>= f = T [(y, p*q) | (pxs, p) <- xss, let T ys = f pxs, (y, q) <- ys]

> instance Algebra T Float where
>      expectation (T xs) = sum [x*p**q | (x, p) <- xs]

> instance Diverse T Float where
>      entropy (T xs) = (1-sum [p**q | (_, p) <- xs])/(q-1)

And again we find our generalised grouping rule for entropy holds.

Operads
This is all derived from Tom Leinster's post last year at the n-category cafe. As I talked about here there's a close relationship between monads and operads. Operads are a bit like container monads where the containers don't contain anything, but just have holes where contents could be placed. This makes operads a better place to work because you don't have the awkward issue I started with: having to disallow lists of value/probability pairs where the same value can appear more than once. Nonetheless, in (unrestricted) Haskell monads you don't have Eq available so you can't actually have definitions of return or >>= that can notice the equality of two elements. If such definitions were possible, the grouping law would no longer work as stated above.

Crossed homomorphisms
The generalised grouping law even makes sense for very different monads. For the Reader monad the law gives the definition of a crossed homomorphism. It's pretty weird seeing a notion from group cohomology emerge like this and I recommend skipping to the final section unless you care about this sort of thing. But if you do, this is related to research I did a long time ago. This is to test that the Schwarzian derivative really does give rise to a crossed homomorphism.

Firstly let me set up some automatic differentiation code:

> data D a = D { re::a, im::a } deriving (Show, Ord, Eq)

> instance Num a => Num (D a) where
>      fromInteger n = D (fromInteger n) 0
>      D a a'+D b b' = D (a+b) (a'+b')
>      D a a'*D b b' = D (a*b) (a*b'+a'*b)
>      D a a'-D b b' = D (a-b) (a'-b')

> instance Fractional a => Fractional (D a) where
>      fromRational n = D (fromRational n) 0
>      D a a'/D b b' = let q = 1/b in D (a*q) ((-a*b'+a'*b)*q*q)

> lift x = D x 0

> d f x = im (f (D x 1))

> raised f = re . f . lift
> raised2 = raised . raised
> raised3 = raised2 . raised

The Cn are the n-times (automatically) differentiable functions. Unfortunately the Endo defined in Data.Monoid acts the wrong way round from what I want so I need a Dual:

> type C1 = Dual (Endo (D Double))
> type C3 = Dual (Endo (D (D (D Double))))
> type C4 = Dual (Endo (D (D (D (D Double)))))

> instance Eq (Endo (D Double))
> instance Ord (Endo (D Double))

A silly Show instance that simply evaluates a function at a number I chose randomly: 1.234.

> instance Show (Endo (D Double)) where
>         show (Endo f) = show (f 1.234)

> instance Num C1 where
>      fromInteger n = Dual (Endo (\x -> fromInteger n))
>      Dual (Endo  f)+Dual (Endo  g) = Dual (Endo  (\x -> f x + g x))
>      Dual (Endo  f)-Dual (Endo  g) = Dual (Endo  (\x -> f x - g x))
>      Dual (Endo  f)*Dual (Endo  g) = Dual (Endo  (\x -> f x * g x))

> instance Fractional C1 where
>      fromRational n = Dual (Endo (\x -> fromRational n))
>      Dual (Endo f)/Dual (Endo g) = Dual (Endo (\x -> f x / g x))

> newtype Q a = Q (Writer C4 a) deriving (Monad, Functor)

We can give Q a a geometrical interpretation. The underlying type is a pair (a, C4). If we think of elements of C4 as charts on a piece of Riemann surface then for any n, an element of (a, C4) represents a local piece of a section of the nth tensor power of the canonical bundle. Ie. we can think of it as representing f(z) (dz)^n. I'll concentrate on the case n = 2 which gives quadratic differentials. We can think of an element of ((a, C4), C4) as such forms where we're composing two charts. We can collapse down to an ordinary chart by using the chain rule. Here's the code:

> instance Algebra Q C1 where
>      expectation (Q ma) = let (Dual (Endo a), Dual (Endo f)) = runWriter ma
>                           in Dual (Endo (\x -> a (raised3 f x)*(raised2 (d f) x)^2))

Now we can define the Schwarzian derivative:

> schwarzian f x = let f0 = raised3 f x
>                      f1 = raised2 (d f) x
>                      f2 = raised (d $ d f) x
>                      f3 = (d $ d $ d f) x
>                  in f3/f1-1.5*(f2/f1)^2

And somewhat bizarrely, we now have a generalised entropy:

> instance Diverse Q C1 where
>      entropy (Q ma) = let (_, Dual (Endo f)) = runWriter ma
>                       in Dual (Endo (\x -> schwarzian f x))

This is the construction that gives rise to the Virasoro algebra which plays such an important role in String Theory.

Some tests
And here's a bunch of tests. I'd have used QuickCheck but it won't install for me today...

> test :: (Algebra m t, Diverse m t, Num t, Functor m, Monad m) => m (m x) -> IO ()
> test x = do
>      print (left x, right x)

> main = do
>      test $ L [L [1, 2, 3], L [2, 3, 4], L [1], L [5], L [2, 7::Int]]
>      test $ P [(P [(0, 0.5), (1, 0.5)], 0.5), (P [(2, 0.5), (3::Int, 0.5)], 0.5::Float)]
>      test $ T [(T [(0, 0.5), (1, 0.5)], 0.5), (T [(2, 0.5), (3::Int, 0.5)], 0.5::Float)]
>      test $ Leaf (Leaf 1 `Fork` Leaf 2) `Fork` Leaf (Leaf 3 `Fork` (Leaf 4 `Fork` Leaf 5))
>      test $ (Q (writer
>               (Q (writer (Dual (Endo (\x -> x)),
>                           Dual (Endo (\x -> x^2+1)))),
>                           Dual (Endo (\x -> (2+x)/(3+x*x))))) :: Q (Q C3))

Saturday, March 17, 2012

Overloading Python list comprehension

Introduction
Python is very flexible in the way it allows you to overload various features of its syntax. For example most of the binary operators can be overloaded. But one part of the syntax that can't be overloaded is list comprehension ie. expressions like [f(x) for x in y].

What might it mean to overload this notation? Let's consider something simpler first, overloading the binary operator +. The expression a+b is interpreted as a.__add__(b) if a is of class type. So overloading + means nothing more than writing a function. So if we can rewrite list comprehensions in terms of a function (or functions) then we can overload the notation by providing alternative definitions for those functions. Python doesn't provide a facility for doing this directly, but we can at least think about what it might mean to do this. Later we'll see how to tweak the Python interpreter to make it possible.
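For instance, here's a minimal sketch of overloading + via __add__ (the class Vec is just for illustration):

```python
class Vec:
    def __init__(self, x, y):
        self.x, self.y = x, y

    # a+b is interpreted as a.__add__(b)
    def __add__(self, other):
        return Vec(self.x + other.x, self.y + other.y)

v = Vec(1, 2) + Vec(3, 4)
assert (v.x, v.y) == (4, 6)
```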

map
Consider the expression
[a for x in y]
Here the single letter variables are 'metavariables' representing fragments of Python code. To a good approximation this is equal to:
map(lambda x: a, y)
(BTW Everything I say here is "to a good approximation". Python is an incredibly complex language and I'm not good enough at it to make any categorical statements about when one fragment of code is the same as another.)

So it's tempting to see list comprehensions as syntactic sugar for map, in which case one approach to overloading comprehension is to consider interpreting it in terms of replacements for map. But this isn't a very powerful overloading. It just gives us a slightly different way to write something that's already straightforward.
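We can check this equivalence directly for a concrete choice of a and y:

```python
ys = [1, 2, 3]
# the comprehension and the map produce the same list
assert [x * x for x in ys] == list(map(lambda x: x * x, ys))
```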

concatMap
Another reason for not simply seeing list comprehension in terms of map is that nested list comprehensions need another operation. Consider
[(y, z) for y in [1, 2] for z in ['a', 'b']]
This isn't quite the same as
[[(y, z) for z in ['a', 'b']] for y in [1, 2]]
but it's close. The latter produces nested lists whereas the first gives one flat list. We can think of nested comprehensions as applying a flattening operation. Let's use list comprehension to implement flattening:
def concat(xs):
 return [y for x in xs for y in x]
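A concrete check of the difference, using concat as just defined:

```python
def concat(xs):
    return [y for x in xs for y in x]

flat = [(y, z) for y in [1, 2] for z in ['a', 'b']]
nested = [[(y, z) for z in ['a', 'b']] for y in [1, 2]]
# the nested comprehension gives lists of lists...
assert nested == [[(1, 'a'), (1, 'b')], [(2, 'a'), (2, 'b')]]
# ...and flattening it recovers the flat comprehension
assert concat(nested) == flat
```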
We now write our nested comprehension as:
concat([[(y, z) for z in ['a', 'b']] for y in [1, 2]])
We know how to write non-nested comprehensions using map so we get:
concat(map(lambda y: [(y, z) for z in ['a', 'b']], [1, 2]))
And rewriting the inner comprehension we get:
concat(map(lambda y: map(lambda z: (y, z), ['a', 'b']), [1, 2]))
Every time we add another level of nesting we're going to need another concat. But the innermost map doesn't have a concat. Purely for reasons of symmetry we can ensure every map has a concat by enclosing the innermost element as a singleton list:
concat(map(lambda y: concat(map(lambda z: [(y, z)], ['a', 'b'])), [1, 2]))
Every map has a concat so we can simplify slightly. Let's define:
def concatMap(f, xs):
 return [y for x in xs for y in f(x)]

def singleton(x):
 return [x]
Our expression becomes:
concatMap(lambda y: concatMap(lambda z: singleton((y, z)), ['a', 'b']), [1, 2])
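Here's a self-contained check that this rewrite reproduces the original comprehension, with concatMap written directly as a flattening comprehension:

```python
def concatMap(f, xs):
    # map f over xs and flatten the resulting list of lists
    return [y for x in xs for y in f(x)]

def singleton(x):
    return [x]

pairs = concatMap(lambda y: concatMap(lambda z: singleton((y, z)), ['a', 'b']), [1, 2])
assert pairs == [(y, z) for y in [1, 2] for z in ['a', 'b']]
```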
Importantly we've completely rewritten the comprehension in terms of concatMap and singleton. By changing the meaning of these functions we can change the meaning of comprehension notation, or at least we could if the Python interpreter defined comprehension this way. It doesn't, but we can still reason about it. Although any comprehension that doesn't use ifs can be rewritten to use these functions, I won't give a formal description of the procedure. Instead I'll provide code to perform the rewrite later. While I'm at it, I'll also handle the ifs.

Laws
Freely redefining singleton and concatMap to redefine comprehension could get weird. If we're going to redefine them we should at least try to define them so that list comprehension still has some familiar properties. For example, for y a list we usually expect:
y == [x for x in y]
In other words
y == concatMap(lambda x: singleton(x), y)
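For the list versions of these functions the law is easy to confirm:

```python
def concatMap(f, xs):
    return [y for x in xs for y in f(x)]

def singleton(x):
    return [x]

ys = [1, 2, 3]
# the law: wrapping each element and flattening gives back the list
assert ys == concatMap(lambda x: singleton(x), ys)
```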
At this point I could give a whole bunch more laws but it's time to own up.

Monads
A pair of functions singleton and concatMap, along with a bunch of laws, are essentially the same thing as a monad. In Haskell, concatMap is usually called bind and singleton is called return. What I've done here is show how Wadler's Comprehending Monads paper might look in Python. Haskell has specialised monad notation built into its grammar. But what's less well known is that so does Python! The catch is that although the grammar is right, the semantics can't be generalised beyond lists.
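To see that these two functions really can carry a different meaning, here's a sketch of a "maybe"-style monad in the same shape, where None plays the role of failure. (The names singleton_m and concatMap_m are mine, chosen to mirror the list versions; note this toy version can't represent None as a successful value.)

```python
def singleton_m(x):
    # return: a successful value is just itself; None is reserved for failure
    return x

def concatMap_m(f, m):
    # bind: propagate failure, otherwise feed the value to f
    return None if m is None else f(m)

assert concatMap_m(lambda x: singleton_m(x + 1), 3) == 4
assert concatMap_m(lambda x: singleton_m(x + 1), None) is None
```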

Monad-Python
One great thing about Python is that there seem to be libraries for working with every aspect of Python internals. So it's fairly easy to write a simple Python interpreter that rewrites list comprehensions to use singleton and concatMap. I've placed the source on github. Use mpython.py instead of python as your interpreter. I've tested it with Python 2.6 and 2.7.

When using mpython, list comprehension uses whatever definitions of __concatMap__ and __singleton__ are currently in scope. By default they are the definitions I gave above so we get something close to the usual list comprehension.

An example of the kind of code you can run with mpython.py is:
import math

def __concatMap__(k, m):
  return lambda c:m(lambda a:k(a)(c))

def __singleton__(x):
  return lambda f:f(x)

def callCC(f):
  return lambda c:f(lambda a:lambda _:c(a))(c)

def __fail__():
  raise Exception("Failure is not an option for continuations")

def ret(x):
  return __singleton__(x)

def id(x):
  return x

def solve(a, b, c):
  return callCC(lambda throw: [((-b-d)/(2*a), (-b+d)/(2*a))
                               for a0 in (throw("Not quadratic") if a==0 else ret(a))
                               for d2 in ret(b*b-4*a*c)
                               for d in (ret(math.sqrt(d2)) if d2>=0 else throw("No roots"))
                              ])

print solve(1, 0, -9)(id)
print solve(1, 1, 9)(id)
print solve(0, 1, 9)(id)
I have defined our functions so that comprehension syntax gives us the continuation monad. This makes continuation passing style relatively painless in Python. (At least easier than chaining many lambdas.) I have then defined callCC to be similar to its definition in Haskell. There are many uses for callCC including the implementation of goto. Above I use it in a trivial way to throw exceptions.
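For comparison, here's roughly what the comprehension in solve desugars to when written with the two functions directly, so it runs in plain Python with no special interpreter. (I've renamed the functions concatMapC and singletonC; a0 is unused, just as in the comprehension.)

```python
import math

def concatMapC(k, m):           # continuation-monad bind
    return lambda c: m(lambda a: k(a)(c))

def singletonC(x):              # continuation-monad return
    return lambda f: f(x)

def callCC(f):
    return lambda c: f(lambda a: lambda _: c(a))(c)

def solve(a, b, c):
    # each "for v in m" in the comprehension becomes concatMapC(lambda v: ..., m)
    return callCC(lambda throw:
        concatMapC(lambda a0:
        concatMapC(lambda d2:
        concatMapC(lambda d:
            singletonC(((-b - d) / (2.0 * a), (-b + d) / (2.0 * a))),
            singletonC(math.sqrt(d2)) if d2 >= 0 else throw("No roots")),
            singletonC(b * b - 4 * a * c)),
            throw("Not quadratic") if a == 0 else singletonC(a)))

assert solve(1, 0, -9)(lambda x: x) == (-3.0, 3.0)
assert solve(1, 1, 9)(lambda x: x) == "No roots"
assert solve(0, 1, 9)(lambda x: x) == "Not quadratic"
```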

Conclusion
My script mpython.py is a long way from an industrial strength interpreter and I'm not proposing the above as an extension to Python. My goal was simply to show how Haskell-style monads are not as alien to Python as you might think. In fact, it's reasonable to say that Python already supports one flavour of specialised monad syntax. Most users don't realise it as such because it has been hard-wired to work with just one monad, lists.

BTW if you attempt to implement all of the other Haskell monads you'll find that Haskell behaves a little differently because of its laziness. You can recover some of that laziness by careful use of continuations in Python. But I've no time to go into that now.

Saturday, February 11, 2012

Using Lawvere theories to combine effects

> {-# LANGUAGE MultiParamTypeClasses, ExplicitForAll, RankNTypes, FlexibleInstances, FlexibleContexts, TypeSynonymInstances #-}

> import Data.Monoid
> import Data.Functor.Identity
> import Control.Monad.Writer

In an earlier post I talked about how monads arise from free algebras. Let me recap a bit.

In Part 1 I described algebras. They're sets with operations on them satisfying some laws. We can build new elements of an algebra from old ones by using its operations. Eg. if x and y are in an algebra then x `mappend` y must be in it too. Starting with a bunch of symbols, thought of as leaves, we can consider the set of all expression trees we can build from them. If we consider pairs of trees to be equivalent if the laws say the corresponding expressions are equal, then the set of trees itself forms an algebra known as a free algebra (for the given theory).

Let's start with some code. This type class says that the type b has leaves of type a:

> class Free a b where
>   leaf :: a -> b

Effects from monoids
Now we can make the type of all trees built from Monoid operations and including all leaves of type a:

> data FreeMonoid a = FreeMonoid (forall b. (Monoid b, Free a b) => b)

And we have:

> instance Monoid (FreeMonoid a) where
>   mempty = FreeMonoid mempty
>   FreeMonoid a `mappend` FreeMonoid b = FreeMonoid (a `mappend` b)

Unfortunately elements like e1 and e2 ought to be equal but Haskell doesn't know this:

> e1, e2 :: FreeMonoid Char
> e1 = FreeMonoid (leaf 'a' `mappend` (leaf 'b' `mappend` leaf 'c'))
> e2 = FreeMonoid ((leaf 'a' `mappend` leaf 'b') `mappend` leaf 'c')

Instead we can manually construct a type that does respect equality in monoids. Elements of FreeMonoid are binary trees with a `mappend` at each node. Associativity means that we can always replace a tree with an equivalent one where the left branch is a leaf. We can also use the laws to eliminate any occurrence of mempty. So every element of FreeMonoid a is equivalent to one of the form:
leaf x1 `mappend` (leaf x2 `mappend` (... mempty))


In other words, free monoids are lists. We can make this explicit. The standard prelude already makes [] an instance of Monoid so we just need:

> instance Free a [a] where
>       leaf x = [x]

Here's the isomorphism (modulo tree equivalence):

> iso1 :: FreeMonoid a -> [a]
> iso1 (FreeMonoid x) = x

> iso1' :: [a] -> FreeMonoid a
> iso1' [] = FreeMonoid mempty
> iso1' (a : as) = let FreeMonoid r = iso1' as
>                  in FreeMonoid (leaf a `mappend` r)

As I talked about in that earlier article, free algebras give monads and the trees representing expressions in the algebra can be thought of as abstract syntax trees for domain specific languages. In this case it's the usual list monad. So the Monoid type class gives us a language for talking about non-determinism. The operation mappend gives us a way to "fork" a process and mempty gives us a way to "kill a thread". Here's an example using non-determinism to search for some Pythagorean triples:

> test1 :: [(Int, Int, Int)]
> test1 = do
>   a <- return 3 `mappend` return 4
>   b <- return 4 `mappend` return 5
>   c <- return 5 `mappend` return 6
>   if a*a+b*b==c*c then return (a, b, c) else mempty

Effects from M-sets
We can do exactly the same for M-sets.

> class Monoid m => MSet m s where
>       act :: m -> s -> s

> data FreeMSet w a = FreeMSet (forall b. (MSet w b, Free a b) => b)

> instance Monoid w => MSet w (FreeMSet w a) where
>   m `act` FreeMSet b = FreeMSet (m `act` b)

Again we have the problem that FreeMSet doesn't automatically make equivalent elements equal. But it's not hard to see that every element of FreeMSet is equivalent to one of the form:
m `act` (leaf x)
So the free M-set on a set of variables is simply the set of pairs of a variable and an element of M. This is the basis of Haskell's writer monad:

> instance Monoid w => MSet w (Writer w a) where
>   act w1 m = let (a, w2) = runWriter m in WriterT (Identity (a, w1 `mappend` w2))

> instance Monoid w => Free a (Writer w a) where
>   leaf x = return x

Here's the isomorphism (again treating equivalent elements of FreeMSet as equal):

> iso2 :: Monoid w => FreeMSet w a -> Writer w a
> iso2 (FreeMSet x) = x

> iso2' :: Writer w a -> FreeMSet w a
> iso2' m = let (a, w) = runWriter m in FreeMSet (act w (leaf a))

And now the M-set operation gives us an interface to an effect. This time the side effect of accumulating in a monoid:

> test2 :: Writer String Int
> test2 = do
>   act "foo" (return ())
>   a <- return 2
>   act "bar" (return ())
>   b <- return (10*a)
>   return b

Combining effects

And now we can finally combine the two effects of non-determinism and accumulation. We make the free algebra that is both a monoid and an M-set:

> data FreeMMonoid w a = FreeMMonoid (forall b. (Monoid b, MSet w b, Free a b) => b)

> instance Monoid w => Monoid (FreeMMonoid w a) where
>   mempty = FreeMMonoid mempty
>   FreeMMonoid a `mappend` FreeMMonoid b = FreeMMonoid (a `mappend` b)

> instance Monoid w => MSet w (FreeMMonoid w a) where
>   m `act` FreeMMonoid b = FreeMMonoid (m `act` b)

Again we have the problem that equivalent elements aren't recognised as equal so we have to manually find a suitable type. For this we need to use the compatibility notion I introduced in Part 1. We can take two variables x and y and write them in a 1 by 2 array.

Apply mappend horizontally and act vertically to get:
m `act` (x `mappend` y)
Now apply act vertically and then mappend horizontally to get:
(m `act` x) `mappend` (m `act` y)
The law we want is:
m `act` (x `mappend` y) == (m `act` x) `mappend` (m `act` y)
Given an arbitrary tree in FreeMMonoid we can use this law to "push" all occurrences of act inwards. Ultimately every element can be written uniquely in the form:
act m1 (leaf x1) `mappend` (act m2 (leaf x2) `mappend` (... mempty))


We can then use the same argument as above to show that we end up with a list of pairs, each pairing a variable with an element of the monoid. This is exactly what we get if we apply the WriterT monad transformer to []. Here are the relevant instances:

> instance Monoid w => Monoid (WriterT w [] a) where
>   mempty = WriterT []
>   WriterT xs `mappend` WriterT ys = WriterT (xs ++ ys)

> instance Monoid w => MSet w (WriterT w [] a) where
>   m `act` WriterT xs = WriterT $ map (\(x, w) -> (x, m `mappend` w)) xs

> instance Monoid w => Free a (WriterT w [] a) where
>   leaf x = return x

Here's the isomorphism though we won't use it:

> iso3 :: Monoid w => FreeMMonoid w a -> WriterT w [] a
> iso3 (FreeMMonoid x) = x

> iso3' :: Monoid w => WriterT w [] a -> FreeMMonoid w a
> iso3' m = let xws = runWriterT m in FreeMMonoid $
>     foldr mappend mempty $ map (\(x, w) -> act w (leaf x)) xws

The monad WriterT (Product Float) [] is in fact the probability monad. Here's an example of its use:

> coin :: (Monoid a, MSet (Product Float) a, Free Bool a) => a
> coin = act (Product 0.5 :: Product Float) (leaf False)
>            `mappend`
>        act (Product 0.5 :: Product Float) (leaf True)

Compute the unnormalised conditional probability distribution on a pair of coin tosses given that the first coin can't be True unless the second one is:

> test3 :: WriterT (Product Float) [] (Bool, Bool)
> test3 = do
>   coin1 <- coin
>   coin2 <- coin
>   if coin1>coin2 then mempty else return (coin1, coin2)

(Compare with Eric Kidd's article that also 'refactors' probability theory.)

What just happened?
Something miraculous just happened though it may have been lost in the details. We combined the list monad and the writer monad to get a new monad. We did it without using monad transformers and without specifying an order for the two monads. It just so happens in this case that the result was the same as using a monad transformer.

M-set with M-set
We can try other products of theories. It's tricky to deal with a theory combined with itself because repeating a type class in a context doesn't do anything. We need to make another type class that looks exactly like MSet but with different names. The result is that the product of the theory of M-sets and the theory of N-sets is the theory of (M×N)-sets. This agrees with what we'd get from using monad transformers. It also agrees with intuition. M-sets correspond to the effect of accumulating data in a monoid. The product theory corresponds to using two accumulators simultaneously.

(This makes me think type classes should take as arguments the name of the operations within them. That way a type can be an instance of the same type class in multiple ways. Compare with Agda modules.)

Monoid with monoid
This example illustrates why we can't expect a programming language to use the above method to combine theories. If an algebra has two multiplication operators with identities on it, and the two operators are compatible, then something surprising happens. The multiplications turn out to be the same operation. What's more, the operation is commutative. So the product of the theory of monoids with itself is the theory of commutative monoids. A free commutative monoid is a multiset. Multisets require a very different implementation to lists and I doubt any automatic algebra combiner in the near future could discover one. (The Eckmann-Hilton argument also appears here.)

The compatibility condition
To form the product of two theories we add in extra laws to ensure commutativity. If we don't add in such laws we get the sum of two theories. For the example theories I used here these sums can lead to quite complex types. For example the sum of the theory of M-sets and N-sets is, I think, the theory of (M∗N)-sets where M∗N is the "free product" of monoids. I think this is a bit of a messy object from the perspective of types. Other effects, however, may behave more nicely with respect to taking sums. I haven't yet investigated.

Conclusion
If you don't mind computing the relevant types by hand there are perfectly good alternatives to monad transformers for combining effects. But it seems very difficult to automatically combine theories. In fact, I expect finding canonical forms for the elements of free algebras for a product theory isn't even computable. So this approach isn't going to replace monad transformers any time soon.

Exercise
Make a multiplication table showing the result of forming the product of algebras for lots of useful effects.
