A Neighborhood of Infinity

What does it take to be a hero? and other questions from statistical mechanics.

2024-09-15T12:03:00.000-07:00

1 We only hear about the survivors

In the classic Star Trek episode Errand of Mercy, Spock computes the chance of success:

CAPTAIN JAMES T. KIRK : What would you say the odds are on our getting out of here?

MR. SPOCK : Difficult to be precise, Captain. I should say, approximately 7,824.7 to 1.

And yet they get out of there. Are Spock’s probability computations unreliable? Think of it another way. The Galaxy is a large place. There must be tens of thousands of Spocks, and Grocks, and Plocks out there on various missions. But we won’t hear (or don’t want to hear) about the failures. So they may all be perfectly good at probability theory, but we’re only hearing about the lucky ones. This is an example of survivor bias.

2 Simulation

We can model this. I’ve written a small battle simulator for a super-simple made up role-playing game...

And the rest of this article can be found at github

(Be sure to download the actual PDF if you want to be able to follow links.)

How to hide information from yourself in a solo RPG

2024-09-07T16:06:00.000-07:00

A more stable version of this article can be found on github.

The Problem

Since the early days of role-playing games there has been debate over which rolls the GM should make and which are the responsibility of the players. But I think that for “perception” checks it doesn’t really make sense for a player to roll. If, as a player, you roll to hear behind a door and succeed, but you’re told there is no sound, then you know there is nothing to be heard. But you ought to just be left in suspense.

If you play a solo RPG the situation is more challenging. If there is a probability p of a room being occupied, and probability q of you hearing the occupant if you listen at the door, how can you simulate listening without making a decision about whether the room is occupied before opening the door? I propose a little mathematical trick.

Helena Listening, by Arthur Rackham

Simulating conditional probabilities

Suppose P(M) = p and P(H|M) = q (and P(H|not M) = 0). Then P(H) = pq. So to simulate the probability of hearing something at a new door: roll to see if a monster is present, and then roll to hear it. If both come up positive then you hear a noise.

But...but...you object, if the first roll came up positive you know there is a monster, removing the suspense if the second roll fails. Well this process does produce the correct (marginal) probability of hearing a noise at a fresh door. So you reinterpret the first roll not as determining whether a monster is present, but as just the first step in a two-step process to determine if a sound is heard.

But what if no sound is heard and we decide to open the door? We need to reduce the probability that we find a monster behind the door. In fact we need to sample P(M|not H). We could use Bayes’ theorem to compute this but chances are you won’t have any selection of dice that will give the correct probability. And anyway, you don’t want to be doing mathematics in the middle of a game, do you?

There’s a straightforward trick. In the event that you heard no noise at the door and want to now open the door: roll (again) to see if there is a monster behind the door, and then roll to listen again. If the outcome of the two rolls matches the information that you know, ie. it predicts you hear nothing, then you can now accept the first roll as determining whether the monster is present. In that case the situation is more or less vacuously described by P(M|not H). If the two rolls disagree with what you know, ie. they predict you hear something, then repeat the roll of two dice. Keep repeating until it agrees with what you know.

In general

There is a general method here though it’s only practical for simple situations. If you need to generate some hidden variables as part of a larger procedure, just generate them as usual, keep the variables you observe, and discard the hidden part. If you ever need to generate those hidden variables again, and remain consistent with previous rolls, resimulate from the beginning, restarting the rolls if they ever disagree with your previous observations.

In principle you could even do something like simulate an entire fight against a creature whose hit points remain unknown to you. But you’ll spend a lot of time rerolling the entire fight from the beginning. So It’s better for situations that only have a small number of steps, like listening at a door.

What does it mean for a monad to be strong?

2023-08-21T08:23:00.001-07:00

This is something I put on github years ago but I probably should have put it here.

Here's an elementary example of the use of the list monad:

> test1 = do
>   x <- [1, 2]
>   y <- [x, 10*x]
>   [x*y]

We can desugar this to:

> test2 = [1, 2] >>= \x -> [x, 10*x] >>= \y -> [x*y]

It looks like we start with a list and then apply a sequence (of length 2) of functions to it using bind (>>=). This is probably why some people call monads workflows and why the comparison has been made with Unix pipes.

But looks can be deceptive. The operator (>>=) is right associative and test2 is the same as test3:

> test3 = [1, 2] >>= (\x -> [x, 10*x] >>= \y -> [x*y])

You can try to parenthesise the other way:

> -- test4 = ([1, 2] >>= \x -> [x, 10*x]) >>= \y -> [x*y]

We get a "Variable not in scope: x" error. So test1 doesn't directly fit the workflow model. When people give examples of how workflow style things can be seen as monads they sometimes use examples where later functions don't refer to variables defined earlier. For example at the link I gave above the line m >>= x-> (n >>= y-> o) is transformed to (m >>= x-> n) >>= y-> o which only works if o makes no mention of x. I found similar things to be true in a number of tutorials, especially the ones that emphasise the Kleisli category view of things.

But we can always "reassociate" to the left with a little bit of extra work. The catch is that the function above defined by y-> ... "captures" x from its environment. So it's not just one function, it's a family of functions parameterised by x. We can fix this by making the dependence on x explicit. We can then pull the inner function out as it's no longer implicitly dependent on its immediate context. When compilers do this it's called lambda lifting.

Define (the weirdly named function) strength by

> strength :: Monad m => (x, m y) -> m (x, y)
> strength (x, my) = do
>   y <- my
>   return (x, y)

It allows us to smuggle x "into the monad".

And now we can rewrite test1, parenthesising to the left:

> test5 = ([1, 2] >>= \x -> strength (x, [x, 10*x])) >>= \(x, y) -> [x*y]

This is much more like a workflow. Using strength we can rewrite any (monadic) do expression as a left-to-right workflow, with the cost of having to throw in some applications of strength to carry along all of the captured variables. It's also using a composition of arrows in the Kleisli category.

A monad with a strength function is called a strong monad. Clearly all Haskell monads are strong as I wrote strength to work with any Haskell monad. But not all monads in category theory are strong. It's a sort of hidden feature of Haskell (and the category Set) that we tend not to refer to explicitly. It could be said that we're implicitly using strength whenever we refer to earlier variables in our do expressions.

Constructing Clifford Algebras using the Super Tensor Product

2023-03-05T11:38:00.007-08:00

Google have stopped supporting the Chart API so all of the mathematics notation below is missing. There is a PDF version of this article at GitHub.

Some literate Haskell but little about this code is specific to Haskell...

> {-# LANGUAGE DataKinds #-}
> {-# LANGUAGE TypeFamilies #-}
> {-# LANGUAGE TypeOperators #-}
> {-# LANGUAGE UndecidableInstances #-}
> 
> import GHC.TypeLits

Introduction

This is a followup to Geometric Algebra for Free and More Low Cost Geometric Algebra.

In those articles I showed how you could build up the Clifford algebras like so:

type Cliff1  = Complex R
type Cliff1' = Split R
type Cliff2  = Quaternion R
type Cliff2' = Matrix R
type Cliff3  = Quaternion Cliff1'
type Cliff3' = Matrix Cliff1
type Cliff4  = Quaternion Cliff2'
type Cliff4' = Matrix Cliff2
type Cliff5  = Quaternion Cliff3'
...

I used CliffN as the Clifford algebra for a negative definite inner product and CliffN' for the positive definite case. It's not a completely uniform sequence in the sense that CliffN is built from CliffN' for dimension two lower and you use a mix of Matrix and Quaternion.

The core principle making this work is that for type constructors implemented like Matrix, Quaternion etc. we have the property that

eg. Matrix (Quaternion Float) is effectively the same thing as Matrix Float Quaternion Float.

But John Baez pointed out to me that you can build up the CliffN algebras much more simply enabling us to use these definitions:

> type Cliff1 = Complex Float
> type Cliff2 = Complex Cliff1
> type Cliff3 = Complex Cliff2
> type Cliff4 = Complex Cliff3
> type Cliff5 = Complex Cliff4

...

Or even better:

> type family Cliff (n :: Nat) :: * where
>   Cliff 0 = Float
>   Cliff n = Complex (Cliff (n - 1))

But there's one little catch. We have to work, not with the tensor product, but the super tensor product.

We define Complex the same way as before:

> data Complex a = C a a deriving (Eq, Show)

Previously we used a definition of multiplication like this:

instance Num a => Num (Complex a) where
  C a b * C c d = C (a * c - b * d) (a * d + b * c)

We can think of C a b in Complex R as representing the element . The definition of multiplication in a tensor product of algebras is defined by . So we have .

This means that line of code we wrote above defining * for Complex isn't simply a definition of multiplication of complex numbers, it says how to multiply in an algebra tensored with the complex numbers.

Let's go Super!

A superalgebra is an algebra graded by where is the ring of integers modulo 2. What that means is that we have some algebra that can be broken down as a direct sum (the subscripts live in ) with the property that multiplication respects the grading, ie. if is in and is in then is in .

The elements of are called "even" (or bosonic) and those in "odd" (or fermionic). Often even elements commute with everything and odd elements anticommute with each other but this isn't always the case. (The superalgebra is said to be supercommutative when this happens. This is a common pattern: a thing X becomes a superX if it has odd and even parts and swapping two odd things introduces a sign flip.)

The super tensor product is much like the tensor product but it respects the grading. This means that if is in and is in then is in . From now on I'm using to mean super tensor product.

Multiplication in the super tensor product of two superalgebras and is now defined by the following modified rule: if is in and is in then . Note that the sign flip arises when we shuffle an odd left past an odd .

The neat fact that John pointed out to me is that .

We have to modify our definition of * to take into account that sign flip.

I initially wrote a whole lot of code to define a superalgebra as a pair of algebras with four multiplication operations and it got a bit messy. But I noticed that the only specifically superalgebraic operation I ever performed on an element of a superalgebra was negating the odd part of an element.

So I could define SuperAlgebra like so:

class SuperAlgebra a where
  conjugation :: a -> a

where conjugation is the negation of the odd part.

(I'm not sure if this operation corresponds to what is usually called conjugation in this branch of mathematics.)

But there's a little efficiency optimization I want to write. If I used the above definition, then later I'd often find myself computing a whole lot of negates in a row. This means applying negate to many elements of large algebraic objects even though any pair of them cancel each other's effect. So I add a little flag to my conjugation function that is used to say we want an extra negate and we can accumulate flips of a flag rather than flips of lots of elements.

> class SuperAlgebra a where
>   conjugation :: Bool -> a -> a

Here's our first instance:

> instance SuperAlgebra Float where
>   conjugation False x = x
>   conjugation True x = negate x

This is saying that the conjugation is the identity on Float but if we want to perform an extra flip we can set the flag to True. Maybe I should call it conjugationWithOptionalExtraNegation.

And now comes the first bit of non-trivial superalgebra:

> instance (Num a, SuperAlgebra a) => SuperAlgebra (Complex a) where
>   conjugation e (C a b) = C (conjugation e a) (conjugation (not e) b)

We consider to be even and to be odd. When we apply the conjugation to then we can just apply it directly to . But that flips the "parity" of (because tensor product respects the grading) so we need to swap when we use the conjugation. And that should explain why conjugation is defined the way it is.

Now we can use the modified rule for defined above:

> instance (Num a, SuperAlgebra a) => Num (Complex a) where
>   fromInteger n = C (fromInteger n) 0
>   C a b + C a' b' = C (a + a') (b + b')
>   C a b * C c d = C (a * c - conjugation False b * d)
>                     (conjugation False a * d + b * c) 
>   negate (C a b) = C (negate a) (negate b)
>   abs = undefined
>   signum = undefined

For example, conjugation False is applied to the first on the RHS because implicitly represents an term and when expanding out the product we shuffle the (odd) in left of . It doesn't get applied to the second because and remain in the same order.

That's it!

Tests

I'll test it with some examples from Cliff3:

> class HasBasis a where
>   e :: Integer -> a


> instance HasBasis Float where
>   e = undefined


> instance (Num a, HasBasis a) => HasBasis (Complex a) where
>   e 0 = C 0 1
>   e n = C (e (n - 1)) 0


> make a b c d e f g h =
>   C (C (C a b) (C c d))
>     (C (C e f) (C g h))


> e1, e2, e3, e21, e31, e32, e321 :: Cliff 3
> e1 = e 0
> e2 = e 1
> e21 = e2 * e1
> e3 = e 2
> e31 = e3 * e1
> e32 = e3 * e2
> e321 = e3 * e2 * e1


> main = do
>     print (e1 * e1 + 1 == 0)
>     print (e31 * e31 + 1 == 0)
>     print (e3 * e3 + 1 == 0)
>     print (e21 * e21 + 1 == 0)
>     print (e2 * e2 + 1 == 0)
>     print (e32 * e32 + 1 == 0)
>     print (e321 * e321 - 1 == 0)
>     print (e3 * e2 * e1 - e321 == 0)
>     print (e2 * e1 - e21 == 0)
>     print (e3 * e1 - e31 == 0)
>     print (e3 * e2 - e32 == 0)
>     print (e21 * e32 - e31 == 0)

Observation

The implementation of multiplication looks remarkably like it's the Cayley-Dickson construction. It can't be (because iterating it three times gives you a non-associative algebra but the Clifford algebras are associative). Nonetheless, I think comparison with Cayley-Dickson may be useful.

Efficiency

As mentioned above, before I realised I just needed the conjugation operation I wrote the above code with an explicit split of a superalgebra into two pieces intertwined by four multiplications. I think the previous approach may have a big advantage - it may be possible to use variations on the well known "speed-up" of complex multiplication that uses three real multiplications instead of four. This should lead to a fast implementation of Clifford algebras.

Also be warned: you can kill GHC if you turn on optimization and try to multiply elements of high-dimensional Clifford algebras. I think it tries to inline absolutely everything and you end up with a block of code that grows exponentially with .

Note also that this code translates directly into many languages.

Some pointers to things not in this blog

2020-09-05T12:45:00.003-07:00

Some pointers to things not in this blog

One reason I haven't blogged much recently is that my tolerance for blogger.com has reached its limit and I've been too lazy to build my own platform supporting mathematics and code. (For example, I can't get previewing on blogger to work today so I'm just publishing this and hope the reformatting is acceptable.) But that doesn't mean I haven't posted stuff publicly. So here are some thematically related links to things I've written on github and colab.

Continuations, effects and runners

How to slice your code into continuations

Handling Effects with Jax

Are these runners?
Just a little snippet of code to illustrate how Python's coroutines can be used to support composable runners. See http://math.andrej.com/2019/10/28/runners-in-action/
(The answer is yes.)

Parallel audio

Parallel audio synthesis

Simulating Plucked Strings with TensorFlow

FWIW I think Colab might be my favourite place to share stuff publicly if it supported environments other than Python.

Why is nuclear fusion so hard?

2018-12-08T12:47:00.000-08:00

Why does water fall out of an inverted cup?

Before considering nuclear fusion, let's consider something much more familiar. If you turn a cup full of water upside down, the water falls out. Why is this? It seems obvious: with nothing supporting the water from below, gravity pulls the water out of the cup. But let's look a little closely at the left side of Figure 1. We have water in an inverted cup under the influence of gravity. But there is air pressure at the surface of the water. At sea level, atmospheric pressure is about the weight of 1kg per cm², easily enough to keep the water in the cup. So now we still have our question: why does it come out?

Consider the diagram on the right of Figure 1. I have drawn some ripples on the surface of the water. Air pressure provides a force perpendicular to the water surface which means that around the ripples we no longer have a force pointing straight up. The force points partly sideways and this serves to deform the shape of the water surface. But as the water surface becomes even more deformed the forces become even more distorted away from vertical causing a feedback loop. So we can expect even the tiniest of ripples to grow to the point where the water completely changes shape and it eventually deforms its way out of the glass.

Figure 1

This is an example of an instability - in this case it's called the Rayleigh-Taylor instability. And this suggests an approach to keeping water in an inverted cup - suppress the ripples. One way to do this is to fill the cup to the top and place a piece of paper over the surface. Here is an example on YouTube:

Another approach is to apply thickening agents to the water to dampen the ripples. Here's an example showing how hard it is to get jello out of a bowl:

Nuclear fusion

And now I can discuss one of the challenges facing controlled nuclear fusion. When two suitable atomic nuclei are made to fuse they release energy. But nuclei are positively charged and repel. So in order to cause nuclei to fuse we need to overcome this repulsion, typically by making them collide at very high velocities. To achieve this in useful numbers one way is to heat a suitable fuel to a high temperature in a small enough volume. Temperatures of at least 10,000,000K are required, and possibly orders of magnitude higher. We know how to do the heating, but we also have to confine the fuel - which at these temperatures takes the form of a plasma because its nuclei and electrons become separated. We can't simply use a solid container to hold the plasma as no container can survive such temperatures. Instead we need some other confinement method such as magnetic confinement, where we use magnetic fields to control the plasma, or inertial confinement where we use explosions around the fuel to drive the fuel into a small volume. Both of these suffer from the same problem: we're using a force to control a fluid and this is subject to instabilities. Confining a plasma is like trying to keep water in an inverted cup by blowing on the water with fans. It's really hard. What's more, plasma has incredibly complex dynamics that is subject to many different kinds of instability. Around fifty named instabilities are listed on Wikipedia. Here is an example of one in a real plasma. It is a kink instability that causes a line of plasma with a current through it to start acquiring a corkscrew shape so that it eventually tears itself apart:

And that's one of the biggest challenges facing fusion energy today: we don't know how to keep a plasma stable for long enough.

Is there any hope for fusion?

The description above is qualitative. To determine how severe any particular instability is we need to run experiments or use some theory. Some kinds of instability are slow enough that we can control them. For example bicycles are unstable, but many of us eventually learn to ride them. In a thermonuclear weapon stability is only required for a short amount of time, something that was achievable back in the 50s. And sometimes the feedback loops in the physics are less troublesome than might be expected so that Tokamaks (and other configurations) have operating modes that turn out to be relatively stable, for example the "H-mode". So maybe we'll have working fusion reactors in 20 years, or 30 years, or is it 50?

Running from the past

2018-10-20T16:22:00.000-07:00

Important Note

Google have stopped supporting the Chart API so all of the mathematics notation below is missing. There is a PDF version of this article at GitHub.

Preface

Functional programming encourages us to program without mutable state. Instead we compose functions that can be viewed as state transformers. It's a change of perspective that can have a big impact on how we reason about our code. But it's also a change of perspective that can be useful in mathematics and I'd like to give an example: a really beautiful technique that alows you to sample from the infinite limit of a probability distribution without needing an infinite number of operations. (Unless you're infinitely unlucky!)

Markov Chains

A Markov chain is a sequence of random states where each state is drawn from a random distribution that possibly depends on the previous state, but not on any earlier state. So it is a sequence such that for all . A basic example might be a model of the weather in which each day is either sunny or rainy but where it's more likely to be rainy (or sunny) if the previous day was rainy (or sunny). (And to be technically correct: having information about two days or earlier doesn't help us if we know yesterday's weather.)

Like imperative code, this description is stateful. The state at step depends on the state at step . Probability is often easier to reason about when we work with independent identically drawn random variables and our aren't of this type. But we can eliminate the state from our description using the same method used by functional programmers.

Let's choose a Markov chain to play with. I'll pick one with 3 states called , and and with transition probabilities given by where

Here's a diagram illustrating our states:

Implementation

First some imports:

> {-# LANGUAGE LambdaCase #-}
> {-# LANGUAGE TypeApplications #-}


> import Data.Sequence(replicateA)
> import System.Random
> import Control.Monad.State
> import Control.Monad
> import Data.List
> import Data.Array

And now the type of our random variable:

> data ABC = A | B | C deriving (Eq, Show, Ord, Enum, Bounded)

We are now in a position to simulate our Markov chain. First we need some random numbers drawn uniformly from [0, 1]:

> uniform :: (RandomGen gen, MonadState gen m) => m Double
> uniform = state random

And now the code to take a single step in the Markov chain:

> step :: (RandomGen gen, MonadState gen m) => ABC -> m ABC
> step A = do
>     a <- uniform
>     if a < 0.5
>         then return A
>         else return B
> step B = do
>     a <- uniform
>     if a < 1/3.0
>         then return A
>         else if a < 2/3.0
>             then return B
>             else return C
> step C = do
>     a <- uniform
>     if a < 0.5
>         then return B
>         else return C

Notice how the step function generates a new state at random in a way that depends on the previous state. The m ABC in the type signature makes it clear that we are generating random states at each step.

We can simulate the effect of taking steps with a function like this:

> steps :: (RandomGen gen, MonadState gen m) => Int -> ABC -> m ABC
> steps 0 i = return i
> steps n i = do
>     i <- steps (n-1) i
>     step i

We can run for 100 steps, starting with , with a line like so:

*Main> evalState (steps 3 A) gen
B

The starting state of our random number generator is given by gen.

Consider the distribution of states after taking steps. For Markov chains of this type, we know that as goes to infinity the distribution of the th state approaches a limiting "stationary" distribution. There are frequently times when we want to sample from this final distribution. For a Markov chain as simple as this example, you can solve exactly to find the limiting distribution. But for real world problems this can be intractable. Instead, a popular solution is to pick a large and hope it's large enough. As gets larger the distribution gets closer to the limiting distribution. And that's the problem I want to solve here - sampling from the limit. It turns out that by thinking about random functions instead of random states we can actually sample from the limiting distribution exactly.

Some random functions

Here is a new version of our random step function:

> step' :: (RandomGen gen, MonadState gen m) => m (ABC -> ABC)
> step' = do
>     a <- uniform
>     return $ \case
>         A -> if a < 0.5 then A else B
>         B -> if a < 1/3.0
>                 then A
>                 else if a < 2/3.0 then B else C
>         C -> if a < 0.5 then B else C

In many ways it's similar to the previous one. But there's one very big difference: the type signature m (ABC -> ABC) tells us that it's returning a random function, not a random state. We can simulate the result of taking 10 steps, say, by drawing 10 random functions, composing them, and applying the result to our initial state:

> steps' :: (RandomGen gen, MonadState gen m) => Int -> m (ABC -> ABC)
> steps' n = do
>   fs <- replicateA n step'
>   return $ foldr (flip (.)) id fs

Notice the use of flip. We want to compose functions , each time composing on the left by the new . This means that for a fixed seed gen, each time you increase by 1 you get the next step in a single simulation: (BTW I used replicateA instead of replicateM to indicate that these are independent random draws. It may be well known that you can use Applicative instead of Monad to indicate independence but I haven't seen it written down.)

*Main> [f A | n <- [0..10], let f = evalState (steps' n) gen]
[A,A,A,B,C,B,A,B,A,B,C]

When I first implemented this I accidentally forgot the flip. So maybe you're wondering what effect removing the flip has? The effect is about as close to a miracle as I've seen in mathematics. It allows us to sample from the limiting distribution in a finite number of steps!

Here's the code:

> steps_from_past :: (RandomGen gen, MonadState gen m) => Int -> m (ABC -> ABC)
> steps_from_past n = do
>   fs <- replicateA n step'
>   return $ foldr (.) id fs

We end up building . This is still a composition of independent identically distributed functions and so it's still drawing from exactly the same distribution as steps'. Nonetheless, there is a difference: for a particular choice of seed, steps_from_past n no longer gives us a sequence of states from a Markov chain. Running with argument draws a random composition of functions. But if you increase by 1 you don't add a new step at the end. Instead you effectively restart the Markov chain with a new first step generated by a new random seed.

Try it and see:

*Main> [f A | n <- [0..10], let f = evalState (steps_from_past n) gen]
[A, A, A, A, A, A, A, A, A, A]

Maybe that's surprising. It seems to get stuck in one state. In fact, we can try applying the resulting function to all three states.

*Main> [fmap f [A, B, C] | n <- [0..10], let f = evalState (steps_from_past n) gen]
[[A,B,C],[A,A,B],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A]]

In other words, for large enough we get the constant function.

Think of it this way: If f isn't injective then it's possible that two states get collapsed to the same state. If you keep picking random f's it's inevitable that you will eventually collapse down to the point where all arguments get mapped to the same state. Once this happens, we'll get the same result no matter how large we take . If we can detect this then we've found the limit of as goes to infinity. But because we know composing forwards and composing backwards lead to draws from the same distribution, the limiting backward composition must actually be a draw from the same distribution as the limiting forward composition. That flip can't change what probability distribution we're drawing from - just the dependence on the seed. So the value the constant function takes is actually a draw from the limiting stationary distribution.

We can code this up:

> all_equal :: (Eq a) => [a] -> Bool
> all_equal [] = True
> all_equal [_] = True
> all_equal (a : as) = all (== a) as


> test_constant :: (Bounded a, Enum a, Eq a) => (a -> a) -> Bool
> test_constant f =
>     all_equal $ map f $ enumFromTo minBound maxBound

This technique is called coupling from the past. It's "coupling" because we've arranged that different starting points coalesce. And it's "from the past" because we're essentially asking answering the question of what the outcome of a simulation would be if we started infinitely far in the past.

> couple_from_past :: (RandomGen gen, MonadState gen m, Enum a, Bounded a, Eq a) =>
>                   m (a -> a) -> (a -> a) -> m (a -> a)
> couple_from_past step f = do
>     if test_constant f
>         then return f
>         else do
>             f' <- step
>             couple_from_past step (f . f')

We can now sample from the limiting distribution a million times, say:

*Main> let samples = map ($ A) $ evalState (replicateA 1000000 (couple_from_past step' id)) gen

We can now count how often A appears:

*Main> fromIntegral (length $ filter (== A) samples)/1000000
0.285748

That's a pretty good approximation to , the exact answer that can be found by finding the eigenvector of the transition matrix corresponding to an eigenvalue of 1.

> gen = mkStdGen 669

Notes

The technique of coupling from the past first appeared in a paper by Propp and Wilson. The paper Iterated Random Functions by Persi Diaconis gave me a lot of insight into it. Note that the code above is absolutely not how you'd implement this for real. I wrote the code that way so that I could switch algorithm with the simple removal of a flip. In fact, with some clever tricks you can make this method work with state spaces so large that you couldn't possibly hope to enumerate all starting states to detect if convergence has occurred. Or even with uncountably large state spaces. But I'll let you read the Propp-Wilson paper to find out how.

A tail we don't need to wag

2017-10-14T13:42:00.000-07:00

Introduction

I've been reading a little about concentration inequalities recently. I thought it would be nice to see if you can use the key idea, if not the actual theorems, to reduce the complexity of computing the probability distribution of the outcome of stochastic simulations. Examples might include random walks, or queues.

The key idea behind concentration inequalities is that very often most of the probability is owned by a small proportion of the possible outcomes. For example, if we toss a fair coin enough (say ) times we expect the number of heads to lie within of the mean about 99.99% of the time despite there being different total numbers possible. The probable outcomes tend to concentrate around the expectation. On the other hand, if we consider not the total number of heads, but the possible sequences of tosses, there are possibilities, all equally likely. In this case there is no concentration. So a key ingredient here is a reduction operation: in this case reducing a sequence of tosses to a count of the number that came up heads. This is something we can use in a computer program.

I (and many others) have written about the "vector space" monad that can be used to compute probability distributions of outcomes of simulations and I'll assume some familiarity with that. Essentially it is a "weighted list" monad which is similar to the list monad except that in addition to tracking all possible outcomes, it also propagates a probability along each path. Unfortunately it needs to follow through every possible path through a simulation. For example, in the case of simulating coin tosses it needs to track different possiblities, even though we're only interested in the possible sums. If, after each bind operation of the monad, we could collect together all paths that give the same total then we could make this code much more efficient. The catch is that to collect together elements of a type the elements need to be comparable, for example instances of Eq or Ord. This conflicts with the type of Monad which requires that we can use the >>= :: m a -> (a -> m b) -> m b and return :: a -> m a functions with any types a and b.

I'm going to deal with this by adapting a technique presented by Oleg Kiselyov for efficiently implementing the Set monad. Instead of Set I'm going to use the Map type to represent probability distributions. These will store maps saying, for each element of a type, what the probability of that element is. So part of my code is going to be a direct translation of that code to use the Map type instead of the Set type.

> {-# LANGUAGE GADTs, FlexibleInstances #-}
> {-# LANGUAGE ViewPatterns #-}


> module Main where


> import Control.Monad
> import Control.Arrow
> import qualified Data.Map as M
> import qualified Data.List as L

The following code is very similar to Oleg's. But for first reading I should point out some differences that I want you to ignore. The type representing a probability distribution is P:

> data P p a where
>     POrd :: Ord a => p -> M.Map a p -> P p a
>     PAny :: p -> [(a, p)] -> P p a

But note how the constructors take two arguments - a number that is a probability, in addition to a weighted Map or list. For now pretend that first argument is zero and that the functions called trimXXX act similarly to the identity:

> instance (Ord p, Num p) => Functor (P p) where
>     fmap = liftM


> instance (Ord p, Num p) => Applicative (P p) where
>     pure = return
>     (<*>) = ap


> instance (Ord p, Num p) => Monad (P p) where
>     return x = PAny 0 [(x, 1)]
>     m >>= f = 
>         let (e, pdf) = unP m
>         in trimAdd e $ collect $ map (f *** id) pdf


> returnP :: (Ord p, Num p, Ord a) => a -> P p a
> returnP a = POrd 0 $ M.singleton a 1


> unP :: P p a -> (p, [(a, p)])
> unP (POrd e pdf) = (e, M.toList pdf)
> unP (PAny e pdf) = (e, pdf)


> fromList :: (Num p, Ord a) => [(a, p)] -> M.Map a p
> fromList = M.fromListWith (+)


> union :: (Num p, Ord a) => M.Map a p -> M.Map a p -> M.Map a p
> union = M.unionWith (+)


> scaleList :: Num p => p -> [(a, p)] -> [(a, p)]
> scaleList weight = map (id *** (weight *))


> scaleMap :: (Num p, Ord a) => p -> M.Map a p -> M.Map a p
> scaleMap weight = fromList . scaleList weight . M.toList

This is a translation of Oleg's crucial function that allows us to take a weighted list of probability distributions and flatten them down to a single probability distribution:

> collect :: Num p => [(P p a, p)] -> P p a
> collect []  = PAny 0 []
> collect ((POrd e0 pdf0, weight) : rest) =
>     let wpdf0 = scaleMap weight pdf0
>     in case collect rest of
>       POrd e1 pdf1 -> POrd (weight*e0+e1) $ wpdf0 `union` pdf1
>       PAny e1 pdf1 -> POrd (weight*e0+e1) $ wpdf0 `union` fromList pdf1
> collect ((PAny e0 pdf0, weight) : rest) =
>     let wpdf0 = scaleList weight pdf0
>     in case collect rest of
>       POrd e1 pdf1 -> POrd (weight*e0+e1) $ fromList wpdf0 `union` pdf1
>       PAny e1 pdf1 -> PAny (weight*e0+e1) $ wpdf0 ++ pdf1

But now I really must explain what the first argument to POrd and PAny is and why I have all that "trimming".

Even though the collect function allows us to reduce the number of elements in our PDFs, we'd like to take advantage of concentration of probability to reduce the number even further. The trim function keeps only the top probabilities in a PDF, discarding the rest. To be honest, this is the only point worth taking away from what I've written here :-)

When we throw away elements of the PDF our probabilities no longer sum to 1. So I use the first argument of the constructors as a convenient place to store the amount of probability that I've thrown away. The trim function keeps the most likely outcomes and sums the probability of the remainder. I don't actually need to keep track of what has been discarded. In principle we could reconstruct this value by looking at how much the probabilities in our trimmed partial PDFs fall short of summing to 1. But confirming that our discarded probability and our partial PDF sums to 1 gives a nice safety check for our code and can give us some warning if numerical errors start creeping in. I'll call the total discarded probability the tail probability.

Here is the core function to keep the top values. In this case is given by a global constant called trimSize. (I'll talk about how to do this better later.)

> trimList :: (Ord p, Num p) => [(a, p)] -> (p, [(a, p)])
> trimList ps =
>     let (keep, discard) = L.splitAt trimSize (L.sortOn (negate . snd) ps)
>     in (sum (map snd discard), keep)


> trimAdd :: (Ord p, Num p) => p -> P p a -> P p a
> trimAdd e' (POrd e pdf) =
>     let (f, trimmedPdf) = trimList (M.toList pdf)
>     in POrd (e'+e+f) (M.fromList trimmedPdf)
> trimAdd e' (PAny e pdf) =
>     let (f, trimmedPdf) = trimList pdf
>     in PAny (e'+e+f) trimmedPdf


> runP :: (Num p, Ord a) => P p a -> (p, M.Map a p)
> runP (POrd e pdf) = (e, pdf)
> runP (PAny e pdf) = (e, fromList pdf)

And now some functions representing textbook probability distributions. First the uniform distribution on a finite set. Again this is very similar to Oleg's chooseOrd function apart from the fact that it assigns weights to each element:

> chooseP :: (Fractional p, Ord p, Ord a) =>
>            [a] -> P p a
> chooseP xs = let p = 1/fromIntegral (length xs)
>              in POrd 0 $ fromList $ map (flip (,) p) xs

And the Bernoulli distribution, i.e. tossing a Bool coin that comes up True with probability :

> bernoulliP :: (Fractional p, Ord p) =>
>               p -> P p Bool
> bernoulliP p = POrd 0 $ fromList $ [(False, 1-p), (True, p)]

Now we can try a random walk in one dimension. At each step we have a 50/50 chance of standing still or taking a step to the right:

> random_walk1 :: Int -> P Double Int
> random_walk1 0 = returnP 0
> random_walk1 n = do
>     a <- random_walk1 (n-1)
>     b <- chooseP [0, 1]
>     returnP $ a+b

Below in main we take 2048 steps but only track 512 probabilities. The tail probability in this case is about . So only tracking 1/4 of the outcomes has had almost no impact on the numbers. This also illustrates why it is good to track the tail probabilities rather than inferring them from the missing probabilities in the bulk of the PDF - they can be so small they vanish compared to floating poimnt errors. We can afford to track a lot fewer than 512 (out of 2049 possible) outcomes and still have a good representative PDF.

Now here's a two-dimensional random walk for 32 steps. The tail probability is about so we are getting a reasonably representative PDF. We have to run fewer steps than before, however, because the space of possible outcomes spans two dimensions, meaning that reduction doesn't help as much as it does in one dimension.

> random_walk2 :: Int -> (Int, Int) -> P Double (Int, Int)
> random_walk2 0 (x, y) = returnP (x, y)
> random_walk2 n (x, y) = do
>     (x',y') <- random_walk2 (n-1) (x, y)
>     dx <- chooseP [-1, 1]
>     dy <- chooseP [-1, 1]
>     returnP (x'+dx, y'+dy)

One last simulation. This is a queing scenario. Tasks come in once every tick of the clock. There are four queues a task can be assigned to. A task is assigned to the shortest queue. Meanwhile each queue as a 1/4 probability of clearing one item at each tick of the clock. We build the PDF for the maximum length any queue has at any time.

The first argument to queue is the number of ticks of the clock. The second argument is the list of lengths of the queues. It returns a PDF, not just on the current queue size, but also on the longest queue it has seen.

> queue :: Int -> [Int] -> P Double (Int, [Int])
> queue 0 ls = returnP (maximum ls, ls)
> queue n ls = do
>     (longest, ls1) <- queue (n-1) ls
>     ls2 <- forM ls1 $ \l -> do
>         served <- bernoulliP (1/4)
>         returnP $ if served && l > 0 then l-1 else l
>     let ls3 = L.sort $ head ls2+1 : tail ls2
>     returnP (longest `max` maximum ls3, ls3)

For the queing simulation the tail probability is around despite the fact that we have discarded a vast possible set of possible outcomes.

It's a little ugly that trimSize is a global constant:

> trimSize = 512

The correct solution is probably to separate the probability "syntax" from its "semantics". In other words, we should implement a free monad supporting the language of probability with suitable constructors for bernoulliP and choiceP. We can then write a separate interpreter which takes a trimSize as argument. This has another advantage too: the Monad above isn't a true monad. It uses a greedy approach to discarding probabilities and different rearrangements of the code, that ought to give identical results, may end up diferent. By using a free monad we ensure that our interface is a true monad and we can put the part of the code that breaks the monad laws into the interpreter. The catch is that my first attempt at writing a free monad resulted in code with poor performance. So I'll leave an efficient version as an exercise :-)

> main = do
>     print $ runP $ random_walk1 2048
>     print $ runP $ random_walk2 32 (0, 0)
>     print $ runP $ do
>         (r, _) <- queue 128 [0, 0, 0, 0]
>         returnP r

What is a photon?

2017-08-11T20:22:00.000-07:00

Introduction

Popular science writing about quantum mechanics leaves many people full of questions about the status of photons. I want to answer some of these without using any tricky mathematics.

One of the challenges is that photons are very different to ordinary everyday objects like billiard balls. This is partly because photons are described by quantum mechanics whereas billiard balls are better modelled with classical Newtonian mechanics. Quantum mechanics defies many of our intuitions. But it's also because the word photon plays by different linguistic rules to billiard ball. I hope to explain why.

One of my goals is to avoid saying anything original. I'm largely going remove the mathematics from material I first learnt from three or so courses I took at Cambridge University many years ago: Quantum Mechanics, Solid State Physics and Quantum Field Theory. I also learnt about some of this from David Miller at Stanford University who talked a little about what properties it is meaningful to apply to a photon. (I hope I haven't misrepresented him too badly.)

The simple harmonic oscillator

Here's a mass hanging on a spring:

Suppose it's initially sitting in equilibrium so that the net force acting on it is zero. Now we lift the mass a small distance and let it go. Because we lifted it, we shortened the spring, reducing its tension. This means the force due to gravity is now more than the spring tension and the mass falls. Eventually it falls below the equilibrium point, increasing the tension in the spring so there is a net force pulling it back up again. To a good approximation, the force restoring the mass to its equilibrium point is proportional to how far it has been displaced. When this happens we end up with oscillating motion where the mass bounces up and down. Here's what a graph of its displacement looks like over time:

It's actually a sine wave but that detail doesn't matter for us right now.

An oscillator where the restoring force is proportional to the displacement from the equilibrium point is called a simple harmonic oscillator and its oscillation is always described by a sine wave.

Note that I'm ignoring friction here. This is a reasonable approximation for many physical systems.

Masses on springs aren't all that important in themselves. But simple harmonic oscillators are very common. Another standard example is the pendulum swinging under the influence of gravity:

At a more fundamental level, an example might be an atom in a crystal being held in place by electrostatic forces from its neighbouring atoms.

If you have one of these systems, then in principle you can set it in motion with as little energy as you like. Pull a mass on a spring down a little bit and it will bounce back up, oscillating a certain amount. Pull the mass down half the amount and it'll bounce with oscillations half the size. In principle we could keep repeating this experiment, each time starting with the mass displaced half the amount we tried previously. In other words, a simple harmonic oscillator can have any energy we like. The spectrum of possible energies of one of these oscillators is continuous. (Note that the word spectrum here is merely physicist-speak for a set of possible values.) If we can set one in motion with 1 unit of energy then we can also set it oscillating with 0.5 units, or 0.01 units, or 0.000123 units of energy.

Quantum mechanics

Everything I've said above is assuming that classical Newtonian mechanics is valid. But we know that for very small systems, around the size of a few atoms or smaller, we need to use quantum mechanics. This is an enormous topic but I'm only going to extract one basic fact. According to quantum mechanics, a simple harmonic oscillator isn't free to oscillate with any energy you like. The possible energy levels, the spectrum of the system, is discrete. There is a lowest energy level, and then all of the energy levels above that are equally spaced like so, going up forever:

We usually call the lowest energy level the ground state or vacuum state and call the higher levels excited states.

The spacing of the energy levels depends on the stiffness of the system, which is just a measure of how much the restoring force increases with displacement from equilibrium. Stiffer systems will have a higher frequency of oscillation and a bigger spacing between the energy levels.

(I'm deliberately not saying anything about why we get discrete energy levels in quantum mechanics. I just want to use this one fact so I can get on and talk about photons eventually.)

In practice the difference in energy between one level and the next is tiny. This means that if you're literally fiddling about with a mass on a spring you won't ever feel the discreteness. The amount your hand trembles is many orders of magnitude greater than the effect of this discreteness. Nonetheless, it is extremely important when modeling microscopic systems.

Quantum linguistics

Here are some English sentences we could say about the kinds of systems I've described so far:

This system is in the ground state.
That system is in its first excited state
This system is at an energy level higher than that system
After allowing these two similar oscillators to interact, the energy level of this oscillator went down and the energy level of that one went up by the same amount.

Now I want to introduce the (count) noun quantum, with plural quanta. The idea here is not that I'm telling you about a new entity. I want to present this as a new way to talk about things I've already introduced. So rather than give a definition of quantum I will instead show how you can rewrite the above sentences using the language of quanta:

There are no quanta in this system
That system has one quantum of energy
This system has more quanta than that one
Some quanta were transferred from this system to that system.

Those sentences make it seem like I'm talking about a new kind of object - the quantum. But I'm not. They're just a manner of speaking about energy levels. I hope I've given you enough examples to get the idea.

Just in case you think it's weird to talk about energy levels in terms of quanta, I'd like to remind you that you already do this all the time with money. Dollar bills are actual objects that exist in the world. But money in your bank account isn't. Somewhere in some database is a representation of how much money you have. You might say "I have one hundred dollars in my savings account" But those dollars certainly don't exist as distinct entities. It doesn't really make sense to talk about the thirty-seventh dollar in your bank account. You can transfer dollars from one account to another, and yet what's really happening is that two totals are being adjusted. We treat these accounts a lot like they're containers holding individual objects called dollars. Certainly our language is set up like that. But we know that it's really just the totals that have any kind of representation. The same goes for quanta. It's just a manner of speaking about systems that can have different amounts of energy and where the spectrum of energy levels forms a ladder with equally spaced rungs. Because of your experience with money I probably don't need to give you any more examples.

One more bit of terminology: when the spectrum of energies is discrete it's said to be quantised.

Coupled systems

Let's return to classical physics with a slightly more complex system consisting of two masses connected to springs. We ignore gravity now:

We restrict ourselves to just considering back and forth motion constrained along a horizontal line. This is a coupled system. If the left mass moves to the right, not only does it experience a restoring force pushing it left, but the mass on the right will experience more of a force pushing it to the left. We can't treat the masses as independent and so we don't get the simple solution of each mass always oscillating with a sine wave.

For this particular problem though there's a trick to turn it into a pair of harmonic oscillators. The idea is to consider the pair of masses as a single entity. We can think of the motion centre of mass of the pair, the midpoint between them, as being one variable that describes this entity. Let's call its motion the external motion. We can also think of the distance between the two masses in the pair as being the system's internal motion. (I'm just using internal and external as convenient names. Don't read too much into them.) It turns out that when you analyse this using classical dynamics the internal motion and the external motion act like independent quantities. What's more, each one behaves exactly like it's simple harmonic. So we get one sine wave describing the overall motion of the pair, and another one that describes how the elements of the pair oscillate with respect to each other.

The frequencies of the internal and external motions are typically different. So you can end up with some quite complicated motions with two different frequencies beating against each other.

When we're able to find ways to split up the motion into independent quantities, each of which is simple harmonic, each kind of motion is said to be a normal mode.

When you have independent normal modes, you can treat them independently in quantum mechanics too. So what we get is that the spectrum of possible energy levels for this system is, in some sense, two-dimensional. We can put quanta into the internal oscillation and we can also put quanta into the external oscillation. Because these modes have different frequencies the quanta for each mode correspond to different amounts of energy.

(And a reminder: when talking about quantum mechanics I'm not talking literally about masses on springs. I'm talking about physical systems that have equations of motion that mean they behave like masses on springs. In this case it might be a pair of particles trapped in a microscopic well with a repulsive force between them.)

Solid state physics

Now I'm going to jump from just two masses to a large number of them. For example, the behavior of trillions of atoms in a solid crystal can be approximately modelled by a grid of masses and springs, of which the following diagram is just a tiny piece:

A real crystal would be arranged in a 3D lattice but I've drawn 2D here for convenience.

Think of the springs as both pushing apart atoms that get close, and pulling together atoms that move apart.

This is a highly coupled system. Ultimately every atom in our lattice is connected to every other one, either directly, or indirectly. Nonetheless, it is still possible to find normal modes. The normal modes all have the same basic form: they are all sinusoidal waves of displacement traveling in some direction with some speed and oscillation frequency. Each of these modes consists of waves that extend through the entire crystal, with fixed spacing between parallel planar wavefronts. This type of waves is known as a plane wave. If the system is perfectly harmonic, so the restoring force is precisely proportional to the displacement, then each direction and frequency of wave oscillates its way through the crystal completely independently of any other. Just as how in the example with two masses any possible oscillation is a combination of internal and external motion, for a crystal lattice any motion is a combination of these plane waves. (Decomposing any oscillation as a combination of plane waves is known as computing its Fourier transform.

Now we're ready to consider this situation quantum mechanically. Because each plane wave is a normal mode, we can treat each one as an independent simple harmonic oscillator. This means that the energy in each plane wave is quantised. So when we consider a crystal lattice quantum mechanically we find that its states consist of plane waves propagating through it, but where the amount of energy in each wave is given by a discrete spectrum. So again we can talk about how many quanta there are in each mode.

Linguistically it gets a bit more interesting now. Each plane wave is associated with a particular direction and speed so it makes sense to talk of these quanta as having a direction and speed. But note that statements involving quanta are still really just sentences about energy levels. So, for example, the statement "the mode of this system with this velocity and frequency is in its first excited state" is, by definition, exactly the same as "this system has precisely one quantum with this velocity and frequency". In particular, when we write sentences like these we aren't implying that there is some new kind of object, the quantum, that has suddenly attached itself to our crystal. The quanta are properties of the lattice. By the way, in the particular case of vibrating atoms in a lattice, the quanta are known by a special name: phonons.

Quantum field theory and photons

And now we're ready to move onto photons.

In classical physics, electromagnetism is described by Maxwell's equations. Maxwell's equations say that a varying magnetic field generates an electric field and a varying electric field generates a magnetic field. The result is that it is possible for an oscillating electric field to create an oscillating electric field so that an electric field can propagate through space on its own without the help of electric charges or electric currents or any other kind of `generator'. As these electric fields also produce magnetic fields that propagate with them, the whole thing is called an electromagnetic wave.

Just like displacements in a crystal lattice, an electromagnetic wave also has normal modes. The normal modes are plane waves traveling at the speed of light in a particular directions with a given frequency. You have personal experience of this. Visible light is electromagnetic radiation with a frequency of around 500 THz. Wifi uses signals at around 5 GHz. The radio might use signals at around 100 MHz. When you surf the web wirelessly while listening to the radio, the wifi signals don't interfere with your vision or the radio signal. (Actually, wifi might interfere with the radio signals, but not because of the 5 GHz signals. It might happen if badly manufactured hardware emits stray signals around the 100 MHz band.) That's because these waves pass through each other without being coupled to each other in any way. And at this point you might already be guessing what a photon is. For each choice of frequency and direction (and also polarisation, but that's just a detail) the amount of energy that can be in the corresponding mode is quantised. For the electromagnetic field the quanta are called photons.

And that's it!

Electromagnetic waves can be thought of as being made up of different oscillation modes. Because of quantum mechanics, each mode contains an amount of energy that is quantised to be a whole number multiple of some base amount. Although the thing that really matters is the total amount of energy in the modes, it can still be useful to talk about this total as if it's a collection of entities called photons.

One thing to notice is that the normal modes for an electromagnetic wave are plane waves that are extended in space. In principle all the way across the universe but for practical problems physicists often consider electromagnetic waves in a large but finite box. This means that adding a quantum to a system has an effect that extends across the entire system. That makes it problematic to talk about the location of a photon.

Caveat

Physicists sometimes use the word photon in slightly different but related ways. I've described what I think of as the core definition as presented in many courses on quantum field theory.

Acknowledgements

Thanks to @dmoore2718 for encouraging me to edit this document down to a better size.

Self-referential logic via self-referential circuits

2017-07-15T09:09:00.000-07:00

Introduction

TL;DR The behaviour of a certain kind of delay component has a formal similarity to Löb's theorem which gives a way to embed part of provability logic into electronic circuits.

Here's a famous paradoxical sentence:

This sentence is false

If it's false then it's true and if it's true then it's false.

Here's a paradoxical electronic circuit:

The component in the middle is an inverter. If the output of the circuit is high then its input is high and then it's output must be low, and vice versa.

There's a similarity here. But with a bit of tweaking you can turn the similarity into an isomorphism of sorts.

In the first case we avoid paradox by noting that in the mathematical frameworks commonly used by mathematicians it's impossible, in general, for a statement to assert it's own falsity. Instead, a statement can assert its own unprovability and then we get Gödel's incompleteness theorems and a statement that is apparently true and yet can't be proved.

In the second case we can't model the circuit straightforwardly as a digital circuit. In practice it might settle down to a voltage that lies between the official high and low voltages so we have to model it as an analogue circuit. Or instead we can introduce a clock and arrange that the feedback in the circuit is delayed. We then get an oscillator circuit that can be thought of as outputting a stream of bits.

The observation I want to make is that if the feedback delay is defined appropriately, these two scenarios are in some sense isomorphic. This means that we can model classic results about provability, like Gödel's incompleteness theorems, using electronic circuits. We can even use such circuits to investigate what happens when logicians or robots play games like Prisoner's Dilemma. I'll be making use of results found in Boolos' book on The Logic of Provability and some ideas I borrowed from Smoryński's paper on Fixed Point Algebras. I'll be assuming the reader has at least a slight acquaintance with ithe ideas behind provability logic.

Provability Logic

There are many descriptions of provability logic (aka GL) available online, so I'm not going to repeat it all here. However, I've put some background material in the appendix below and I'm going to give a very brief reminder now.

Start with (classical) propositional calculus which has a bunch of variables with names like and connectives like for AND, for OR, for NOT and for implication. (Note that .)

Provability logic extends propositional calculus by adding a unary operator . (I apologise, that's meant to be a □ but it's coming out like in LaTeX formulae. I think it's a bug in Google's LaTeX renderer.) The idea is that asserts that is provable in Peano Arithmetic, aka PA. In addition to the axioms of propositional calculus we have

and

as well as a rule that allows us to deduce from .

We also have this fixed point property:

Let be any predicate we can write in the language of GL involving the variable , and suppose that every appearance of in is inside a , e.g. . Then there is a fixed point, i.e. a proposition that makes no mention of such that is a theorem. In effect, for any such , is a proposition that asserts .

See the appendix for a brief mention of why we should expect this to be true.

From the fixed point property we can deduce Löb's theorem: . There is a proof at wikipedia that starts from the fixed point property.

We can also deduce the fixed point property from Löb's theorem so it's more usual to take Löb's theorem as an axiom of GL and show that the fixed point property follows. You can think of Löb's theorem as a cunning way to encode the fixed point property. In fact you can argue that it's a sort of Y-combinator, the function that allows the formation of recursive fixed points in functional programming languages. (That's also, sort of, the role played by the loeb function I defined way back. But note that loeb isn't really a proof of Löb's theorem, it just has formal similarities.)

Back to electronic circuits

In order to make digital circuits with feedback loops well-behaved I could introduce a circuit element that results in a delay of one clock cycle. If you insert one of these into the inverter circuit I started with you'll end up with an oscillator that flips back and forth between 0 and 1 on each clock cycle. But I want to work with something slightly stricter. I'd like my circuits to eventually stop oscillating. (I have an ulterior motive for studying these.) Let me introduce this component:

It is intended to serve as a delayed latch and I'll always have the flow of data being from left to right. The idea is that when it is switched on it outputs 1. It keeps outputting 1 until it sees a 0 input. When that happens, then on the next clock cycle its output drops to 0 and never goes back up to 1 until reset.

Because the output of our delay-latch isn't a function of its current input, we can't simply describe its operation as a mathematical function from to . Instead let's think of electronic components as binary operators on bitstreams, i.e. infinite streams of binary digits like ...00111010 with the digits emerging over time starting with the one written on the right and working leftwards. The ordinary logic gates perform bitwise operations which I'll represent using the operators in the C programming language. For example,

...001110 & ...101010 = ...001010

and

~...101 = ...010

and so on. Let's use □ to represent the effect of latch-delay on a bitstream. We have, for example,

□...000 = ...001

and

□...11101111 = ...00011111.

The operator □ takes the (possibly empty) contiguous sequence of 1's at the end of the bitstream, extends it by one 1, and sets everything further to the left to 0. If we restrict ourselves to bitstreams that eventually become all 0's or all 1's on the left, then bitstreams are in one-to-one correspondence with the integers using the twos complement representation. For example ...111111, all 1's, represents the number -1. I'll simply call the bistreams that represent integers integers. With this restriction we can use a classic C hacker trick to write □p=p^(p+1) where ^ is the C XOR operator. The operator □ outputs the bits that get flipped when you add one.

Let's use the symbol → so that a → b is shorthand for ~a|b. Here are some properties of □:

1. □(-1) = -1

2. □p → □□p = -1

3. □(p → q) → □p → □q = -1

In addition we have the fixed point property:

Let F(p) be any function of p we can write using □ and the bitwise logical operators and such that all occurrences of p occur inside □. Then there is a unique bitstream q such that q=F(q).

We can make this clearer if we return to circuits. F(p) can be thought of as a circuit that takes p as input and outputs some value. We build the circuit using only boolean logic gates and delay-latch. We allow feedback loops, but only ones that go through delay-latches. With these restrictions it's pretty clear that the circuit is well-behaved and deterministically outputs a bitstream.

We also have the Löb property:

4. □(□p → p) → □p = -1

We can see this by examining the definition of □. Intuitively it says something like "once □ has seen a 0 input then no amount of setting input bits to 1 later in the stream make any different to its output".

I hope you've noticed something curious. These properties are extremely close to the properties of in GL. In fact, these electronic circuits form a model of the part of GL that doesn't involve variable names, i.e. what's known as letterless GL. We can formalise this:

1. Map to a wire set to 0, which outputs ...000 = 0.

2. Map to a wire set to 1, which outputs ...111 = -1.

3. Map , where is a binary connective, by creating a circuit that takes the outputs from the circuits for and and passes them into the corresponding boolean logic gate.

4. Map to the circuit for piped through a delay-latch.

For example, let's convert into a circuit. I'm translating to the circuit for .

I'm using red wires to mean wires carrying the value 1 rather than 0. I hope you can see that this circuit eventually settles into a state that outputs nothing but 1s.

We have this neat result:

Because delay-latch satisfies the same equations as in provability logic, any theorem, translated into a circuit, will produce a bistream of just 1s, i.e. -1.

But here's a more surprising result: the converse is true.

If the circuit corresponding to a letterless GL proposition produces a bistream of just 1s then the proposition is actually a theorem of GL.

I'm not going to prove this. (It's actually a disguised form of lemma 7.4 on p.95 of Boolos' book.) In the pictured example we got ...1111, so the circuit represents a theorem. As it represents Löb's theorem for the special case we should hope so. More generally, any bitstream that represents an integer can be converted back into a proposition that is equivalent to the original proposition. This means that bitstreams faithfully represent propositions of letterless GL. I'm not going to give the translation here but it's effectively given in Chapter 7 of Boolos. I'll use to represent the translation from propositions to bitstreams via circuits that I described above. Use to represent the translation of bitstream back into propositions. We have . But I haven't given a full description of and I haven't proved here that it has this property.

Circuits with feedback

In the previous section I considered letterless propositions of GL. When these are translated into circuits they don't have feedback loops. But we can also "solve equations" in GL using circuits with feedback. The GL fixed point theorem above says that we can "solve" the equation , with one letter , to produce a letterless proposition such that . Note here that is a letter in the language of GL. But I'm using to represent a proposition in letterless GL. If we build a circuit to represent , and feed its output back into where appears, then the output bitstream represents the fixed point. Here's a translation of the equation :

I'll let you try to convince yourself that such circuits always eventually output all 0's or all 1's. When we run the circuit we get the output ...1111000 = -8. As this is not -1 we know that the fixed point isn't a theorem. If I'd defined above you could use it to turn the bitstream back into a proposition.

The same, syntactically (optional section)

I have a Haskell library on github for working with GL: provability. This uses a syntactic approach and checks propositions for theoremhood using a tableau method. We can use it to analyse the above example with feedback. I have implemented a function, currently called value', to perform the evaluation of the bitstream for a proposition. However, in this case the fixedpoint function computes the fixed point proposition first and then converts to a bitstream rather than computing the bitstream directly from the circuit for F:

> let f p = Neg (Box p \/ Box (Box (Box p)))
> let Just p = fixedpoint f
> p
Dia T /\ Dia (Dia T /\ Dia (Dia T /\ Dia T))
> value' p
-8

(Note that Dia p means .)

The function fixedpoint does a lot of work under the hood. (It uses a tableau method to carry out Craig interpolation.) The circuit approach requires far less work.

Applications

1. Programs that reason about themselves

In principle we can write a program that enumerates all theorems of PA. That means we can use a quine trick to write a computer program that searches for a proof, in PA, of its own termination. Does such a program terminate?

We can answer this with Löb's theorem. Let "The program terminates". The program terminates if it can prove its termination. Formally this means we assume . Using one of the derivation rules of GL we get . Löb's theorem now gives us . Feed that back into our original hypothesis and we get . In other words, we deduce that our program does in fact terminate. (Thanks to Sridhar Ramesh for pointing this out to me.)

But we can deduce this using a circuit. We want a solution to . Here's the corresponding circuit:

It starts by outputting 1's and doesn't stop. In other words, the fixed point is a theorem. And that tells us is a theorem. And hence that the program terminates.

2. Robots who reason about each others play in Prisoner's Dilemma

For the background to this problem see Robust Cooperation in the Prisoner's Dilemma at LessWrong. We have two robot participants and playing Prisoner's Dilemma. Each can examine the other's source code and can search for proofs that the opponent will cooperate. Suppose each robot is programmed to enumerate all proofs of PA and cooperate if it finds a proof that its opponent will cooperate. Here we have "A will cooperate" and "B will cooperate". Our assumptions about the behaviour of the robots are and , and hence that . This corresponds to the circuit:

This outputs ...1111 = -1 so we can conclude and hence that these programs will cooperate. (Note that this doesn't work out nicely if robot B has a program that doesn't terminate but whose termination isn't provable in the formal system A is using. That means this approach is only good for robots that want to cooperate and want to confirm such cooperation. See the paper for more on this.)

At this point I really must emphasise that these applications are deceptively simple. I've shown how these simple circuits can answer some tricky problems about provability. But these aren't simply the usual translations from boolean algebra to logic gates. They work because circuits with delay-latch provide a model for letterless provability logic and that's only the case because of a lot of non-trivial theorem proving in Boolos that I haven't reproduced here. You're only allowed to use these simple circuits once you've seen the real proofs :-)

Things I didn't say above

1. I described the translation from propositions to circuits that I called above. But I didn't tell you what looks like. I'll leave this as an exercise. (Hint: consider the output from the translation of into a circuit.)

2. The integers, considered as bistreams, with the bitwise operators, and the unary operator □p=p^(p+1), form an algebraic structure. For example, if we define ⋄p=~□~p we have a Magari algebra. Structures like these are intended to capture the essential parts of self-referential arguments in an algebraic way.

3. Because of the interpretation of □ as a delayed latch in a circuit you could view it as saying "my input was always true until a moment ago". This surely embeds provability logic in a temporal logic of some sort.

4. (Deleted speculations about tit-for-tat that need rethinking.)

5. For even the most complex letterless proposition in Boolos you could check its theoremhood with a pretty small circuit. You could even consider doing this with a steam powered pneumatic circuit. I had to say that to fulfil a prophecy and maintain the integrity of the timeline.

Appendix on provability

The modern notion of a proof is that it is a string of symbols generated from some initial strings called "axioms" and some derivation rules that make new strings from both axioms and strings you've derived previously. Usually we pick axioms that represent "self-evident" truths and we pick derivation rules that are "truth-preserving" so that every proof ends at a true proposition of which it is a proof. The derivation rules are mechanical in nature: things like "if you have this symbol here and that symbol there then you can replace this symbol with that string you derived earlier" etc.

You can represent strings of symbols using numbers, so-called Gödel numbers. Let's pick a minimal mathematical framework for working with numbers: Peano Arithmetic, aka PA. Let's assume we've made some choice of Gödel numbering scheme and when is a proposition, write for the number representing . You can represent the mechanical derivation rules as operations on numbers. And that makes it possible to define a mathematical predicate that is true if and only if its argument represents a provable proposition.

In other words, we can prove using PA if and only if is a proposition provable in PA.

The predicate has some useful properties:

1.If we can prove , then we can prove .

We take the steps we used to prove , and convert everything to propositions about numbers. If is defined correctly then we can convert that sequence of numbers into a sequence of propositions about those numbers that makes up a proof of .

2. and imply

A fundamental step in any proof is modus ponens, i.e. that and implies . If does its job correctly then it had better know about this.

3. implies

One way is to prove this is to use Löb's theorem.

The trivially true statement had better be provable or is broken.

Constructing is conceptually straightforward but hard work. I'm definitely not going to do it here.

And there's one last thing we need: self-reference. If is a proposition, how can we possibly assert without squeezing a copy of inside ? I'm not going to do that here either - just mention that we can use a variation of quining to achieve this. That allows us to form a proposition for which we can prove . In fact, we can go further. We can find propositions that solve for any predicate built from the usual boolean operations and as long as all of the occurrences of are inside the appearances of . Even though we can't form a proposition that directly asserts its own falsity, we can form one that asserts that it is unprovable, or one that asserts that you can't prove that you can't prove that you can prove it, or anything along those lines.

Anyway, all that and business is a lot of hassle. Provability logic, also known as GL, is intended to capture specifically the parts of PA that relate to provability. GL is propositional calculus extended with the provability operator . The intention is that if is a proposition, is a proposition in GL that represents in PA. The properties of above become the axioms and derivation rules of GL in the main text.

A relaxation technique

2017-06-06T20:32:00.000-07:00

Introduction

Sometimes you want to differentiate the expected value of something. I've written about some tools that can help with this. For example you can use Automatic Differentiation for the derivative part and probability monads for the expectation. But the probability monad I described in that article computes the complete probability distribution for your problem. Frequently this is intractably large. Instead people often use Monte Carlo methods. They'll compute the "something" many times, substituting pseudo-random numbers for the random variables, and then average the results. This provides an estimate of the expected value and is ubiquitous in many branches of computer science. For example it's the basis of ray-tracing and path-tracing algorithms in 3D rendering, and plays a major role in machine learning when used in the form of stochastic gradient descent.

But there's a catch. Suppose we want to compute where each of the belong to the Bernoulli distribution . I.e. each has a probability of being 1 and probability of being 0. If we compute this using a Monte Carlo approach we'll repeatedly generate pseudo-random numbers for each of the . Each one will be 0 or 1. This means that our estimate depends on via subexpressions that can't meaningfully be differentiated with respect to . So how can we use automatic differentiation with the Monte Carlo method? I'm proposing an approach that may or may not already be in the literature. Whether it is or not, I think it's fun to get there by combining many of the things I've previously talked about here, such as free monads, negative probabilities and automatic differentiation. I'm going to assume you're familiar with using dual numbers to compute derivatives as I've written about this before and wikipedia has the basics.

A probability monad

I want to play with a number of different approaches to using monads with probability theory. Rather than define lots of monads I think that the easiest thing is to simply work with one free monad and then provide different interpreters for it.

First some imports:

> import Control.Monad
> import qualified System.Random as R
> import qualified Data.Map.Strict as M

I'm going to use a minimal free monad that effectively gives us a DSL with a new function that allows us to talk about random Bernoulli variables:

> data Random p a = Pure a | Bernoulli p (Int -> Random p a)

The idea is that Pure a represents the value a and Bernoulli p f is used to say "if we had a random value x, f x is the value we're interested in". The Random type isn't going to do anything other than represent these kinds of expressions. There's no implication that we actually have a random value for x yet.

> instance Functor (Random p) where
>     fmap f (Pure a) = Pure (f a)
>     fmap f (Bernoulli p g) = Bernoulli p (fmap f . g)


> instance Applicative (Random p) where
>     pure = return
>     (<*>) = ap


> instance Monad (Random p) where
>     return = Pure
>     Pure a >>= f = f a
>     Bernoulli p g >>= f = Bernoulli p (\x -> g x >>= f)

We'll use bernoulli p to represent a random Bernoulli variable drawn from .

> bernoulli :: p -> Random p Int
> bernoulli p = Bernoulli p return

So let's write our first random expression:

> test1 :: Random Float Float
> test1 = do
>     xs <- replicateM 4 (bernoulli 0.75)
>     return $ fromIntegral $ sum xs

It sums 4 Bernoulli random variables from and converts the result to a Float. The expected value is 3.

We don't yet have a way to do anything with this expression. So let's write an interpreter that can substitute pseudo-random values for each occurrence of bernoulli p:

It's essentially interpreting our free monad as a state monad where the state is the random number seed:

> interpret1 :: (Ord p, R.Random p, R.RandomGen g) => Random p a -> g -> (a, g)
> interpret1 (Pure a) seed = (a, seed)
> interpret1 (Bernoulli prob f) seed = 
>     let (r, seed') = R.random seed
>         b       = if r <= prob then 1 else 0
>     in interpret1 (f b) seed'

You can use the expression R.getStdRandom (interpret1 test1) if you want to generate some random samples for yourself.

We're interested in the expected value, so here's a function to compute that:

> expect1 :: (Fractional p, Ord p, R.Random p, R.RandomGen g) => Random p p -> Int -> g -> (p, g)
> expect1 r n g = 
>     let (x, g') = sum1 0 r n g
>     in (x/fromIntegral n, g')


> sum1 :: (Ord p, Num p, R.Random p, R.RandomGen g) => p -> Random p p -> Int -> g -> (p, g)
> sum1 t r 0 g = (t, g)
> sum1 t r n g =
>     let (a, g') = interpret1 r g
>     in sum1 (t+a) r (n-1) g'

You can test it out with R.getStdRandom (expect1 test1 1000). You should get values around 3.

We can try completely different semantics for Random. This time we compute the entire probability distribution:

> interpret2 :: (Num p) => Random p a -> [(a, p)]
> interpret2 (Pure a) = [(a, 1)]
> interpret2 (Bernoulli p f) =
>     scale p (interpret2 (f 1)) ++ scale (1-p) (interpret2 (f 0))


> scale :: Num p => p -> [(a, p)] -> [(a, p)]
> scale s = map (\(a, p) -> (a, s*p))

You can try it with interpret2 test1.

Unfortunately, as it stands it doesn't collect together multiple occurrences of the same value. We can do that with this function:

> collect :: (Ord a, Num b) => [(a, b)] -> [(a, b)]
> collect = M.toList . M.fromListWith (+)

And now you can use collect (interpret2 test1).

Let's compute some expected values:

> expect2 :: (Num p) => Random p p -> p
> expect2 r = sum $ map (uncurry (*)) (interpret2 r)

The value of expect2 test1 should be exactly 3. One nice thing about interpret2 is that it is differentiable with respect to the Bernoulli parameter when this is meaningful. Unfortunately it has one very big catch: the value of interpret2 can be a very long list. Even a small simulation can results in lists too big to store in the known universe. But interpret1 doesn't produce differentiable results. Is there something in-between these two interpreters?

Importance sampling

Frequently in Monte Carlo sampling it isn't convenient to sample from the distribution you want. For example it might be intractably hard to do so, or you might have proven that the resulting estimate has a high variance. So instead you can sample from a different, but possibly related distribution. This is known as importance sampling. Whenever you do this you must keep track of how "wrong" your probability was and patch up your expectation estimate at the end. For example, suppose a coin comes up heads 3/4 of the time. Instead of simulating a coin toss that comes up 3/4 of the time you could simulate one that comes up heads half of the time. Suppose at one point in the simulation it does come up heads. Then you used a probability of 1/2 when you should have used 3/4. So when you compute the expectation you need to scale the contribution from this sample by (3/4)/(1/2) = 3/2. You need so scale appropriately for every random variable used. A straightforward way to see this for the case of a single Bernoulli variable is to note that

.

We've replaced probabilities and with and but we had to scale appropriately in each of the cases and to keep the final value the same. I'm going to call the scale value the importance. If we generate random numbers in a row we need to multiply all of the importance values that we generate. This is a perfect job for the Writer monad using the Product monoid. (See Eric Kidd's paper for some discussion about the connection between Writer and importance sampling.) However I'm just going to write an explicit interpreter for our free monad to make it clear what's going where.

This interpreter is going to take an additional argument as input. It'll be a rule saying what probability we should sample with when handling a variable drawn from . The probability should be a real number in the interval .

> interpret3 :: (Fractional p, R.RandomGen g) =>
>               (p -> Float) -> Random p a -> g -> ((a, p), g)
> interpret3 rule (Pure a) g = ((a, 1), g)
> interpret3 rule (Bernoulli p f) g = 
>     let (r, g') = R.random g
>         prob = rule p
>         (b, i)  = if (r :: Float) <= prob
>           then (1, p/realToFrac prob)
>           else (0, (1-p)/realToFrac (1-prob))
>         ((a, i'), g'') = interpret3 rule (f b) g'
>     in ((a, i*i'), g'')

Here's the accompanying code for the expectation:

> expect3 :: (Fractional p, R.RandomGen g) =>
>            (p -> Float) -> Random p p -> Int -> g -> (p, g)
> expect3 rule r n g = 
>     let (x, g') = sum3 rule 0 r n g
>     in (x/fromIntegral n, g')


> sum3 :: (Fractional p, R.RandomGen g) =>
>         (p -> Float) -> p -> Random p p -> Int -> g -> (p, g)
> sum3 rule t r 0 g = (t, g)
> sum3 rule t r n g =
>     let ((a, imp), g') = interpret3 rule r g
>     in sum3 rule (t+a*imp) r (n-1) g'

For example, you can estimate the expectation of test1 using unbiased coin tosses by evaluating R.getStdRandom (expect3 (const 0.5) test1 1000).

Generalising probability

Did you notice I made my code slightly more general than seems to be needed? Although I use probabilities of type Float to generate my Bernoulli samples, the argument to the function bernoulli can be of a more general type. This means that we can use importance sampling to compute expected values for generalised measures that take values in a more general algebraic structure than the interval [0,1]. For example, we could use negative probabilities. An Operational Interpretation of Negative Probabilities and No-Signalling Models by Adamsky and Brandenberger give a way to interpret expressions involving negative probabilities. We can implement it using interpret3 and the rule \p -> abs p/(abs p+abs (1-p)). Note that it is guaranteed to produce values in the range [0,1] (if you start with dual numbers with real parts that are ordinary probabilities) and reproduces the usual behaviour when given ordinary probabilities.

Here's a simple expression using a sample from "":

> test2 = do
>     a <- bernoulli 2
>     return $ if a==1 then 2.0 else 1.0

It's expected value is 3. We can get this exactly using expect2 test2. For a Monte Carlo estimate use

R.getStdRandom (expect3 (\back p -> abs p/(abs p+abs (1-p))) test2 1000)

Note that estimates involving negative probabilities can have quite high variances so try a few times until you get something close to 3 :-)

We don't have to stick with real numbers. We can use this approach to estimate with complex probabilities (aka quantum mechanics) or other algebraic structures.

Discrete yet differentiable

And now comes the trick: automatic differentiation uses the algebra of dual numbers. It's not obvious at all what a probability like means when is infinitesimal. However, we can use interpret3 to give it meaningful semantics.

Let'd define the duals in the usual way first:

> data Dual a = D { real :: a, infinitesimal :: a }


> instance (Ord a, Num a) => Num (Dual a) where
>     D a b + D a' b' = D (a+a') (b+b')
>     D a b * D a' b' = D (a*a') (a*b'+a'*b)
>     negate (D a b) = D (negate a) (negate b)
>     abs (D a b) = if a > 0 then D a b else D (-a) (-b)
>     signum (D a b) = D (signum a) 0
>     fromInteger a = D (fromInteger a) 0


> instance (Ord a, Fractional a) => Fractional (Dual a) where
>     fromRational a = D (fromRational a) 0
>     recip (D a b) = let ia = 1/a in D ia (-b*ia*ia)


> instance Show a => Show (Dual a) where
>     show (D a b) = show a ++ "[" ++ show b ++ "]"

Now we can use the rule real to give as a real-valued probability from a dual number. The function expect3 will push the infinitesimal part into the importance value so it doesn't get forgotten about. And now expect3 gives us an estimate that is differentiable despite the fact that our random variables are discrete.

Let's try an expression:

> test3 p = do
>     a <- bernoulli p
>     b <- bernoulli p
>     return $ if a == 1 && b == 1 then 1.0 else 0.0

The expected value is and the derivative is . We can evaluate at with expect2 (test3 (D 0.5 1)). And we can estimate it with

R.getStdRandom (expect3 real (test4 (D 0.5 1)) 1000)

What's neat is that we can parameterise our distributions in a more complex way and we can freely mix with conventional expressions in our parameter. Here's an example:

> test4 p = do
>     a <- bernoulli p
>     b <- bernoulli (p*p)
>     return $ p*fromIntegral a*fromIntegral b

Try evaluating expect2 (test4 (D 0.5 1)) and

R.getStdRandom (expect3 real (test4 (D 0.5 1)) 1000)

I've collected the above examples together here:

> main = do
>     print =<< R.getStdRandom (interpret1 test1)
>     print $ collect $ interpret2 test1
>     print =<< R.getStdRandom (expect1 test1 1000)
>     print (expect2 test1)
>     print =<< R.getStdRandom (expect3 id test1 1000)
>     print =<< R.getStdRandom (expect3 (const 0.5) test1 1000)
>     print "---"
>     print $ expect2 test2
>     print =<< R.getStdRandom (expect3 (\p -> abs p/(abs p+abs (1-p))) test2 1000)
>     print "---"
>     print $ expect2 (test3 (D 0.5 1))
>     print =<< R.getStdRandom (expect3 real (test3 (D 0.5 1)) 1000)
>     print "---"
>     print $ expect2 (test4 (D 0.5 1))
>     print =<< R.getStdRandom (expect3 real (test4 (D 0.5 1)) 1000)

What just happened?

You can think of a dual number as a real number that has been infinitesimally slightly deformed. To differentiate something we need to deform something. But we can't deform 0 or 1 and have them stay 0 or 1. So the trick is to embed probability sampling in something "bigger", namely importance sampling, where samples carry around an importance value. This bigger thing does allow infinitesimal deformations. And that allows differentiation. This process of turning something discrete into something continuously "deformable" is generally called relaxation.

Implementation details

I've made no attempt to make my code fast. However I don't think there's anything about this approach that's incompatible with performance. There's no need to use a monad. Instead you can track the importance value through your code by hand and implement everything in C. Additionally, I've previously written about the fact that for any trick involving forward mode AD there is another corresponding trick you can use with reverse mode AD. So this method is perfectly comptible with back-propagation. Note also that the dual number importances always have real part 1 which means you don't actually need to store them.

The bad news is that the derivative estimate can sometimes have a high variance. Nonetheless, I've used it successfully for some toy optimisation problems. I don't know if this approach is effective for industrial strength problems. Your mileage may vary :-)

Alternatives

Sometimes you may find that it is acceptable to deform the samples from your discrete distribution. In that case you can use the concrete relaxation.

Continuous variables

The above method can be adapted to work with continuous variables. There is a non-trivial step which I'll leave as an exercise but I've tested it in some Python code. I think it reproduces a standard technique and it gives an alternative way to think about that trick. That article is also useful for ways to deal with the variance issues. Note also that importance sampling is normally used itself as a variance reduction technique. So there are probably helpful ways to modify the rule argument to interpret3 to simultaneously estimate derivatives and keep the variance low.

Personal note

I've thought about this problem a couple of times over the years. Each time I've ended up thinking "there's no easy way to extend AD to work with random variables so don't waste any more time thinking about it". So don't listen to anything I say. Also, I like that this method sort of comes "for free" once you combine methods I've described previously.

Acknowledgements

I think it was Eric Kidd's paper on building probability monads that first brought to my attention that there are many kinds of semantics you can use with probability theory - i.e. there are many interpreters you can write for the Random monad. I think there is an interesting design space worth exploring here.

Answer to exercise

I set the continuous case as an exercise above. Here is a solution.

Suppose you're sampling from a distribution parameterised by with pdf . To compute the derivative with respect to you need to consider sampling from where is an infinitesimal.

.

As we don't know how to sample from a pdf with infinitesimals in it, we instead sample using as usual, but use an importance of

The coefficient of the gives the derivative. So we need to compute the expectation, scaling each sample with this coefficient. In other words, to estimate we use

where the are drawn from the original distribution. This is exactly what is described at Shakir Mohamed's blog.

Final word

I managed to find the method in the literature. It's part of the REINFORCE method. For example, see equation (5) there.

Logarithms and exponentials of functions

2017-02-05T10:30:00.000-08:00

Introduction

A popular question in mathematics is this: given a function , what is its "square root" in the sense that . There are many questions about this on mathoverflow but it's also a popular subject in mathematics forums for non-experts. This question seems to have a certain amount of notoriety because it's easy to ask but hard to answer fully. I want to look at an approach that works nicely for formal power series, following from the Haskell code I wrote here. There are some methods for directly finding "functional square roots" for formal power series that start as , but I want to approach the problem indirectly. When working with real numbers we can find square roots, say, by using . I want to use an analogue of this for functions. So my goal is to make sense of the idea of the logarithm and exponential of a formal power series as composable functions. Warning: the arguments are all going to be informal.

Notation

There's potential for a lot of ambiguous notation here, especially as the usual mathematical notation for th powers of trig functions is so misleading. I'm going to use for composition of functions and power series, and I'm going to use the notation to mean the th iterate of . So and . As I'll be working mostly in the ring of formal power series for some ring , I'll reserve the variable to refer only to the corresponding element in this ring. I'll also use formal power series somewhat interchangeably with functions. So can be thought of as representing the identity function. To make sure we're on the same page, here are some small theorems in this notation:

That last one simply says that adding one times is the same as adding .

As I'm going to have ordinary logarithms and exponentials sitting around, as well as functional logarithms and exponentials, I'm going to introduce the notation for functional logarithm and for functional exponentiation.

Preliminaries

The first goal is to define a non-trivial function with the fundamental property that

First, let's note some basic algebraic facts. The formal power series form a commutative ring with operations and (ordinary multiplication) and with additive identity and multiplicative identity . The formal power series form a ring-like algebraic structure with operation and partial operation with additive identity and multiplicative identity . But it's not actually ring or even a near-ring. Composition isn't defined for all formal power series and even when it's defined, we don't have distributivity. For example, in general , after all there's no reason to expect to equal . We do have right-distributivity however, i.e.

,

because

,

more or less by definition of .

We can't use power series on our power series

There's an obvious approach, just use power series of power series. So we might tentatively suggest that

.

Note that I consider rather than because is the multiplicative identity in our ring-like structure.

Unfortunately this doesn't work. The reason is this: if we try to use standard reasoning to show that the resulting function has the fundamental property we seek we end up using distributivity. We don't have distributivity.

Sleight of hand

There's a beautiful trick I spotted on mathoverflow recently that allows us to bring back distributivity. (I can't find the trick again, but when I do I'll come back and add a link and credit here.) Consider the function defined by . In other words is right-composition by . (Ambiguity alert, I'm using here to mean right. It has nothing to do with the ring underlying our formal power series.) Because we have right-distributivity, is a bona fide linear operator on the space of formal power series. If you think of formal power series as being infinitely long vectors of coefficients then can be thought of as an infinitely sized matrix. This means that as long as we have convergence, we can get away with using power series to compute with the property that . Define:

.

We have:

where I'm using to mean the identity linear operator. And now have:

.

But does it converge? Suppose is of the form . Then . The leading term in is the same as the leading term in . So kills the first term of whatever it is applied to, which means that when we sum the terms in , we only need to get a power series correct to coefficients. Reusing my code from here, I call by the name flog. Here is its implementation:

> import Data.Ratio


> flog :: (Eq a, Fractional a) => [a] -> [a]
> flog f@(0 : 1 : _) =
>   flog' 1 (repeat 0) (0 : 1 : repeat 0)
>      where flog' n total term = take (n+1) total ++ (
>              drop (n+1) $
>                 let pz = p term
>                 in flog' (n+1) (total-map (((-1)^n / fromIntegral n) *) pz) pz)
>            p total = (total ○ f) - total

The take and drop are how I tell Haskell when the first coefficients have been exactly computed and so no more terms are necessary.

Does it work?

Here's an example using the twice iterated sin function:

> ex1 = do
>   let lhs = flog (sin (sin z))
>   let rhs = 2*flog (sin z)
>   mapM_ print $ take 20 (lhs-rhs)

Works to 20 coefficients. Dare we try an inverse function?

> ex2 = do
>   let lhs = flog (sin z)
>   let rhs = flog (asin z)
>   mapM_ print $ take 20 (lhs+rhs)

Seems to work!

Exponentials

It's no good having logarithms if we can't invert them. One way to think about the exponential function is that

We get better and better approximations by writing the expression inside the limit as a product of more and more terms. We can derive the usual power series for from this, but only if right-distributivity holds. So let's try to use the above expression directly:

and get

.

Unfortunately, even though is linear, itself isn't. So it's going to take some extra work to raise to the power of .

The good news is that we're dealing with the special case where is something small. We have

.

So is actually modulo higher order terms. This gives us

.

This is something we can implement using the power series for ordinary :

.

In code that becomes:

> fexp f@(0 : 0 : _) = fexp' f 0 z 1
> fexp' f total term n = take (n-1) total ++ drop (n-1)
>           (fexp' f (total+term) (map (/fromIntegral n) (f*d term)) (n+1))

Note how when we differentiate a power series we shift the coefficients down by one place. To counter the effect of that so as to ensure convergence we need to look like . Luckily this is exactly the kind of series gives us.

But does it successfully invert ? Let's try:

> ex3 = do
>   let lhs = sin z
>   let rhs = fexp (flog (sin z))
>   mapM_ print $ take 20 (lhs-rhs)

Now we can start computing fractional iterates. Square root first:

> ex4 = do
>   mapM_ print $ take 20 $ fexp (flog (sin z)/2)

That matches the results at A048602 and A048603.

Cube root:

> ex5 = do
>   mapM_ print $ take 20 $ fexp (flog (sin z)/3)

Matches A052132 and A052135.

And this gives an alternative to Lagrange inversion for computing power series for inverse functions:

> ex6 = do
>   let lhs = fexp (-flog (sin z))
>   let rhs = asin z
>   mapM_ print $ take 20 (lhs-rhs)

What's really going on with ?

Let's approach in a slightly different way. In effect, is the composition of lots of with . So let's try composing these one at a time, with one composition every seconds. After one second we should have our final result. We can write this as:

and to first order.

So we're solving the differential equation:

and

with .

So is the function that solves one of the most fundamental differential equations. This also means I can use Mathematica to solve symbolically and check my results. For example, Mathematica says that the solution to

and

at is

so let's check:

> ex7 = do
>   let lhs = fexp ((sin z)^2)
>   let rhs = atan (tan z/(1-tan z))
>   mapM_ print $ take 20 (lhs-rhs)

I like this example because it leads to the generalized Catalan numbers A004148:

> ex8 = do
>     mapM_ print $ take 20 $ fexp (z^2/(1-z^2))

That suggests this question: what does mean combinatorially? I don't have a straightforward answer but solving this class of differential equation motivated the original introduction, by Cayley, of the abstract notion of a tree. See here.

What is going on geometrically?

For those who know some differential geometry, The differential equation

and

describes a flow on the real line (or complex plane). You can think of as being a one-dimensional vector field describing how points move from time to . When we solve the differential equation we get integral curves that these points follow and tells us where the points end up after one unit of time. So is the exponential map. In fact, is essentially the exponential of the vector field where we're now using the differential geometer's notion of a vector field as a differential operator.

Final word

Unfortunately the power series you get from using and don't always have good convergence properties. For example, I'm not sure but I think the series for has radius of convergence zero. If you truncate the series you get a half-decent approximaion to a square root in the vicinity of the origin, but the approximation gets worse, not better, if you use more terms.

And the rest of the code

> (*!) _ 0 = 0
> (*!) a b = a*b
> (!*) 0 _ = 0
> (!*) a b = a*b
> (^+) a b = zipWith (+) a b
> (^-) a b = zipWith (-) a b


> ~(a:as) ⊗ (b:bs) = (a *! b):
>     ((map (a !*) bs) ^+ (as ⊗ (b:bs)))
> (○) (f:fs) (0:gs) = f:(gs ⊗ (fs ○ (0:gs)))
> inverse (0:f:fs) = x where x     = map (recip f *) (0:1:g)
>                            _:_:g    = map negate ((0:0:fs) ○ x)
> invert x = r where r = map (/x0)  ((1:repeat 0) ^- (r ⊗ (0:xs)))
>                    x0:xs = x 


> (^/) (0:a) (0:b) = a ^/ b
> (^/) a b = a ⊗ (invert b)


> z :: [Rational]
> z = 0:1:repeat 0


> d (_:x) = zipWith (*) (map fromInteger [1..]) x


> integrate x = 0 : zipWith (/) x (map fromInteger [1..])


> instance (Eq r, Num r) => Num [r] where
>     x+y  = zipWith (+) x y
>     x-y  = zipWith (-) x y
>     ~x*y = x ⊗ y
>     fromInteger x      = fromInteger x:repeat 0
>     negate x     = map negate x
>     signum (x:_) = signum x : repeat 0
>     abs (x:xs)   = error "Can't form abs of a power series"


> instance (Eq r, Fractional r) => Fractional [r] where
>     x/y = x ^/ y
>     fromRational x    = fromRational x:repeat 0


> sqrt' x = 1 : rs where rs = map (/2) (xs ^- (rs ⊗ (0:rs)))
>                        _ : xs = x
> instance (Eq r, Fractional r) => Floating [r] where
>     sqrt (1 : x) = sqrt' (1 : x)
>     sqrt _  = error "Can only find sqrt when leading term is 1"
>     exp x   = e where e = 1+integrate (e * d x)
>     log x   = integrate (d x/x)
>     sin x   = integrate ((cos x)*(d x))
>     cos x   = [1] ... negate (integrate ((sin x)*(d x)))
>     asin x  = integrate (d x/sqrt(1-x*x))
>     atan x  = integrate (d x/(1+x*x))
>     acos x  = error "Unable to form power series for acos"
>     sinh x  = integrate ((cosh x)*(d x))
>     cosh x  = [1] ... integrate ((sinh x)*(d x))
>     asinh x = integrate (d x/sqrt(1+x*x))
>     atanh x = integrate (d x/(1-x*x))
>     acosh x = error "Unable to form power series for acosh"
>     pi      = error "There is no formal power series for pi"


> lead [] x = x
> lead (a:as) x = a : (lead as (tail x))
> a ... x = lead a x


> (//) :: Fractional a => [a] -> (Integer -> Bool) -> [a]
> (//) a c = zipWith (\a-> \b->(if (c a :: Bool) then b else 0)) [(0::Integer)..] a

A direct functional square root that doesn't use and :

> fsqrt (0 : 1 : fs) =
>     let gs = (fs-(0 : gs*((0 : delta gs gs)+((2 : gs)*(gs*g)))))/2
>         g = 0 : 1 : gs
>         delta (g : gs) h = let g' = delta gs h
>                    in (0 : ((1 : h) * g')) + gs
>     in g

Building free arrows from components

2017-01-09T08:33:00.000-08:00

Introduction

Gabriel Gonzalez has written quite a bit about the practical applications of free monads. And "haoformayor" wrote a great stackoverflow post on how arrows are related to strong profunctors. So I thought I'd combine these and apply them to arrows built from profunctors: free arrows. What you get is a way to use arrow notation to build programs, but defer the interpretation of those programs until later.

Heteromorphisms

Using the notation here I'm going to call an element of a type P a b, where P is a profunctor, a heteromorphism.

A product that isn't much of a product

As I described a while back you can compose profunctors. Take a look at the code I used, and also Data.Functor.Composition.

data Compose f g d c = forall a. Compose (f d a) (g a c)

An element of Compose f g d c is just a pair of heteromorphisms, one from each of the profunctors, f and g, with the proviso that the "output" type of one is compatible with the "input" type of the other. As products go it's pretty weak in the sense that no composition happens beyond the two objects being stored with each other. And that's the basis of what I'm going to talk about. The Compose type is just a placeholder for pairs of heteromorphisms whose actual "multiplication" is being deferred until later. This is similar to the situation with the free monoid, otherwise known as a list. We can "multiply" two lists together using mappend but all that really does is combine the elements into a bigger list. The elements themselves aren't touched in any way. That suggests the idea of using profunctor composition in the same way that (:) is used to pair elements and lists.

Free Arrows

Here's some code:

> {-# OPTIONS -W #-}
> {-# LANGUAGE ExistentialQuantification #-}
> {-# LANGUAGE Arrows #-}
> {-# LANGUAGE RankNTypes #-}
> {-# LANGUAGE TypeOperators #-}
> {-# LANGUAGE FlexibleInstances #-}


> import Prelude hiding ((.), id)
> import Control.Arrow
> import Control.Category
> import Data.Profunctor
> import Data.Monoid


> infixr :-


> data FreeA p a b = PureP (a -> b)
>                  | forall x. p a x :- FreeA p x b

First look at the second line of the definition of FreeA. It says that a FreeA p a b might be a pair consisting of a head heteromorphism whose output matches the input of another FreeA. There's also the PureP case which is acting like the empty list []. The reason we use this is that for our composition, (->) acts a lot like the identity. In particular Composition (->) p a b is isomorphic to p a b (modulo all the usual stuff about non-terminating computations and so on). This is because an element of this type is a pair consisting of a function a -> x and a heteromorphism p x b for some type x we don't get to see. We can't project back out either of these items without information about the type of x escaping. So the only thing we can possibly do is use lmap to apply the function to the heteromorphism giving us an element of p a b.

Here is a special case of PureP we'll use later:

> nil :: Profunctor p => FreeA p a a
> nil = PureP id

So an element of FreeA is a sequence of heteromorphisms. If heteromorphisms are thought of as operations of some sort, then an element of FreeA is a sequence of operations waiting to be composed together into a program that does something. And that's just like the situation with free monads. Once we've build a free monad structure we apply an interpreter to it to evaluate it. This allows us to separate the "pure" structure representing what we want to do from the code that actually does it.

The first thing to note is our new type is also a profunctor. We can apply lmap and rmap to a PureP function straightforwardly. We apply lmap directly to the head of the list and we use recursion to apply rmap to the PureP at the end:

> instance Profunctor b => Profunctor (FreeA b) where
>     lmap f (PureP g) = PureP (g . f)
>     lmap f (g :- h) = (lmap f g) :- h
>     rmap f (PureP g) = PureP (f . g)
>     rmap f (g :- h) = g :- (rmap f h)

We also get a strong profunctor by applying first' all the way down the list:

> instance Strong p => Strong (FreeA p) where
>     first' (PureP f) = PureP (first' f)
>     first' (f :- g) = (first' f) :- (first' g)

We can now concatenate our lists of heteromorphisms using code that looks a lot like the typical implementation of (++):

> instance Profunctor p => Category (FreeA p) where
>     id = PureP id
>     g . PureP f = lmap f g
>     k . (g :- h) = g :- (k . h)

Note that it's slightly different to what you might have expected compared to (++) because we tend to write composition of functions "backwards". Additionally, there is another definition of FreeA we could have used that's analogous to using snoc lists instead of cons lists.

And now we have an arrow. I'll leave the proofs that the arrow laws are obeyed as an exercise :-)

> instance (Profunctor p, Strong p) => Arrow (FreeA p) where
>     arr = PureP
>     first = first'

The important thing about free things is that we can apply interpreters to them. For lists we have folds:

foldr :: (a -> b -> b) -> b -> [a] -> b

In foldr f e we can think of f as saying how (:) should be interpreted and e as saying how [] should be interpreted.

Analogously, in Control.Monad.Free in the free package we have:

foldFree :: Monad m => (forall x . f x -> m x) -> Free f a -> m a
foldFree _ (Pure a)  = return a
foldFree f (Free as) = f as >>= foldFree f

Given a natural transformation from f to m, foldFree extends it to all of Free f.

Now we need a fold for free arrows:

> foldFreeA :: (Profunctor p, Arrow a) =>
>              (forall b c.p b c -> a b c) -> FreeA p b c -> a b c
> foldFreeA _ (PureP g) = arr g
> foldFreeA f (g :- h) = foldFreeA f h . f g

It's a lot like an ordinary fold but uses the arrow composition law to combine the interpretation of the head with the interpretation of the tail.

"Electronic" components

Let me revisit the example from my previous article. I'm going to remove things I won't need so my definition of Circuit is less general here. Free arrows are going to allow us to define individual components for a circuit, but defer exactly how those components are interpreted until later.

I'll use four components this time: a register we can read from, one we can write from and a register incrementer, as well as a "pure" component. But before that, let's revisit Gabriel's article that gives some clues about how components should be built. In particular, look at the definition of TeletypeF:

data TeletypeF x
  = PutStrLn String x
  | GetLine (String -> x)
  | ExitSuccess

We use GetLine to read a string, and yet the type of GetLine k could be TeletypeF a for any a. The reason is that free monads work with continuations. Instead of GetLine returning a string to us, it's a holder for a function that says what we'd like to do with the string once we have it. That means we can leave open the question of where the string comes from. The function foldFree can be used to provide the actual string getter.

Free arrows are like "two-sided" free monads. We don't just provide a continuation saying what we'd like to do to our output. We also get to say how we prepare our data for input.

There's also some burden put on us. Free arrows need strong profunctors. Strong profunctors need to be able to convey extra data alongside the data we care about - that's what first' is all about. This means that even though Load is functionally similar to GetLine, it can't simply ignore its input. So we don't have Load (Int -> b), and instead have Load ((a, Int) -> b. Here is our component type:

> data Component a b = Load ((a, Int) -> b)
>                    | Store (a -> (b, Int))
>                    | Inc (a -> b)

The Component only knows about the data passing through, of type a and b. It doesn't know anything about how the data in the registers is stored. That's the part that will be deferred to later. We intend for Inc to increment a register. But as it doesn't know anything about registers nothing in the type of Inc refers to that. (It took a bit of experimentation for me to figure this out and there may be other ways of doing things. Often with code guided by category theory you can just "follow your nose" as there's one way that works and type checks. Here I found a certain amount of flexibility in how much you store in the Component and how much is deferred to the interpreter.)

I could implement the strong profunctor instances using various combinators but I think it might be easiest to understand when written explicitly with lambdas:

> instance Profunctor Component where
>     lmap f (Load g) = Load $ \(a, s) -> g (f a, s)
>     lmap f (Store g) = Store (g . f)
>     lmap f (Inc g) = Inc (g . f)


>     rmap f (Load g) = Load (f . g)
>     rmap f (Store g) = Store $ \a -> let (b, t) = g a
>                                      in  (f b, t)
>     rmap f (Inc g) = Inc (f . g)


> instance Strong Component where
>     first' (Load g) = Load $ \((a, x), s) -> (g (a, s), x)
>     first' (Store g) = Store $ \(a, x) -> let (b, t) = g a
>                                           in  ((b, x), t)
>     first' (Inc g) = Inc (first' g)

And now we can implement individual components. First a completely "pure" component:

> add :: Num a => FreeA Component (a, a) a
> add = PureP $ uncurry (+)

And now the load and store operations.

> load :: FreeA Component () Int
> load = Load (\(_, a) -> a) :- nil


> store :: FreeA Component Int ()
> store = Store (\a -> ((), a)) :- nil


> inc :: FreeA Component a a
> inc = Inc id :- nil

Finally we can tie it all together in a complete function using arrow notation:

> test = proc () -> do
>     () <- inc   -< ()
>     a  <- load  -< ()
>     b  <- load  -< ()
>     c  <- add   -< (a, b)
>     () <- store -< c


>     returnA -< ()

At this point, the test object is just a list of operations waiting to be executed. Now I'll give three examples of semantics we could provide. The first uses a state arrow type similar to the previous article:

> newtype Circuit s a b = C { runC :: (a, s) -> (b, s) }


> instance Category (Circuit s) where
>     id = C id
>     C f . C g = C (f . g)


> instance Arrow (Circuit s) where
>     arr f = C $ \(a, s) -> (f a, s)
>     first (C g) = C $ \((a, x), s) -> let (b, t) = g (a, s)
>                                       in  ((b, x), t)

Here is an interpreter that interprets each of our components as an arrow. Note that this is where, among other things, we provide the meaning of the Inc operation:

> exec :: Component a b -> Circuit Int a b
> exec (Load g) = C $ \(a, s) -> (g (a, s), s)
> exec (Store g) = C $ \(a, _) -> g a
> exec (Inc g) = C $ \(a, s) -> (g a, s+1)

Here's a completely different interpreter that is going to make you do the work of maintaining the state used by the resgisters. You'll be told what to do! We'll use the Kleisli IO arrow to do the I/O.

> exec' :: Component a b -> Kleisli IO a b
> exec' (Load g) = Kleisli $ \a -> do
>     putStrLn "What is your number now?"
>     s <- fmap read getLine
>     return $ g (a, s)
> exec' (Store g) = Kleisli $ \a -> do
>     let (b, t) = g a
>     putStrLn $ "Your number is now " ++ show t ++ "."
>     return b
> exec' (Inc g) = Kleisli $ \a -> do
>     putStrLn "Increment your number."
>     return $ g a

The last interpreter is simply going to sum values associated to various components. They could be costs in dollars, time to execute, or even strings representing some kind of simple execution trace.

> newtype Labelled m a b = Labelled { unLabelled :: m }


> instance Monoid m => Category (Labelled m) where
>     id = Labelled mempty
>     Labelled a . Labelled b = Labelled (a `mappend` b)


> instance Monoid m => Arrow (Labelled m) where
>     arr _ = Labelled mempty
>     first (Labelled m) = Labelled m


> exec'' (Load _) = Labelled (Sum 1)
> exec'' (Store _) = Labelled (Sum 1)
> exec'' (Inc _) = Labelled (Sum 2)

Note that we can't assign non-trivial values to "pure" operations.

And now we execute all three:

> main = do
>     print $ runC (foldFreeA exec test) ((), 10)
>     putStrLn "Your number is 10." >> runKleisli (foldFreeA exec' test) ()
>     print $ getSum $ unLabelled $ foldFreeA exec'' test

Various thoughts

I don't know if free arrows are anywhere near as useful as free monads, but I hope I've successfully illustrated one application. Note that because arrow composition is essentially list concatenation it may be more efficient to use a version of Hughes lists. This is what the Cayley representation is about in the monoid notions paper. But it's easier to see the naive list version first. Something missing from here that is essential for electronics simulation is the possibility of using loops. I haven't yet thought too much about what it means to build instances of ArrowLoop freely.

Profunctors have been described as decategorised matrices in the sense that p a b, with p a profunctor, is similar to the matrix . Or, if you're working in a context where you distinguish between co- and contravariant vectors, it's similar to . The Composition operation is a lot like the definition of matrix product. From this perspective, the FreeA operation is a lot like the function on matrices that takes to . To work with ArrowLoop we need a trace-like operation.

One nice application of free monads is in writing plugin APIs. Users can write plugins that link to a small library based on a free monad. These can then be dynamically loaded and interpreted by an application at runtime, completely insulating the plugin-writer from the details of the application. You can think of it as a Haskell version of the PIMPL idiom. Free arrows might give a nice way to write plugins for dataflow applications.

People typically think of functors as containers. So in a free monad, each element is a container of possible futures. In a free arrow the relationship between the current heteromorphism and its "future" (and "past") is a bit more symmetrical. For example, for some definitions of P, a heteromorphism P a b can act on some as to give us some bs. But some definitions of P can run "backwards" and act on elements of b -> r to give us elements of a -> r. So when I use the words "input" and "output" above, you might not want to take them too literally.

Addressing Pieces of State with Profunctors

2017-01-07T13:46:00.000-08:00

Attempted segue

Since I first wrote about profunctors there has been quite a bit of activity in the area so I think it's about time I revisited them. I could just carry on from where I left off 5 years ago but there have been so many tutorials on the subject that I think I'll have to assume you've looked at them. My favourite is probably Phil Freeman's Fun with Profunctors. What I intend to do here is solve a practical problem with profunctors.

The problem

Arrows are a nice mechanism for building circuit-like entities in code. In fact, they're quite good for simulating electronic circuits. Many circuits are very much like pieces of functional code. For example an AND gate like this

can be nicely modelled using a pure function: c = a && b. But some components, like flip-flops, have internal state. What comes out of the outputs isn't a simple function of the inputs right now, but depends on what has happened in the past. (Alternatively you can take the view that the inputs and outputs aren't the current values but the complete history of the values.)

We'll use (Hughes) arrows rather than simple functions. For example, one kind of arrow is the Kleisli arrow. For the case of Kleisli arrows built from the state monad, these are essentially functions of type a -> s -> (b, s) where s is our state. We can write these more symmetrically as functions of type (a, s) -> (b, s). We can think of these as "functions" from a to b where the output is allowed to depend on some internal state s. I'll just go ahead and define arrows like this right now.

First the extensions and imports:

> {-# OPTIONS -W #-}
> {-# LANGUAGE Arrows #-}
> {-# LANGUAGE RankNTypes #-}
> {-# LANGUAGE FlexibleInstances #-}


> import Prelude hiding ((.), id)
> import Control.Arrow
> import Control.Category
> import Data.Profunctor
> import Data.Tuple

And now I'll define our stateful circuits. I'm going to make these slightly more general than I described allowing circuits to change the type of their state:

> newtype Circuit s t a b = C { runC :: (a, s) -> (b, t) }


> instance Category (Circuit s s) where
>     id = C id
>     C f . C g = C (f . g)


> instance Arrow (Circuit s s) where
>     arr f = C $ \(a, s) -> (f a, s)
>     first (C g) = C $ \((a, x), s) -> let (b, t) = g (a, s)
>                                       in  ((b, x), t)

This is just a more symmetrical rewrite of the state monad as an arrow. The first method allows us to pass through some extra state, x, untouched.

Now for some circuit components. First the "pure" operations, a multiplier and a negater:

> mul :: Circuit s s (Int, Int) Int
> mul = C $ \((x, y), s) -> (x*y, s)


> neg :: Circuit s s Int Int
> neg = C $ \(x, s) -> (-x, s)

And now some "impure" ones that read and write some registers as well as an accumulator:

> store :: Circuit Int Int Int ()
> store = C $ \(x, _) -> ((), x)


> load :: Circuit Int Int () Int
> load = C $ \((), s) -> (s, s)


> accumulate :: Circuit Int Int Int Int
> accumulate = C $ \(a, s) -> (a, s+a)

I'd like to make a circuit that has lots of these components, each with its own state. I'd like to store all of these bits of state in a larger container. But that means that each of these components needs to have a way to address its own particular substate. That's the problem I'd like to solve.

Practical profunctor optics

In an alternative universe lenses were defined using profunctors. To find out more I recommend Phil Freeman's talk that I linked to above. Most of the next paragraph is just a reminder of what he says in that talk and I'm going to use the bare minimum to do the job I want.

Remember that one of the things lenses allow you to do is this: suppose we have a record s containing a field of type a and another similar enough kind of record t with a field of type b. Among other things, a lens gives a way to take a rule for modifying the a field to a b field and extend it to a way to modify the s record into a t record. So we can think of lenses as giving us functions of type (a -> b) -> (s -> t). Now if p is a profunctor then you can think of p a b as being a bit function-like. Like functions, profunctors typically (kinda, sorta) get used to consume (zero or more) objects of type a and output (zero or more) objects of type b. So it makes sense to ask our lenses to work with these more general objects too, i.e. we'd like to be able to get something of type p a b -> p s t out of a lens. A strong profunctor is one that comes pre-packed with a lens that can do this for the special case where the types s and t are 2-tuples. But you can think of simple records as being syntactic sugar for tuples of fields, so strong profunctors also automatically give us lenses for records. Again, watch Phil's talk for details.

So here is our lens type:

> type Lens s t a b = forall p. Strong p => p a b -> p s t

Here are lenses that mimic the well known ones from Control.Lens:

> _1 :: Lens (a, x) (b, x) a b
> _1 = first'


> _2 :: Lens (x, a) (x, b) a b
> _2 = dimap swap swap . first'

(Remember that dimap is a function to pre- and post- compose a function with two others.)

Arrows are profunctors. So Circuit s s, when wrapped in WrappedArrow, is a profunctor. So now we can directly use the Circuit type with profunctor lenses. This is cool, but it doesn't directly solve our problem. So we're not going to use this fact. We're interested in addressing the state of type s, not the values of type a and b passed through our circuits. In other words, we're interested in the fact that Circuit s t a b is a profunctor in s and t, not a and b. To make this explicit we need a suitable way to permute the arguments to Circuit:

> newtype Flipped p s t a b = F { unF :: p a b s t }

(It was tempting to call that ComedyDoubleAct.)

And now we can define:

> instance Profunctor (Flipped Circuit a b) where
>     lmap f (F (C g)) = F $ C $ \(a, s) -> g (a, f s)
>     rmap f (F (C g)) = F $ C $ \(a, s) -> let (b, t) = g (a, s)
>                                           in  (b, f t)


> instance Strong (Flipped Circuit a b) where
>     first' (F (C g)) = F $ C $ \(a, (s, x)) -> let (b, t) = g (a, s)
>                                                in  (b, (t, x))

Any time we want to use this instance of Profunctor with a Circuit we have to wrap everything with F and unF. The function dimap gives us a convenient way to implement such wrappings.

Let's implement an imaginary circuit with four bits of state in it.

Here is the state:

> data CPU = CPU { _x :: Int, _y :: Int, _z :: Int, _t :: Int } deriving Show

As I don't have a complete profunctor version of a library like Control.Lens with its template Haskell magic I'll set things up by hand. Here's a strong-profunctor-friendly version of the CPU and a useful isomorphism to go with it:

> type ExplodedCPU = (Int, (Int, (Int, Int)))


> explode :: CPU -> ExplodedCPU
> explode (CPU u v w t) = (u, (v, (w, t)))


> implode :: ExplodedCPU -> CPU
> implode (u, (v, (w, t))) = CPU u v w t

And now we need adapters that take lenses for an ExplodedCPU and (1) apply them to a CPU the way Control.Lens would...

> upgrade :: Profunctor p =>
>            (p a a -> p ExplodedCPU ExplodedCPU) ->
>            (p a a -> p CPU CPU)
> upgrade f = dimap explode implode . f


> x, y, z, t :: Flipped Circuit a b Int Int -> Flipped Circuit a b CPU CPU
> x = upgrade _1
> y = upgrade $ _2 . _1
> z = upgrade $ _2 . _2 . _1
> t = upgrade $ _2 . _2 . _2

...and (2) wrap them so they can be used on the flipped profunctor instance of Circuit:

> (!) :: p s t a b -> (Flipped p a b s t -> Flipped p a b s' t') ->
>        p s' t' a b
> x ! f = dimap F unF f x

After all that we can now write a short piece of code that represents our circuit. Notice how we can apply the lenses x, ..., t directly to our components to get them to use the right pieces of state:

> test :: Circuit CPU CPU () ()
> test = proc () -> do
>     a  <- load ! x       -< ()
>     b  <- load ! y       -< ()
>     c  <- mul            -< (a, b)
>     d  <- neg            -< c
>     e  <- accumulate ! t -< d
>     () <- store ! z      -< e


>     returnA              -< ()


> main :: IO ()
> main = do
>     print $ runC test ((), CPU 2 30 400 5000)

Of course with a suitable profunctor lens library you can do a lot more, like work with traversable containers of components.

Note that we could also write a version of all this code using monads instead of arrows. But it's easier to see the symmetry in Flipped Circuit when using arrows, and it also sets the scene for the next thing I want to write about...

Expectation-Maximization with Less Arbitrariness

2016-10-16T16:04:00.000-07:00

Introduction

Google have stopped supporting the Chart API so all of the mathematics notation below is missing. There is a PDF version of this article at GitHub.

There are many introductions to the Expectation-Maximisation algorithm. Unfortunately every one I could find uses arbitrary seeming tricks that seem to be plucked out of a hat by magic. They can all be justified in retrospect, but I find it more useful to learn from reusable techniques that you can apply to further problems. Examples of tricks I've seen used are:

Using Jensen's inequality. It's easy to find inequalities that apply in any situation. But there are often many ways to apply them. Why apply it to this way of writing this expression and not that one which is equal?
Substituting in the middle of an expression. Again, you can use just about anywhere. Why choose this at this time? Similarly I found derivations that insert a into an expression.
Majorisation-Minimisation. This is a great technique, but involves choosing a function that majorises another. There are so many ways to do this, it's hard to imagine any general purpose method that tells you how to narrow down the choice.

My goal is to fill in the details of one key step in the derivation of the EM algorithm in a way that makes it inevitable rather than arbitrary. There's nothing original here, I'm merely expanding on a stackexchange answer.

Generalities about EM

The EM algorithm seeks to construct a maximum likelihood estimator (MLE) with a twist: there are some variables in the system that we can't observe.

First assume no hidden variables. We assume there is a vector of parameters that defines some model. We make some observations . We have a probability density that depends on . The likelihood of given the observations is . The maximum likelhood estimator for is the choice of that maximises for the we have observed.

Now suppose there are also some variables that we didn't get to observe. We assume a density . We now have

where we sum over all possible values of . The MLE approach says we now need to maximise

One of the things that is a challenge here is that the components of might be mixed up among the terms in the sum. If, instead, each term only referred to its own unique block of , then the maximisation would be easier as we could maximise each term independently of the others. Here's how we might move in that direction. Consider instead the log-likelihood

Now imagine that by magic we could commute the logarithm with the sum. We'd need to maximise

One reason this would be to our advantage is that often takes the form where is a simple function to optimise. In addition, may break up as a sum of terms, each with its own block of 's. Moving the logarithm inside the sum would give us something we could easily maximise term by term. What's more, the for each is often a standard probability distribution whose likelihood we already know how to maximise. But, of course, we can't just move that logarithm in.

Maximisation by proxy

Sometimes a function is too hard to optimise directly. But if we have a guess for an optimum, we can replace our function with a proxy function that approximates it in the neighbourhood of our guess and optimise that instead. That will give us a new guess and we can continue from there. This is the basis of gradient descent. Suppose is a differentiable function in a neighbourhood of . Then around we have

We can try optimising with respect to within a neighbourhood of . If we pick a small circular neighbourhood then the optimal value will be in the direction of steepest descent. (Note that picking a circular neighbourhood is itself a somewhat arbitrary step, but that's another story.) For gradient descent we're choosing because it matches both the value and derivatives of at . We could go further and optimise a proxy that shares second derivatives too, and that leads to methods based on Newton-Raphson iteration.

We want our logarithm of a sum to be a sum of logarithms. But instead we'll settle for a proxy function that is a sum of logarithms. We'll make the derivatives of the proxy match those of the original function precisely so we're not making an arbitrary choice.

Write

The are constants we'll determine. We want to match the derivatives on either side of the at :

On the other hand we have

To achieve equality we want to make these expressions match. We choose

Our desired proxy function is:

So the procedure is to take an estimated and obtain a new estimate by optimising this proxy function with respect to . This is the standard EM algorithm.

It turns out that this proxy has some other useful properties. For example, because of the concavity of the logarithm, the proxy is always smaller than the original likelihood. This means that when we optimise it we never optimise ``too far'' and that progress optimising the proxy is always progress optimising the original likelihood. But I don't need to say anything about this as it's all part of the standard literature.

Afterword

As a side effect we have a general purpose optimisation algorithm that has nothing to do with statistics. If your goal is to compute

you can iterate, at each step computing

where is the previous iteration. If the take a convenient form then this may turn out to be much easier.

Note

This was originally written as a PDF using LaTeX. It'll be available here for a while. Some fidelity was lost when converting it to HTML.

Dimensionful Matrices

2016-08-06T19:23:00.000-07:00

Introduction

Programming languages and libraries for numerical work tend not to place a lot of emphasis on the types of their data. For example Matlab, R, Octave, Fortran, and Numpy (but not the now defunct Fortress) all tend to treat their data as plain numbers meaning that any time you have a temperature and a mass, say, there is nothing to prevent you adding them.

I've been wondering how much dimensions (in the sense of dimensional analysis) and units could help with numerical programming. As I pointed out on G+ recently (which is where I post shorter stuff these days), you don't have to limit dimensions to the standard ones of length, mass, time, dollars and so on. Any scale invariance in the equations you're working with can be exploited as a dimension giving you a property that can be statically checked by a compiler.

There are quite a few libraries to statically check dimensions and units now. For example Boost.Units for C++, units for Haskell and even quantities for Idris.

A matrix that breaks things

Even if a language supports dimensions, it's typical to define objects like vectors and matrices as homogeneous containers of quantities. But have a look at the Wikipedia page on the metric tensor. There is a matrix

which has the curious property that 3 entries on the diagonal seem to be dimensionless while the first entry is a squared velocity with dimension . This will break many libraries that support units. An obvious workaround is to switch to use natural units, which is much the same as abandoning the usefulness of dimensions. But there's another way, even if it may be tricky to set up with existing languages.

Heterogeneous vectors and matrices

According to a common convention in physics, a 4-vector has dimensions where I'm using the convention that we can represent the units of a vector or matrix simply as a vector or matrix of dimensions, and here is time and is length. The metric tensor is used like this: (where I'm using the Einstein summation convention so the 's and 's are summed over). If we think of having units of length squared (it is a pseudo-Riemannian metric after all) then it makes sense to think of having dimensions given by

We can write this more succinctly as

where is the usual outer product.

I'll use the notation to mean is of type . So, for example, . I'll also use pointwise notation for types such as and .

Now I can give some general rules. If is a matrix, and are vectors, and is a scalar, then only makes sense if . Similarly the "inner product" only makes sense if .

Generic vectors and matrices

Although these kinds of types might be useful if you're dealing with the kind of heterogeneous matrices that appear in relativity, there's another reason they might be useful. If you write code (in the imaginary language that supports these structures and understands dimensions and units) to be as generic as possible in the types of the vector and matrix entries, failures to type check will point out parts of the code where there are hidden assumptions, or even errors, about scaling. For example, consider a routine to find the inverse of a 3 by 3 matrix. Writing this generically as possible means we should write it to operate on a matrix of type , say. The result should have type . If this type checks when used with a suitably powerful type checker then it means that if we replace the units for type A, say, with units twice as large, it should have no effect on the result, taking into account those units. In this case, it means that if we multiply the numbers of the first row of the input by 0.5 then the numbers of the first column of the output should get multiplied by 2. In fact this is a basic property of matrix inverses. In other words, this mathematical property of matrix inverses is guaranteed by a type system that can handle units and heterogeneous matrices. It would be impossible to write a matrix inverter that type checks and fails to have this property. Unfortunately it's still possible to write a matrix inverter that type checks and is incorrect some other way. Nonetheless this kind of type system would put a very big constraint on the code and is likely to eliminate many sources of error.

An example, briefly sketched

I thought I'd look at an actual example of a matrix inverter to see what would happen if I used a type checker like the one I've described. I looked at the conjugate gradient method. At the Wikipedia page, note the line

This would immediately fail to type check because if is of generic vector type then isn't the same type as so they can't be added. I won't go into any of the details but the easiest way to patch up this code to make it type check is to introduce a new matrix of type and besides using it to make this inner product work (replacing the numerator by ) we also use anywhere in the code we need to convert a vector of type to a vector of type . If you try to do this as sparingly as possible you'll end up with a modified algorithm. But at first this seems weird. Why should this matrix inverse routine rely on someone passing in a second matrix to make it type check? And what is this new algorithm anyway? Well scroll down the Wikipedia page and you get to the preconditioned conjugate gradient algorithm. The extra matrix we need to pass in is the preconditioner. This second algorithm would type check. Preconditioned conjugate gradient, with a suitable preconditioner, generally performs better than pure conjugate gradient. So in this case we're getting slightly more than a check on our code's correctness. The type checker for our imaginary language would give a hint on how to make the code perform better. There's a reason for this. The original conjugate gradient algorithm is implicitly making a choice of units that sets scales along the axes. These determine the course taken by the algorithm. It's not at all clear that picking these scalings randomly (which is in effect what you're doing if you throw a random problem at the algorithm) is any good. It's better to pick a preconditioner adapted to the scale of the problem and the type checker is hinting (or would be if it existed) that you need to do this. Compare with the gradient descent algorithm whose scaling problems are better known.

But which language?

I guess both Agda and Idris could be made to implement what I've described. However, I've a hunch it might not be easy to use in practice.

Cofree meets Free

2014-05-23T22:21:00.000-07:00

> {-# LANGUAGE RankNTypes, MultiParamTypeClasses, TypeOperators #-}

Introduction

After I spoke at BayHac 2014 about free monads I was asked about cofree comonads. So this is intended as a sequel to that talk. Not only am I going to try to explain what cofree comonads are. I'm also going to point out a very close relationship between cofree comonads and free monads.

At the beginning of the talk the Google Hangout software seems to have switched to the laptop camera so you can't see the slides in the video. However the slides are here.

Cothings as machines

I often think of coalgebraic things as machines. They have some internal state and you can press buttons to change that internal state. For example here is a type class for a machine with two buttons that's related to a magma:

> class TwoButton a where
>   press :: a -> (a, a)

The idea is that the state of the machine is given by some type a and you could press either the left button or the right button. The result of pressing one or other button is given by these two functions:

> pressLeft, pressRight :: TwoButton a => a -> a
> pressLeft = fst . press
> pressRight = snd . press

(As with many metaphors used to explain Haskell type classes your mileage may vary. Sometimes you'll have to stretch your imagination to see what the set of buttons is for a particular cothing.)

Comonads

Just as monads are a kind of generalised algebraic structure (for example see my talk), comonads are a generalised kind of machine. The idea is that for any state of the machine there is a bunch of buttons we could press. But we don't have two buttons, or any fixed number of buttons. We instead have a functorful of buttons (if you think of functors by analogy with containers). We also don't get to directly see the internal state of the machine but instead we get to make observations.

Here's the type class:

> class Comonad w where
>   extract :: w a -> a
>   duplicate :: w a -> w (w a)

The state of the machine is given by w a. We observe the state using the extract function. And when we come to press a button, we have a functorful of new states that it could end up in. The duplicate function gives the container of those new states.

For example, various kinds of zipper give rise to comonads. Zippers allow you to "focus" on a part of a data structure. The extract operation allows you to observe the point that currently has focus. There is one button for every position in the structure where the focus could be. Pressing the corresponding button moves the focus to that point. Similarly the Store comonad has one button for each value you can store in the field it represents. Press the button and the value gets stored in the field.

Cofreeness as a way to memoise

Cofree coalgebras can be thought of as memoised forms of elements of coalgebras. For example, the TwoButton machine above has a function, press, as part of its definition. Memoising an element of such a thing means tabulating everything that could possibly happen if you pressed the buttons so we no longer need the press function. One approach is to try something like this:

data CofreeTwoButton = Memo CofreeTwoButton CofreeTwoButton

The structure contains two CofreeTwoButtons, each giving the result of pressing one of the two buttons. Any element of CofreeTwoButton may now be memoised like so:

memoiseTwoButton :: TwoButton m => m -> CofreeTwoButton
memoiseTwoButton m = Memo (memoiseTwoButton (pressLeft m)) (memoiseTwoButton (pressRight m))

It definitely tabulates the result of pressing buttons. But it has a major flaw. We have no way of seeing what's stored in the table! To make this useful we want to also store some data in the table that we can peek at. So here is a better definition:

> data CofreeTwoButton a = Memo a (CofreeTwoButton a) (CofreeTwoButton a)
> memoiseTwoButton :: TwoButton m => (m -> a) -> m -> CofreeTwoButton a
> memoiseTwoButton f m = Memo (f m) (memoiseTwoButton f (pressLeft m)) (memoiseTwoButton f (pressRight m))

The first argument to memoiseTwoButton says what we want to store in the table and then memoiseTwoButton goes ahead and stores it. We can use the identity function if we want to store the original elements.

Note how this is like foldMap:

foldMap :: Monoid m => (a -> m) -> t a -> m

if we replace t by the list functor and remember that lists are free monoids. The main difference is that arrows have been reversed. Where foldMap takes an element of a free monoid and interprets it as an element of another monoid, memoiseTwoButton packs an element of a TwoButton into a cofree structure. The "interpretation" and "packing" here are both homomorphisms for their respective structures. Homomorphisms respect equations so if an equation holds between elements of a free monoid we expect it to also hold when interpreted in another monoid. But any element of a free monoid can be interpreted in any other monoid meaning that any equation that holds between elements of a free monoid must hold in any monoid. That's why free monoids are designed so that the only equations that hold between elements are those that follow from the monoid laws.

With the TwoButton we have a dualised version of the above. Every element of every TwoButton can be packed into the CofreeTwoButton. So every equation in the original structure will still hold after the packing. So every equation that holds in some TwoButton must have some solution in CofreeTwoButton. That gives an idea of what a CofreeTwoButton is by analogy with the free monoid.

Cofree comonads

A cofree comonad is basically a memoised comonad. So the data structure is:

> data Cofree f a = Cofree a (f (Cofree f a))

At each point in the "table" we store some observable value of type a. And we have a functorful of buttons, so we expect to have a functorful of new states we could transition to. The Functor instance looks like:

> instance Functor f => Functor (Cofree f) where
>   fmap f (Cofree a fs) = Cofree (f a) (fmap (fmap f) fs)

We apply f to the observable value and then push the fmap f down to the child nodes.

The duplicate function takes a memoised state and replaces the observable stored at each position with the memoised state that gives rise to the observable.

> instance Functor f => Comonad (Cofree f) where
>   extract (Cofree a _) = a
>   duplicate c@(Cofree _ fs) = Cofree c (fmap duplicate fs)

Now by analogy with memoiseTwoButton we can memoise comonads.

> memoiseComonad :: (Comonad w, Functor f) =>
>                   (forall x.w x -> f x) -> (forall x.w x -> Cofree f x)
> memoiseComonad f w = Cofree (extract w) (fmap (memoiseComonad f) (f (duplicate w)))

So that's what a cofree comonad is: it's a type that can be used to memoise all of the states that are accessible from a state in a comonad by pressing its buttons.

Cofree comonad meets free monad

But that's not all. There is a close relationship between cofree comonads and free monads. So to get going, here's a free monad type:

> data Free f a = Id a | Free (f (Free f a))


> join' :: Functor f => Free f (Free f a) -> Free f a
> join' (Id x) = x
> join' (Free fa) = Free (fmap join' fa)


> instance Functor f => Functor (Free f) where
>   fmap f (Id x) = Id (f x)
>   fmap f (Free fa) = Free (fmap (fmap f) fa)


> instance Functor f => Monad (Free f) where
>   return = Id
>   m >>= f = join' (fmap f m)

Now I'll define a kind of pairing between functors. Given a way to combine two kinds of element, the pairing gives a way to combine a pair of containers of those elements.

> class (Functor f, Functor g) => Pairing f g where
>   pair :: (a -> b -> r) -> f a -> g b -> r


> data Identity a = Identity a
> instance Functor Identity where
>   fmap f (Identity x) = Identity (f x)


> instance Pairing Identity Identity where
>   pair f (Identity a) (Identity b) = f a b


> data (f :+: g) x = LeftF (f x) | RightF (g x)
> instance (Functor f, Functor g) => Functor (f :+: g) where
>   fmap f (LeftF x) = LeftF (fmap f x)
>   fmap f (RightF x) = RightF (fmap f x)


> data (f :*: g) x = f x :*: g x
> instance (Functor f, Functor g) => Functor (f :*: g) where
>   fmap f (x :*: y) = fmap f x :*: fmap f y


> instance (Pairing f f', Pairing g g') => Pairing (f :+: g) (f' :*: g') where
>   pair p (LeftF x) (a :*: _) = pair p x a
>   pair p (RightF x) (_ :*: b) = pair p x b


> instance (Pairing f f', Pairing g g') => Pairing (f :*: g) (f' :+: g') where
>   pair p (a :*: _) (LeftF x) = pair p a x
>   pair p (_ :*: b) (RightF x) = pair p b x


> instance Pairing ((->) a) ((,) a) where
>   pair p f = uncurry (p . f)

Given a pairing between f and g we get one between Cofree f and Free g.

> instance Pairing f g => Pairing (Cofree f) (Free g) where
>   pair p (Cofree a _) (Id x) = p a x
>   pair p (Cofree _ fs) (Free gs) = pair (pair p) fs gs

An element of Free g can be thought of as an expression written in a DSL. So this pairing gives a way to apply a monadic expression to a memoised comonad. In other words, if you think of comonads as machines, monads give a language that can be used to compute something based on the output of the machine.

Here's an almost trivial example just so you can see everything working together. A reasonable definition of a comagma structure on the type a is a -> UpDown a with UpDown defined as:

> data UpDown a = Up a | Down a


> instance Functor UpDown where
>   fmap f (Up a) = Up (f a)
>   fmap f (Down a) = Down (f a)


> type CofreeComagma a = Cofree UpDown a

A well known comagma structure on the positive integers is given by the famous Collatz conjecture:

> collatz :: Integer -> UpDown Integer
> collatz n = if even n then Down (n `div` 2) else Up (3*n+1)

We can memoise this as a cofree comonad:

> memoisedCollatz :: Integer -> CofreeComagma Integer
> memoisedCollatz n = Cofree n (fmap memoisedCollatz (collatz n))

Here's a picture of memoisedCollatz 12:

Now let's make the dual functor in readiness for building the dual monad:

> data Two a = Two a a
> instance Functor Two where
>   fmap f (Two a b) = Two (f a) (f b)

And here we set up a pairing:

> instance Pairing UpDown Two where
>   pair f (Up a) (Two b _) = f a b
>   pair f (Down a) (Two _ c) = f a c


> execute :: Cofree UpDown x -> Free Two (x -> r) -> r
> execute w m = pair (flip ($)) w m

This gives rise to a free monad isomorphic to the one in my talk:

> data Direction = WentUp | WentDown deriving Show


> choose :: Free Two Direction
> choose = Free (Two (return WentUp) (return WentDown))

And here's an example of some code written in the corresponding DSL:

> ex1 :: Free Two (Integer -> String)
> ex1 = do
>   x <- choose
>   y <- choose
>   case (x, y) of
>       (WentDown, WentDown) -> return (\z -> "Decreased twice " ++ show z)
>       _ -> return show

It can be represented as:

And here's what happens when they meet:

> go1 :: String
> go1 = execute (memoisedCollatz 12) ex1

This can be understood through the combined picture:

References

On getting monads from comonads more generally see Monads from Comonads. For more on memoising and how it's really all about the Yoneda lemma see Memoizing Polymorphic Functions. I'm waiting for Tom Leinster to publish some related work. The pairing above gives a way for elements of free monads to pick out elements of cofree comonads and is a special case of what I'm talking about here. But I think Tom has some unpublished work that goes further.

If you think of a comonad as a compressed object that is decompressed by a monadic decision tree, then you'd expect some form of information theoretical description to apply. That makes me think of Convex spaces and an operadic approach to entropy.

Types, and two approaches to problem solving

2014-05-17T08:22:00.000-07:00

Introduction

There are two broad approaches to problem solving that I see frequently in mathematics and computing. One is attacking a problem via subproblems, and another is attacking a problem via quotient problems. The former is well known though I’ll give some examples to make things clear. The latter can be harder to recognise but there is one example that just about everyone has known since infancy.

Subproblems

Consider sorting algorithms. A large class of sorting algorithms, including quicksort, break a sequence of values into two pieces. The two pieces are smaller so they are easier to sort. We sort those pieces and then combine them, using some kind of merge operation, to give an ordered version of the original sequence. Breaking things down into subproblems is ubiquitous and is useful far outside of mathematics and computing: in cooking, in finding our path from A to B, in learning the contents of a book. So I don’t need to say much more here.

Quotient problems

The term quotient is a technical term from mathematics. But I want to use the term loosely to mean something like this: a quotient problem is what a problem looks like if you wear a certain kind of filter over your eyes. The filter hides some aspect of the problem that simplifies it. You solve the simplified problem and then take off the filter. You now ‘lift’ the solution of the simplified problem to a solution to the full problem. The catch is that your filter needs to match your problem so I’ll start by giving an example where the filter doesn’t work.

Suppose we want to add a list of integers, say: 123, 423, 934, 114. We can try simplifying this problem by wearing a filter that makes numbers fuzzy so we can’t distinguish numbers that differ by less than 10. When we wear this filter 123 looks like 120, 423 looks like 420, 934 looks like 930 and 114 looks like 110. So we can try adding 120+420+930+110. This is a simplified problem and in fact this is a common technique to get approximate answers via mental arithmetic. We get 1580. We might hope that when wearing our filters, 1580 looks like the correct answer. But it doesn’t. The correct answer is 1594. This filter doesn’t respect addition in the sense that if a looks like a’ and b looks like b’ it doesn’t follow that a+b looks like a’+b’.

To solve a problem via quotient problems we usually need to find a filter that does respect the original problem. So let’s wear a different filter that allows us just to see the last digit of a number. Our original problem now looks like summing the list 3, 3, 4, 4. We get 4. This is the correct last digit. If we now try a filter that allows us to see just the last two digits we see that summing 23, 23, 34, 14 does in fact give the correct last two digits. This is why the standard elementary school algorithms for addition and multiplication work through the digits from right to left: at each stage we’re solving a quotient problem but the filter only respects the original problem if it allows us to see the digits to the right of some point, not digits to the left. This filter does respect addition in the sense that if a looks like a’ and b looks like b’ then a+b looks like a’+b’.

Another example of the quotient approach is to look at the knight’s tour problem in the case where two opposite corners have been removed from the chessboard. A knight’s tour is a sequence of knight’s moves that visit each square on a board exactly once. If we remove opposite corners of the chessboard, there is no knight’s tour of the remaining 62 squares. How can we prove this? If you don’t see the trick you can get get caught up in all kinds of complicated reasoning. So now put on a filter that removes your ability to see the spatial relationships between the squares so you can only see the colours of the squares. This respects the original problem in the sense that a knight’s move goes from a black square to a white square, or from a white square to a black square. The filter doesn’t stop us seeing this. But now it’s easier to see that there are two more squares of one colour than the other and so no knight’s tour is possible. We didn’t need to be able to see the spatial relationships at all.

(Note that this is the same trick as we use for arithmetic, though it’s not immediately obvious. If we think of the spatial position of a square as being given by a pair of integers (x, y), then the colour is given by x+y modulo 2. In other words, by the last digit of x+y written in binary. So it’s just the see-only-digits-on-the-right filter at work again.)

Wearing filters while programming

So now think about developing some code in a dynamic language like Python. Suppose we execute the line:

a = 1

The Python interpreter doesn’t just store the integer 1 somewhere in memory. It also stores a tag indicating that the data is to be interpreted as an integer. When you come to execute the line:

b = a+1

it will first examine the tag in a indicating its type, in this case int, and use that to determine what the type for b should be.

Now suppose we wear a filter that allows us to see the tag indicating the type of some data, but not the data itself. Can we still reason about what our program does?

In many cases we can. For example we can, in principle, deduce the type of

a+b*(c+1)/(2+d)

if we know the types of a, b, c, d. (As I’ve said once before, it’s hard to make any reliable statement about a bit of Python code so let's suppose that a, b, c and d are all either of type int or type float.) We can read and understand quite a bit of Python code wearing this filter. But it’s easy to go wrong. For example consider

if a>1 then:

return 1.0

else:

return 1

The type of the result depends on the value of the variable a. So if we’re wearing the filter that hides the data, then we can’t predict what this snippet of code does. When we run it, it might return an int sometimes and a float other times, and we won’t be able to see what made the difference.

In a statically typed language you can predict the type of an expression knowing the type of its parts. This means you can reason reliably about code while wearing the hide-the-value filter. This means that almost any programming problem can be split into two parts: a quotient problem where you forget about the values, and then problem of lifting a solution to the quotient problem to a solution to the full problem. Or to put that in more conventional language: designing your data and function types, and then implementing the code that fits those types.

I chose to make the contrast between dynamic and static languages just to make the ideas clear but actually you can happily use similar reasoning for both types of language. Compilers for statically typed languages, give you a lot of assistance if you choose to solve your programming problems this way.

A good example of this at work is given in Haskell. If you're writing a compiler, say, you might want to represent a piece of code as an abstract syntax tree, and implement algorithms that recurse through the tree. In Haskell the type system is strong enough that once you’ve defined the tree type the form of the recursion algorithms is often more or less given. In fact, it can be tricky to implement tree recursion incorrectly and have the code compile without errors. Solving the quotient problem of getting the types right gets you much of the way towards solving the full problem.

And that’s my main point: types aren’t simply a restriction mechanism to help you avoid making mistakes. Instead they are a way to reduce some complex programming problems to simpler ones. But the simpler problem isn’t a subproblem, it’s a quotient problem.

Dependent types

Dependently typed languages give you even more flexibility with what filters you wear. They allow you to mix up values and types. For example both C++ and Agda (to pick an unlikely pair) allow you to wear filters that hide the values of elements in your arrays while allowing you to see the length of your arrays. This makes it easier to concentrate on some aspects of your problem while completely ignoring others.

Notes

I wrote the first draft of this a couple of years ago but never published it. I was motivated to post by a discussion kicked off by Voevodsky on the TYPES mailing list http://lists.seas.upenn.edu/pipermail/types-list/2014/001745.html

This article isn’t a piece of rigorous mathematics and I’m using mathematical terms as analogies.

The notion of a subproblem isn’t completely distinct from a quotient problem. Some problems are both, and in fact some problems can be solved by transforming them so they become both.

More generally, looking at computer programs through different filters is one approach to abstract interpretation http://en.wikipedia.org/wiki/Abstract_interpretation. The intuition section there (http://en.wikipedia.org/wiki/Abstract_interpretation#Intuition) has much in common with what I’m saying.

The Monad called Free

2014-04-25T21:42:00.000-07:00

Introduction

As Dan Doel points out here, the gadget Free that turns a functor into a monad is itself a kind of monad, though not the usual kind of monad we find in Haskell. I'll call it a higher order monad and you can find a type class corresponding to this in various places including an old version of Ed Kmett's category-extras. I'll borrow some code from there. I hunted around and couldn't find an implementation of Free as an instance of this class so I thought I'd plug the gap.

> {-# LANGUAGE RankNTypes, FlexibleContexts, InstanceSigs, ScopedTypeVariables #-}


> import Control.Monad
> import Data.Monoid

To make things unambiguous I'll implement free monads in the usual way here:

> data Free f a = Pure a | Free (f (Free f a))


> instance Functor f => Functor (Free f) where
>     fmap f (Pure a) = Pure (f a)
>     fmap f (Free a) = Free (fmap (fmap f) a)


> instance Functor f => Monad (Free f) where
>     return = Pure
>     Pure a >>= f = f a
>     Free a >>= f = Free (fmap (>>= f) a)

The usual Haskell typeclass Monad corresponds to monads in the category of types and functions, Hask. We're going to want monads in the category of endomorphisms of Hask which I'll call Endo.

The objects in Endo correspond to Haskell's Functor. The arrows in Endo are the natural transformations between these functors:

> type Natural f g = (Functor f, Functor g) => forall a. f a -> g a

So now we are led to consider functors in Endo.

> class HFunctor f where

A functor in Endo must map functors in Hask to functors in Hask. So if f is a functor in Endo and g is a functor in Hask, then f g must be another functor in Hask. So there must be an fmap associated with this new functor. There's an associated fmap for every g and we collect them all into one big happy natural family:

>     ffmap :: Functor g => (a -> b) -> f g a -> f g b

But note also that by virtue of being a functor itself, f must have its own fmap type function associated with it. The arrows in Endo are natural transformations in Hask so the fmap for HFunctor must take arrows in Endo to arrows in Endo like so:

>     hfmap :: (Functor g, Functor h) => Natural g h -> Natural (f g) (f h)

Many constructions in the category Hask carry over to Endo. In Hask we can form a product of type types a and b as (a, b). In Endo we form the product of two functors f and g as

> data Product f g a = Product (f (g a))

Note that this product isn't commutative. We don't necessarily have an isomorphism from Product f g to Product g f. (This breaks many attempts to transfer constructions from Hask to Endo.) We also won't explicitly use Product because we can simply use the usual Haskell composition of functors inline.

We can implement some functions that act on product types in both senses of the word "product":

> left :: (a -> c) -> (a, b) -> (c, b)
> left f (a, b) = (f a, b)


> right :: (b -> c) -> (a, b) -> (a, c)
> right f (a, b) = (a, f b)


> hleft :: (Functor a, Functor b, Functor c) => Natural a c -> a (b x) -> c (b x)
> hleft f = f


> hright :: (Functor a, Functor b, Functor c) => Natural b c -> a (b x) -> a (c x)
> hright f = fmap f

(Compare with what I wrote here.)

We have something in Endo a bit like the type with one element in Hask, namely the identity functor. The product of a type a with the one element type in Hask gives you something isomorphic to a. In Endo the product is composition for which the identity functor is the identity. (Two different meanings of the word "identity" there.)

We also have sums. For example, if we define a functor like so

> data F a = A a | B a a

we can think of F as a sum of two functors: one with a single constructor A and another with constructor B.

We can now think about reproducing an Endo flavoured version of lists. The usual definition is isomorphic to:

> data List a = Nil | Cons a (List a)

And it has a Monoid instance:

> instance Monoid (List a) where
>   mempty = Nil
>   mappend Nil as = as
>   mappend (Cons a as) bs = Cons a (mappend as bs)

We can try to translate that into Endo. The Nil part can be thought of as being an element of a type with one element so it should become the identity functor. The Cons a (List a) part is a product of a and List a so that should get replaced by a composition. So we expect to see something vaguely like:

List' a = Nil' | Cons' (a (List' a))

That's not quite right because List' a is a functor, not a type, and so acts on types. So a better definition would be:

List' a b = Nil' b | Cons' (a (List' a b))

That's just the definition of Free. So free monads are lists in Endo. As everyone knows :-) monads are just monoids in the category of endofunctors. Free monads are also just free monoids in the category of endofunctors.

So now we can expect many constructions associated with monoids and lists to carry over to monads and free monads.

An obvious one is the generalization of the singleton map a -> List a:

> singleton :: a -> List a
> singleton a = Cons a Nil


> hsingleton :: Natural f (Free f)
> hsingleton f = Free (fmap Pure f)

Another is the generalization of foldMap. This can be found under a variety of names in the various free monad libraries out there but this implementation is designed to highlight the similarity between monoids and monads:

> foldMap :: Monoid m => (a -> m) -> List a -> m
> foldMap _ Nil = mempty
> foldMap f (Cons a as) = uncurry mappend $ left f $ right (foldMap f) (a, as)


> fold :: Monoid m => List m -> m
> fold = foldMap id


> hFoldMap :: (Functor f, Functor m, Monad m) => Natural f m -> Natural (Free f) m
> hFoldMap _ (Pure x) = return x
> hFoldMap f (Free x) = join $ hleft f $ hright (hFoldMap f) x


> hFold :: Monad f => Natural (Free f) f
> hFold = hFoldMap id

The similarity here isn't simply formal. If you think of a list as a sequence of instructions then foldMap interprets the sequence of instructions like a computer program. Similarly hFoldMap can be used to interpret programs for which the free monad provides an abstract syntax tree.

You'll find some of these functions here by different names.

Now we can consider Free. It's easy to show this is a HFunctor by copying a suitable definition for List:

> instance Functor List where
>   fmap f = foldMap (singleton . f)


> instance HFunctor Free where
>     ffmap = fmap
>     hfmap f = hFoldMap (hsingleton . f)

We can define HMonad as follows:

> class HMonad m where
>     hreturn :: Functor f => f a -> m f a
>     hbind :: (Functor f, Functor g) => m f a -> Natural f (m g) -> m g a

Before making Free an instance, let's look at how we'd make List an instance of Monad

> instance Monad List where
>     return = singleton
>     m >>= f = fold (fmap f m)

And now the instance I promised at the beginning.

> instance HMonad Free where
>     hreturn = hsingleton
>     hbind m f = hFold (hfmap f m)

I've skipped the proofs that the monad laws hold and that hreturn and hbind are actually natural transformations in Endo. Maybe I'll leave those as exercises for the reader.

Update

After writing this I tried googling for "instance HMonad Free" and I found this by haasn. There's some other good stuff in there too.

Reinversion Revisited

2014-02-01T17:53:00.000-08:00

Introduction

A while back I talked about the idea of reinversion of control using the continuation monad to wrest control back from an interface that only wants to call you, but doesn't want you to call them back. I want to return to that problem with a slightly different solution. The idea is that we build an interpreter for an imperative language that's an embedded Haskell DSL. You arrange that the DSL does the work of waiting to be called by the interface, but from the point of view of the user of the DSL it looks like you're calling the shots. To do this I'm going to pull together a bunch of techniques I've talked about before. This approach is largely an application of what apfelmus described here.

The code

We'll start with some administrative stuff before getting down to the real code:

> {-# LANGUAGE TemplateHaskell #-}


> import Control.Lens
> import Control.Monad
> import Control.Monad.Loops

We'll make our DSL an imperative wrapper around Gloss:

> import Graphics.Gloss.Interface.Pure.Game

We'll define a structure that can be used to represent the abstract syntax tree (AST) of our DSL. Our DSL will support the reading of inputs, adding pictures to the current picture, and clearing the screen.

First we'll need a wrapper that allows us to represent ordinary Haskell values in our DSL:

> data Basic a = Return a

Now we want an expression that represents events given to us by Gloss. Internally we'll represent this by a function that says what our program does if it's given an event. It says what our program does by returning another AST saying what happens when the input is received. (I've previously talked about these kinds of expression trees here).

>              | Input (Event -> Basic a)

We have a command to render some graphics. It appends a new Picture to the current picture. Again, part of the AST muct be another AST saying what happens after the picture is rendered:

>              | Render Picture (Basic a)

And lastly here's the AST for a clear screen command:

>              | Cls (Basic a)

Our AST will form a monad. This will allow us to build ASTs using ordinary Haskell do-notation. This technique is what I described previously here.

> instance Monad Basic where
>     return = Return
>     Return a >>= f = f a
>     Input handler >>= f = Input (\e -> handler e >>= f)
>     Render p a >>= f = Render p (a >>= f)
>     Cls a >>= f = Cls (a >>= f)

You can think of the expression x >>= f as x with the tree f a grafted in to replace any occurrence of Return a in it. This is exactly what Return a >>= f does. But applying >>= f to the other ASTs simply digs down "inside" the ASTs to find other occurrences of Return a.

It's convenient to uses lenses to view Gloss's game world:

> data World = World { _program :: Basic (), _picture :: Picture }
> $(makeLenses ''World)

And now we have some wrappers around the interpreter's commands. The return () provides the convenient place where we can graft subtrees into our AST.

> input = Input return
> render p = Render p (return ())
> cls = Cls (return ())

Now we can start coding. Here's a test to see if a Gloss event is a key down event:

> keydown (EventKey (Char key)            Down _ _) = True
> keydown (EventKey (SpecialKey KeySpace) Down _ _) = True
> keydown _ = False

And now here's a complete program using our DSL. It's deliberately very imperative. It simply iterates over a nested pair of loops, collecting keystrokes and displaying them. It reads a lot like an ordinary program written in a language like Python or Basic:

> mainProgram = do
>     render (Color white $ Scale 0.2 0.2 $ Text "Type some text")


>     forM_ [780, 760..] $ \ypos -> do
>         forM_ [0, 20..980] $ \xpos -> do


>             event <- iterateUntil keydown $ input


>             let key = case event of
>                         EventKey (Char key)            Down _ _ -> key
>                         EventKey (SpecialKey KeySpace) Down _ _ -> ' '


>             when (ypos == 780 && xpos == 0) $ cls
>             render $ Color white $ Translate (xpos-500) (ypos-400) $ Scale 0.2 0.2 $ Text $ [key]

Here is where we launch everything, placing our program and starting Blank picture into the World.

> main = play (InWindow "Basic" (1000, 800) (10, 10))
>  black 
>  60
>  (World mainProgram Blank)
>  (^. picture)
>  handleEvent
>  (const id)

So now we need just one more ingredient, an actual interpreter for our AST. It's the event handler:

> handleEvent :: Event -> World -> World

The Return command is purely a place to graft in subtrees. It should never be interpreted.

> handleEvent _ (World (Return a) _) = error "error!"

After receiving some input, I want the interpreter to keep interpreting commands such as Cls that don't need any more input. I'm going to do this by using a null event EventMotion (0,0). But when an input really is desired, I want this null event to be ignored.

> handleEvent (EventMotion (0, 0)) state@(World (Input handler) _) = state

We render something by mappending it to the current picture stored in the World. But the rendering is carried out by the event handler. We update the state so that at the next event, the subtree of the AST is executed. This means that after updating the picture, the event still needs to be handed back to the event handler:

> handleEvent event state@(World (Render p cont) _) = state & (picture <>~ p)    & (program .~ cont) & handleEvent event

Clearing the screen is similar:

> handleEvent event state@(World (Cls cont)      _) = state & (picture .~ Blank) & (program .~ cont) & handleEvent event

And now we need to handle inputs. We do this by applying the "what happens when the input is received" function to the event. The result is put back in the state indicating that this is what we want to happen at the next event. So the interpreter doesn't stop here, waiting for the next event, the interpreter sends itself a null event.

> handleEvent event state@(World (Input handler) _) = state & (program .~ handler event) & handleEvent (EventMotion (0, 0))

And that's it!

There are many changes that can be made. We can easily add more commands and make the state more complex. But you might also notice that we create the AST only to tear it apart again in the interpreter. We can actually elide the AST creation, but that will eventually bring us back to something like what I originally posted. This shouldn't be a big surprise, I've already shown how any monad can be replaced with the continuation monad here. By the way, it's pretty easy to add a Fork command. You can replace the _program :: Basic() field with _program :: [Basic ()] and interpret this as a list of threads using a scheduler of your choice.

Acknowledgements

I was prompted to write this (a little late, I know) after reading this article and Tekmo's post on reddit. I think ultimately continuations may perform better than using ASTs. But sometimes it's nice to build an AST because they give you an object that can easily be reasoned about and manipulated by code. Much as I love trickery with continuations, I find ASTs are much easier to think about.

Postscript

My real motivation was that I was thinking about games. The rules of games are often given in imperative style: first player 1 does this. Then they do this. If this happens they do that. And then it's player two's turn. I wanted my Haskell code to reflect that style.

Update

Added 'null' event to keep interpreter going when it makes sense to do so, but there's no event pending.

Distributed computing with alien technology

2013-10-25T21:00:00.000-07:00

Introduction

Suppose we are given a function of boolean arguments that returns a boolean result. Alice has bits, and Bob has another bits . Alice and Bob are widely separated and don't know each other's bits. What is the total number of bits that Alice has to send to Bob and that Bob has to send to Alice so that between them they can compute ? Think about how complex might get. The and might each describe half of a "voxelised" region of space and might answer a question about a computational fluid dynamics (CFD) simulation running in that space. CFD simulations can be chaotic and so we might expect that in the worst case many bits have to be transferred back and forth between Alice and Bob. In the worst case we might expect that Alice has to send Bob all of her bits, or vice versa.

But in fact Alice needs to send Bob just one bit.

A loophole

To get the communication requirements down to one bit we need to use a loophole. But I hope to (1) justify the cheat to some extent and (2) justify that it's even worthwhile to think about cheats.

Alice and Bob have access to some Ancient technology. They each have one of a pair of boxes. At prearranged times, Alice puts a bit into her box, and Bob puts a bit into his box. A bit pops back out of Alice's box and a bit pops back out of Bob's box. Whatever the input, both Alice and Box have a 0.5 chance of seeing a one or zero pop out of their respective boxes. But when the two outputs are XORed together the result is the logical AND of the two inputs. With such boxes, Alice can compute after Bob sends a single bit down a conventional communication channel.

"But this is a total cheat!" you complain before I even start to explain their technique. It seems Alice receives a bit that depends on what Bob input, and so Bob is communicating with Alice. But look closely and you'll see that the boxes don't allow any communication. No matter what Bob inputs, Alice has a 0.5 chance of getting zero or one. There is no way Bob can use this to communicate anything. It's like intercepting a message encrypted with a one time pad. Without the pad, the message is basically a sequence of random bits. Nonetheless, it is true that the outputs that Alice and Bob see are correlated.

I hope I've convinced you that Alice and Bob can't send any bits with these boxes. Despite this, it is pretty clear that the behaviour of the boxes is non-local. We'll call any kind of boxes that allow instantaneous long range correlations that can't be explained by purely local behaviour non-local boxes. Boxes that can't be used for message sending are called non-signalling local boxes. And the particular non-local box I describe above is called a PR box (eg. see here).

(BTW As an aside note that as the box results in widely separated outputs that are correlated, but doesn't allow communication, it's an example of how non-locality doesn't imply communication. Usually when people want to give examples of such a thing they talk about quantum mechanics. But there's no need to mention quantum mechanics to explain the behaviour of these particular non-local boxes.)

The method

Any single bit boolean function of a finite sequence of bits can be written as a polynomial modulo 2. Each monomial in the polynomial can be written as a product of terms involing just the and terms involving just the , ie.

where depends only on the , depends only on the and is drawn from some finite set. Alice can compute the and Bob can compute the . Now Alice and Bob, in parallel, feed and respectively into their PR boxes. We know that we could evaluate each term in the sum we want by adding Alice's output to Bob's output. But that would require sending one one-bit message for each . But we don't need each term one by one; we just want the sum. So Alice and Bob can individually sum their separate outputs knowing that adding Alice's output and Bob's output modulo 2 will be the correct sum. So Bob sends his sum to Alice. Alice adds that number to her own (modulo 2) and that's the value we want. Only one one-bit message was sent.

But what about reality?

Non-local boxes don't exist, do they? So why are we talking about them?

Actually, non-local boxes exist both theoretically and in the lab. Non-local correlations in quantum mechanics allow them to be constructed. But for this article I wanted to abstract from quantum mechanics and talk about the behaviour of a non-local box without getting my hands dirty with the details of quantum mechanics. Having said that, although non-local boxes do exist, the special case of the PR box can't in fact be constructed with quantum mechanics. In some sense it allows correlations that are "too strong". An article I wrote a while back describes the closest you can get to building a PR box with quantum correlations. Curiously, if you restrict yourself to the kind of non-local box quantum mechanics allows you to build you find that some functions can still be computed with less communication than you'd need if non-local correlations are disallowed. Nonetheless, the worst case scenario with QM still requires the sending of bits.

Going further there's an interesting conjecture. It says that any non-local box that is even marginally better (in some sense) than what quantum mechanics allows is powerful enough to allow the computation of any with only a single bit of communication. It suggests that quantum mechanics is right at the edge of the space of possible physics that make life difficult for us. If quantum mechanics were to be tweaked the tiniest amount to make correlations any stronger, large numbers of difficult distributed computing problems would suddenly collapse to become trivial. If the conjecture is true it means that nature looks a bit like a conspiracy to keep computer scientists in work. (It's possible the conjecture has been decided one way or the other by now.)

Final words

There are a couple of papers about universes where PR boxes can be built; so called boxworlds. There is a lot of interesting theoretical work in characterising quantum mechanics. In particular there are a number of theorems and conjectures that describe QM in the form "the most X theory that doesn't allow Y" where X is an interesting property and Y is something you'd like to do.

References

I learnt all of this from the paper Implausible Consequences of Superstrong Nonlocality by Wim van Dam.

What stops us defining Truth?

2013-10-12T09:12:00.000-07:00

Introduction
Recall the standard cartoon sketch of the proof of Gödel's first incompleteness theorem. We start by defining a predicate, , that is true if and only if its argument is provable. (Or more accurately, is true if is the Gödel number of a provable proposition.) With some quining we can use this to construct the proposition which says . The proposition asserts its own unprovability.
Suppose instead we define a predicate which holds if its argument is true. We can use this to construct the proposition which says . Then if is true it must also be false and if it's false then it must be true. We seem to have a paradox. The loophole is that we assumed the existence of the predicate . So this argument demonstrates that there is actually no such predicate. This is Tarski's undefinability theorem.
But what exactly stops us defining ? What goes wrong if we attempt to define a predicate that analyses the parts of a proposition to tell us whether or not it is true?

Note
This article is written in English. But as is standard in much of mathematics, unless I state otherwise, I'm using English largely as shorthand for an argument that could, in principle, be written in the formal language of Set Theory. So I will allow myself to use all of the usual reasoning methods that are available in ZF, even when talking about other formal systems such as Peano Arithmetic.

Defining Truth for Propositional Calculus
Suppose we're given a proposition from propositional calculus like . We can use a syntactic approach to determining whether or not it is true. We determine whether or not is true, then whether or not is true, and then the whole proposition is true if both and are true. Similarly is true if either or is true. Of course and might themselves be compound propositions using , and . But that's fine, that simply means that to define truth for such propositions we need to employ recursion. In fact, we can straightforwardly turn such a definition into a recursive computer program.
(Ultimately with propositional calculus we hit the leaves which are atomic propositions like . Typically when we ask about the truth of a proposition in propositional calculus we've already made an assignment of truth values to the atomic propositions. So the base case for the recursion is straightforward.)
We can illustrate the process with a diagram:

The truth value of a node in the tree is determined by the truth of the propositions hanging underneath it. We have a parent-child relation between a proposition and its subexpressions. Recursion allows us to make a definition by defining what happens on the leaves of such a tree, and by saying how the definition at a node is built from that of its children.

Defining truth for Peano Arithmetic
We can go further and attempt this approach with Peano Arithmetic (PA). The catch is that we need to consider quantifiers. For example, consider this proposition from Peano arithmetic: . This proposition is true if and only if is true whatever number we substitute for in the expression.

The proposition at the top of the tree above is true if all of the immediate children are true and their truth is in turn determined by the truth of the propositions immediately below them. With some work this eventually leads to a perfectly good definition of truth for propositions in PA. Because we have nodes with infinitely many children we don't get an algorithm guaranteed to terminate, but that's not a problem for a definition in ZF. Note that we don't literally prove the infinitely many child propositions one at a time. Instead what happens is that to define the truth of we define it in terms of the truth of some infinite family of propositions all based on . ZF is perfectly good at dealing with such definitions without us having to list every element of our family explicitly.
Note how in this case the tree isn't the parse tree of the proposition. It's much bigger with nodes that have infinite branching. But that's fine, there's nothing about infinite branching that prevents us making a recursive definition. So we can ultimately extend the idea for defining truth in propositional calculus to include quanifiers and then all of Peano arithmetic.

Defining truth for ZF
But the approach used for PA looks like it might work perfectly well for ZF as well. For example, our definition of truth would say that is true if is true whatever set we substitute for . In ZF there is no difficulty in defining a predicate that uses quantification over all sets. So it seems we can define for ZF in ZF, contradicting Tarski's theorem.

What went wrong?
Recursive definitions typically rely on the parent-child relation I mentioned above. To recursively define something we (1) define it for all leaves and then (2) specify how the definition at a parent is given in terms of the value for all of its children. We then invoke a recursion theorem of some sort to show how this uniquely defines our object for everything in our universe. For example, one form of recursion in Peano arithmetic has as its leaf and the only child of is . The induction axiom for PA can be used to show that definitions using this parent-child relation are valid.
Similarly in ZF we have the empty set as leaf and the children of a set are simply its elements. But now we need to look closely at the recursion principle we need. For ZF we need to invoke the Transfinite Recursion Theorem. Transfinite recursion is very powerful. It's not just limited to induction over sets. It can also be used for induction over classes. For example if you need to recursively define a function on the class of all sets it can allow this. (Strictly speaking it'll be a function class rather than a function.) But now comes the catch. If you take a look at the Wikipedia article it mentions that the parent-child relation, , needs to be set-like (though as the article is currently written it's almost an afterthought). For this theorem to apply we need the collection of children of a proposition to form a set. But to prove the truth of a proposition with a quantifier at the front we need to prove something is true for all children where there is one child for each set. This means the children don't form a set. So we can't use transfinite recursion. And this means the informal definition of truth I gave above can't be turned into a rigorous definition.

Conclusion
I think this issue is quite subtle. It's really easy to say in English "this thing is true if that thing is true for all sets". Such a sentence in isolation can often be turned into a rigorous proposition in ZF. But if that sentence is part of a collection of sentences that refer to each other forming an attempt at a mutually recursive definition, you need to check precisely what parent-child relation you're using.

Acknowledgement
Thanks to Sridar Ramesh for making clear to me why the attempted definition of truth in ZF doesn't work. But I've probably made some mistakes above and they have nothing to do with Sridar.

Why Heisenberg can't stop atomic collapse

2013-04-14T18:08:00.003-07:00

TL;DR

A heuristic argument to show that hydrogen atoms are stable and have a minimum energy level is wrong. I will assume undergraduate level quantum mechanics in the discussion.

Introduction

There's a popular argument used to explain why atoms are stable. It shows there is a lower bound on the energy level of an electron in the atom that makes it impossible for electrons to keep "falling" forever all the way down to the nucleus. You'll find it not only in popular science books but in courses and textbooks on quantum mechanics.

A rough version of the argument goes like this: the closer an electron falls towards the nucleus the lower its potential energy gets. But the more closely bound to the nucleus it is, the more accurately we know its position and hence, by Heisenberg's uncertainty principle (HUP), the less accurately we know its momentum. Increased variance in the momentum corresponds to an increase in kinetic energy. Eventually the decrease in potential energy as the electron falls is balanced by an increase in kinetic energy and the electron has reached a stable state.

The problem is, this argument is wrong. It's wrong related to the kind of heuristic reasoning about wavefunctions that I've talked about before.

Before showing it's wrong, let's make the argument a bit more rigorous.

Bounding wavefunctions

The idea is to show that for any possible normalised wavefunction ψ of an electron in a Coulomb potential, the expected energy is bounded below by some constant. So we need to show that

is bounded below where

and p is momentum.

Consider a wavefunction that is confined mainly around the nucleus so

The first fact we need is that Heisenberg uncertainty principle tells us that

(assuming we're in a frame of reference where the expected values of p and x are zero).

If the wavefunction is spread out with a standard deviation of a then the electron is mostly around a distance a from the nucleus. So the second fact is that we can roughly approximate the expected value of 1/r as 1/a.

Combine these two facts and we get, roughly, that

I hope you can see that the right hand side, as a function of a, is bounded below. The graph of the right hand side as a function of a looks like:

It's now an exercise in calculus to find a lower bound on the expected energy. You can find the details in countless places on the web. Here a link to an example from MIT, which may have come directly from Feynman's Lectures on Physics.

The problem

The above discussion assumes that the wavefunction is basically a single packet confined around a distance a from the nucleus, something like that graphed above. But if a lower energy state can be found with a different wavefunction the electron will eventually find it, or an even lower energy state. In fact, by using a wavefunction with multiple peaks we will find that the Heisenberg uncertainty principle doesn't give a lower bound at all.

We'll use a wavefunction like this:

It has a packet around the origin just like before but it also has a sharp peak around r=l. As I'm showing ψ as a function of r this means we have a shell of radius l.

Let's say

where ψ₁ is normalized and peaked near the original and ψ₂is our shell of radius l. Assume no overlap between ψ₁ and ψ₂.

In this case you can see that we can make

as large as we like by making l as large as we like while still leaving us free to make the central peak whatever shape we want. This means that the estimate of

coming from HUP can be made as small as we like while making the central peak as close to a Dirac delta as we want. Informally, HUP controls of the overall spread of the wave function but not the spread of individual peaks within it.

For a large enough shell, ψ₂contributes little to the total expected potential energy, but ψ₁ can contribute an arbitrarily low amount because we can concentrate it in areas where 1/r is as large as we want. So we can make the total expected potential energy as low as we like. And yet we can also keep the estimate of the kinetic energy given by HUP as close to zero as we like. So contrary to the original argument, the Heisenberg uncertainty principle doesn't give us a lower bound on the energy at all. The argument is wrong.

But wait, we know there is a lowest energy state...

Yes, the energy of a wavefunction in a Coulomb potential is in fact bounded below. After all, atoms are stable. But the Heisenberg uncertainty principle doesn't show it. The inequality in HUP becomes an equality when the wavefunction is a Gaussian function. It provides a good bound for functions that are roughly Gaussian, ie. that form a single "lump". But it provides only weak bounds for wavefunctions with multiple peaks and in this case it's not the appropriate tool to use.

The Heisenberg uncertainty principle is an inequality about ordinary functions interpreted in the context of quantum mechanics (QM). The field of functional analysis provides many such inequalities. A great paper by Lieb, The Stability of Matter, gives an inequality due to Sobolev that can also be interpreted in the context of QM. Sobolev's inequality is more appropriate when considering the hydrogen atom and it gives a good lower bound, demonstrating that the hydrogen atom is stable after all.

But wait, the Heisenberg uncertainty principle argument gives the right energy...

Getting a correct answer doesn't always justify the methods. I can give at least two reasons why the original method appears to work.

1. The HUP gives a good bound for wavefunctions that are roughly Gaussian. The lowest energy level for the hydrogen atom is given (very roughly) by such a function. So an estimate based on HUP should be roughly correct. However, HUP alone can't tell us that the lowest energy state is Gaussian. The argument is only useful if we can get this information from somewhere else.

2. You can get an estimate for the lowest energy level of the hydrogen atom (assuming it exists) by dimensional analysis. Invalid physical arguments that are dimensionally correct will often give the correct result because there is only one dimensionally correct expression possible.

But wait, it's just a heuristic argument...

Heuristic arguments are crucial to physics. But when similar heuristic arguments give opposite results they become problematic. In particular, it's no good saying an argument is inexact or qualitative when it gives a bound on the energy that isn't just off by an order of magnitude, but completely fails to give a bound at all. Part of the issue here is that the Coulomb potential goes to infinity as r goes to zero and so more care is required. The HUP argument above can be adapted to give good results when the potential is bounded below, for example it gives a reasonable estimate for square wells.

But there may be a clever way of using HUP to bound the energy that I haven't seen. If you can see it, please tell me.

The source

Most of what I said above I learnt from the excellent paper on the Stability of Matter by Lieb that I mentioned above.

Aliasing and the Heisenberg uncertainty principle.

2013-01-13T16:59:00.002-08:00

TL;DR

The Dirac comb is an example of a wavefunction whose position and momentum aren't fuzzy.

Introduction

The Heisenberg uncertainty principle says that if you have a particle in some state and observe either its momentum or its position then the products of the standard deviations of distributions of the outcomes satisfy this identity:

I think many people have a mental picture a bit like this:

You can know the position and momentum with some degree of fuzziness and you can trade the fuzziness between the two measurements as long as the product of their sizes is larger than ℏ/2.

Here's another way of thinking about that kind of picture (assuming some units I haven't specified):

position	=	123.4???
momentum	=	65?.???

The idea is that the question mark represents digits we don't know well. As you move towards the right in the decimal representation our certainty in the accuracy of the digit quickly goes downhill to the point where we can't reasonably write digits.

But this picture is highly misleading. For example, the following state of affairs is also compatible with the uncertainty principle, in suitably chosen units:

position	=	...???.123...
momentum	=	...???.654...

In other words, it's compatible with the uncertainty principle that we could know the digits beyond the decimal point to as much accuracy as we like as long as we don't know the digits before the point. It trivially satisfies Heisenberg's inequality because the variance of the position and the momentum aren't even finite quantities.

But being compatible with Heisenberg uncertainty isn't enough for something to be realisable as a physical state. Is there a wavefunction that allows us to know the digits to the right of the decimal point as far as we want for both position and momentum measurements?

Sampling audio and graphics

Maybe surprisingly, the worlds of audio and graphics can help us answer this question. Here's what a fraction of a second of music might look like when the pressure of the sound wave is plotted against time:

But if we sample this signal at regular intervals, eg. at 44.1KHz for a CD, then we can graph the resulting signal as something like this:

The red curve here is just to show what the original waveform looked like. The black vertical lines correspond to regular samples and we can represent them mathematically with Dirac delta functions multiplied by the amplitude measured at the sample.

There is a well known problem with sampling like this. If you sample a signal that is a sine wave sin(ωt) at rate f then the signal sin((ω+2πnf)t) will generate exactly the same samples for any integer n. The following illustration shows what might happen:

The two waveforms are sampled at the same regular intervals (shown by vertical lines) and give exactly the same amplitudes at those samples.

This forms the basis for the famous Nyquist-Shannon sampling theorem. You can reconstruct the original signal from regularly spaced samples only if it doesn't contain frequency components higher than half your sampling rate. Otherwise you get ambiguities in the form of high frequency parts of the signal masquerading as low frequency parts. This effect is known as aliasing. As a result, the Fourier transform of a sampled function is periodic with the "repeats" corresponding to the aliasing.

In the audio world you need to filter your sound to remove the high frequencies before you sample. This is frequently carried out with an analogue filter. In the 3D rendering world you need to do something similar. Ray tracers will send out many rays for each pixel, in effect forming a much higher resolution image than the resolution of the final result, and that high resolution image is filtered before being sampled down to the final resulting image. The "jaggies" you get from rendering polygons are an example of this phenomenon. It seems like jaggies have nothing to do with the world of Fourier transforms. But if you compute the Fourier transform of a polygonal image, remove suitable high frequency components, and then take the inverse Fourier transform before sampling you'll produce an image that's much more pleasing to the eye. In practice there are shortcuts to achieving much the same effect.

The connection to physics

Now consider a particle whose wavefunction takes the form of the Dirac comb:

This is a wavefunction that is concentrated at multiples of some quantity a, ie. ∑δ(x-an) summing over n = ...,-1,0,1,2,... If the wavefunction is ψ(x) then the probability density function for the particle position is |ψ(x)|². So the particle has a zero probability of being found at points other than those where x=na. In other words, modulo a, the particle position is given precisely.

But what about the particle momentum. Well the wavefunction has, in some sense, been sampled onto the points na, so we expect that whatever the momentum distribution is it'll be ambiguous modulo b where ab=ℏ. In fact, if we take the Fourier transform of the Dirac comb we get another Dirac comb. So in the frequency domain we get the same kind of phenomenon: the momentum is concentrated at integer multiples of b. So now we know we have a wavefunction whose uncertainty precisely fits the description I gave above. We know the position precisely modulo a and the momentum precisely modulo b. In some sense this isn't contrived: we know the momentum modulo b precisely because of the aliasing that results from knowing the position modulo a.

What this means

The message from this is that position-momentum uncertainty isn't fuzziness. At least it's not fuzziness in the ordinary sense of the word.

And in reality

I'm not very experienced in attaching numbers to results from theoretical physics so I'd find it hard to say how accurately we can create a Dirac comb state in reality. When we measure a position using interferometry techniques we automatically compute the position modulo a wavelength so this isn't an unusual thing to do. Also an electron in a periodic potential may take on a form that consists of a train of equally spaced lumps. Even if not described exactly by a Dirac comb, we can still know the position modulo a and the momentum modulo b much more accurately than you might expect from a naive interpretation of the Heisenberg uncertainty principle as fuzziness.

Exercises
1. Investigate approximations to the Dirac comb: eg. what happens if we sum only a finite number of Dirac deltas, or replace each delta with a finite width Gaussian, or both.
2. Investigate the "twisted" Dirac comb: ∑δ(x-an)exp(inθ) where θ is some constant.

Shuffles, Bayes' theorem and continuations.

2012-12-30T08:49:00.000-08:00

Introduction
Back in the 80s when I was a kid I came across a program for the BBC Micro that could tell what card you had picked from a deck of cards even though you'd buried your card within the deck wherever you wanted and had cut and shuffled the deck. I thought I'd try to implement a slightly more sophisticated version of the same trick that could handle multiple shuffles, and multiple types of shuffle.
The idea is that we prepare a deck of cards in some known sequence and have someone pick out a card and place it at the top of the deck. They perform some kind of randomisation procedure on the cards, eg. cut and shuffle it a couple of times, and then you get to look at the final sequence of cards. Can we tell which card was picked out?
Some probability theory
Let's formalise this a bit. Our decks will have cards. There is a small number of initial states our deck can be in, corresponding to the known sequence with a single card moved to the top. Let's label these initial states . There is a (usually large) number of permutations that could be applied through shuffling. We'll label these . We'll try to do arrange that this isn't simply the set of all permutations (though it's not necessarily a disaster if it is).
We want to figure out the initial state given some final state . In other words we want

We can use Bayes theorem to get:

Now is the sum over all ways of starting with and ending up with . So

where the sum is over all such that . I'm assuming that the shuffles are independent of the initial sequence of cards. This gives us an algorithm. We do a brute force simulation of every possible shuffle that we're considering applied to each possible initial state. After each shuffle we sum the corresponding probability for those shuffles that give our known final state .
Each shuffle is going to be built up as the product of a sequence of building blocks with each block randomly selected based on what happened before. Let's call the blocks names like . So if then . As we work through the shuffle we will accumulate the probability. After the first block we have a probability of . The probability after the second is and so on. At any point we'll call the probability accumulated so far the importance. I've borrowed that name from the world of rendering because this algorithm has a remarkable similarity to recursive ray-tracing.
Some computer science
I'd like to be able to chain a sequence of shuffles. But wait! There's a catch! Today's the day I finally want to get around to checking out the lambda expression support in C++. I've been putting this off for years. (I'm using gcc 4.7.) So I'm not going to have a Haskell non-determinism monad to make life easy.
Suppose I have two types of shuffle, type and type . I could easily write a loop to iterate over all shuffles of type , and in the innermost part of the loop I could call another loop over all shuffles of type . But then if I want to replace with I have to change the code to replace the inner part with code for . That's no good. I'd like to be able to replace the innermost part of the outer loop with any code I want without actually editing that part of the code. It's easy with lambda expressions. I write the type loop code so that it takes as argument a lambda function representing what I want done inside the loop.
There's another way of looking at this. You can skip this paragraph if you don't care about the connection to Haskell. But in Haskell you might do something like this by using a non-determinism monad, or even a probability monad. But as I pointed out a while back, you can fake every monad using the continuation monad. One way to implement continuations in C++ is to use continuation passing style. And that's what I'll do. The continuations are just the lambdas that I mentioned in the previous paragraph.
Some C++ code

> #include <iostream>
> #include <cstdlib>
> using namespace std;

You can bump this up the deck size if you have the CPU power:

> const int deck_size = 13;

A deck of cards is represented by a simple array of integers with each card being assigned a unique integer.

> struct Deck {
>     int card[deck_size];
>     bool operator==(const Deck &other) {
>         for (int i = 0; i < deck_size; ++i) {
>             if (card[i] != other.card[i]) {
>                 return false;
>             }
>         }
>         return true;
>     }
> };

The riffle shuffle works by splitting a deck into two piles and interleaving the parts onto a new destination deck. Here's a schematic diagram with the two piles coloured orange and blue:
The function riffle_helper helps loop through all possible riffles. I could assume that each card arriving at the destination is equally likely to come from the left pile or the right pile. But I observe that whenever I do a real riffle shuffle the cards seem to come in 'runs'. So if a card falls from the left pile then the next one is more likely to as well. That's just an empirical observation based on a small number of trials, you can tweak the probabilities yourself to fit reality better. (Oh, and I got this code upside-down compared to what people really do. I need to fix it when I have a moment...)

> enum Side {
>     LEFT,
>     RIGHT,
>     NO_SIDE
> };

This function shuffles together cards from the locations given by left_ptr and right_ptr in src_deck into dest_deck, eventually calling cont on each result. I use a template because I don't know the type of the lambda expression I'm passing in. (If I want to know its type I think I have to mess with decltype. It's all a bit weird.)

> template<class Cont>
> void riffle_helper(double importance, int split,
>                    int left_ptr, int right_ptr, int dest_ptr, Side oldside,
>                    const Deck &src_deck, Deck dest_deck, Cont cont) {
>     if (dest_ptr == deck_size) {
>         cont(importance, dest_deck);
>         return;
>     }

First I deal with the cases where one or other of the piles is empty so there's no choice about where the next card is coming from:

>     if (left_ptr >= split) {
>         dest_deck.card[dest_ptr] = src_deck.card[right_ptr];
>         riffle_helper(importance, split, left_ptr, right_ptr+1, dest_ptr+1, RIGHT, src_deck, dest_deck, cont);
>         return;
>     }
>     if (right_ptr >= deck_size) {
>         dest_deck.card[dest_ptr] = src_deck.card[left_ptr];
>         riffle_helper(importance, split, left_ptr+1, right_ptr, dest_ptr+1, LEFT, src_deck, dest_deck, cont);
>         return;
>     }
>     double p;
>     if (oldside == NO_SIDE) {
>         p = 0.5;
>     } else {
>         p = LEFT == oldside ? 0.75 : 0.25;
>     }
>     double new_importance = importance*p;
>     dest_deck.card[dest_ptr] = src_deck.card[left_ptr];
>     riffle_helper(new_importance, split, left_ptr+1, right_ptr, dest_ptr+1, LEFT, src_deck, dest_deck, cont);


>     if (oldside == NO_SIDE) {
>         p = 0.5;
>     } else {
>         p = RIGHT == oldside ? 0.75 : 0.25;
>     }
>     new_importance = importance*p;
>     dest_deck.card[dest_ptr] = src_deck.card[right_ptr];
>     riffle_helper(new_importance, split, left_ptr, right_ptr+1, dest_ptr+1, RIGHT, src_deck, dest_deck, cont);
> }

The function riffle iterates over all possible riffle shuffles of src_deck calling cont on each one. Note that I assume that when the deck is split into two before shuffling together, each pile has at least 3 cards. You may want to change that assumption.

> template<class Cont>
> void riffle(double importance, const Deck &src_deck, Cont cont) {
>     double new_importance = importance/(deck_size-5);
>     for (int split = 3; split < deck_size-2; ++split) {
>         riffle_helper(new_importance, split, 0, split, 0, NO_SIDE, src_deck, Deck(), cont);
>     }
> }

Iterate over all possible cuts of src_dec calling cont on each result. I assume the cut leaves at least 3 cards in each pile.

> template<class Cont>
> void cut(double importance, const Deck &src_deck, Cont cont) {
>     double new_importance = importance/(deck_size-5);
>     for (int split = 3; split < deck_size-2; ++split) {
>         Deck new_deck;
>         for (int i = 0; i < deck_size; ++i) {
>             if (i < deck_size-split) {
>                 new_deck.card[i] = src_deck.card[i+split];
>             } else {
>                 new_deck.card[i] = src_deck.card[i-(deck_size-split)];
>             }
>         }
>         cont(new_importance, new_deck);
>     }
> }

Overhand shuffle remaining cards in src_deck to dest_deck. Here's an attempt to represent what an overhand shuffle does. It reverses the order of a deck that has been split into segments. The order within each segment is left unchanged.

> template<class Cont>
> void overhand_helper(double importance, const Deck &src_deck,
>                      int cards_left, Deck dest_deck, Cont cont) {
>     if (cards_left <= 0) {
>         cont(importance, dest_deck);
>     } else {
>         double new_importance = importance/cards_left;
>         for (int ncards = 1; ncards <= cards_left; ++ncards) {
>             //
>             // Take i cards from the source and place them at the bottom of the
>             // destination.
>             //
>             for (int j = 0; j < ncards; ++j) {
>                 dest_deck.card[cards_left-ncards+j] = src_deck.card[deck_size-cards_left+j];
>             }
>             overhand_helper(new_importance, src_deck, cards_left-ncards, dest_deck, cont);
>         }
>     }
> }

Iterate over all possible overhand shuffles of cards in src_deck calling cont on each result. In practice I often find overhand shuffles result in cards mysteriously jumping segments and messing up the algorithm, whereas poorly executed riffle shuffles still work fine. I'm also assuming that each time a pile of cards is transferred the size of the pile is chosen uniformly from the set of all possible segments at that stage.

> template<class Cont>
> void overhand(double importance, const Deck &src_deck, Cont cont) {
>     overhand_helper(importance, src_deck, deck_size, Deck(), cont);
> }

The final code doesn't bother computing the denominator from Bayes' theorem. The most likely initial state is given by the one that results in the highest score. If you normalise the scores to sum to one you'll get actual probabilities.

> int main() {

This is the array representation of the cards in the following picture:

>     Deck target = {{ 10, 11, 6, 12, 1, 13, 8, 2, 9, 3, 5, 4, 7 }};

>     Deck deck;

Our known starting sequence is just 1, 2, 3, ..., J, Q, K. We iterate over all ways to pick a card out from this sequence and place it at the top.

>     for (int k = 0; k < deck_size; ++k) {
>         deck.card[0] = k+1;
>         for (int i = 1; i < deck_size; ++i) {
>             deck.card[i] = (i > k ? i : i-1)+1;
>         }
>         double likelihood = 0.0;

Here is where I use the lambdas. For this example I'm doing an overhand shuffle followed by a riffle shuffle. (The syntax is pretty bizarre and its also weird that I kinda sorta specify the type of my lambda but that's not really what the type of the expression is. But having manually faked and lifted lambdas many times in C++ I can see why it's the way it is.) Note how I've made likelihood mutable and have given these lambda expressions write access to it.

>         overhand(1.0, deck, [&likelihood, target](double importance, Deck &deck) -> void {
>            riffle(importance, deck, [&likelihood, target](double importance, Deck &deck) -> void {
>                if (deck == target) {

We sum the probabilities for all ways of generating the target deck:

>                    likelihood += importance;
>                }
>         }); });
>         cout << "If top card = " << deck.card[0] << endl;
>         cout << "then unnormalised probability = " << likelihood << endl;
>         cout << endl;
>     }


> }

Run the above code and you get unnormalised probabilities

If top card = 4
then unnormalised probability = 5.7568e-12
If top card = 6
then unnormalised probability = 5.37301e-11
If top card = 7
then unnormalised probability = 1.791e-11

In fact, I had chosen 6. Some discussion
Don't expect it to work perfectly! It can only give probabilities but it's often surprisingly good. But there is a lot of room for improvement. Some work looking at how people actually shuffle could give a better probabilistic model.
Some exercises.
1. The code can be made orders of magnitude faster. The final shuffle is performed and then the result is compared to the target sequence. But you can start comparing cards with the target before the shuffle is finished. Most times you'll only need to look at the first card of the result of a shuffle before you know you haven't matched the target. Fixing this will give a big speed up.
2. The continuation passing style makes it easy to incorporate other sources of knowledge. For example if you 'accidentally' peek at the bottom card after the first shuffle you can incorporate that knowledge into the algorithm. Figure out how.
3. Write lots more kinds of shuffles and experiment. I'm hoping someone good with magic will come up with a sequence of operations that looks hopelessly random but allows a good probability of recovering the chosen card. You could also combine this with other techniques such as designing shuffles that maintain various kinds of invariant.
4. The code can be rewritten to work backwards from the final state to the initial states. Work out how to do this. (This parallels ray-tracing where you can work from eye to light or from light to eye.)
5. We're doing the same work over and over again. We don't need to compute all of the shuffles for each initial state. We can compute each shuffle once and reuse it on each initial state. Try implementing it.