Saturday, October 14, 2017

A tail we don't need to wag


Introduction

I've been reading a little about concentration inequalities recently. I thought it would be nice to see if you can use the key idea, if not the actual theorems, to reduce the complexity of computing the probability distribution of the outcome of stochastic simulations. Examples might include random walks, or queues.


The key idea behind concentration inequalities is that very often most of the probability is owned by a small proportion of the possible outcomes. For example, if we toss a fair coin enough (say \(n\)) times we expect the number of heads to lie within a few standard deviations of the mean about 99.99% of the time, despite there being \(n+1\) different total numbers possible. The probable outcomes tend to concentrate around the expectation. On the other hand, if we consider not the total number of heads, but the possible sequences of tosses, there are \(2^n\) possibilities, all equally likely. In this case there is no concentration. So a key ingredient here is a reduction operation: in this case reducing a sequence of tosses to a count of the number that came up heads. This is something we can use in a computer program.
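As a quick numerical check of this concentration claim (my sketch, not code from the post; the specific choices of 2048 tosses and a window of ±90, roughly four standard deviations, are mine), we can build the distribution of head-counts by repeated convolution and measure how much mass sits near the mean:

```haskell
-- Convolve the running PDF with a single fair-coin toss.
step :: [Double] -> [Double]
step ps = zipWith (\a b -> 0.5 * a + 0.5 * b) (0 : ps) (ps ++ [0])

-- PDF of the number of heads in n tosses: a list of n+1 probabilities.
binomial :: Int -> [Double]
binomial n = iterate step [1] !! n

-- Probability mass within +/- w of the mean, for 2048 tosses.
massNear :: Int -> Double
massNear w = sum (take (2 * w + 1) (drop (1024 - w) (binomial 2048)))

main :: IO ()
main = print (massNear 90)
```

Out of 2049 possible totals, the 181 nearest the mean already hold well over 99.9% of the probability.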


I (and many others) have written about the "vector space" monad that can be used to compute probability distributions of outcomes of simulations and I'll assume some familiarity with that. Essentially it is a "weighted list" monad which is similar to the list monad except that in addition to tracking all possible outcomes, it also propagates a probability along each path. Unfortunately it needs to follow through every possible path through a simulation. For example, in the case of simulating \(n\) coin tosses it needs to track \(2^n\) different possibilities, even though we're only interested in the \(n+1\) possible sums. If, after each bind operation of the monad, we could collect together all paths that give the same total then we could make this code much more efficient. The catch is that to collect together elements of a type the elements need to be comparable, for example instances of Eq or Ord. This conflicts with the type of Monad which requires that we can use the >>= :: m a -> (a -> m b) -> m b and return :: a -> m a functions with any types a and b.
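To make the problem concrete, here is a minimal version of that weighted list monad (my sketch, not the code from this post). It works, but after \(n\) coin tosses it carries \(2^n\) paths even though only \(n+1\) distinct totals ever appear:

```haskell
import Control.Monad (ap, replicateM)
import qualified Data.Map as M

-- A naive "weighted list" probability monad: every path is kept.
newtype W p a = W { runW :: [(a, p)] }

instance Functor (W p) where
  fmap f (W xs) = W [(f x, p) | (x, p) <- xs]

instance Num p => Applicative (W p) where
  pure x = W [(x, 1)]
  (<*>) = ap

instance Num p => Monad (W p) where
  W xs >>= f = W [(y, p * q) | (x, p) <- xs, (y, q) <- runW (f x)]

coin :: W Double Int
coin = W [(0, 0.5), (1, 0.5)]

tosses :: Int -> W Double Int
tosses n = fmap sum (replicateM n coin)

main :: IO ()
main = do
  let W paths = tosses 10
  print (length paths)                       -- 1024 paths tracked
  print (M.size (M.fromListWith (+) paths)) -- but only 11 distinct totals
```

Collapsing equal totals after every bind, rather than only at the end, is exactly what the code below achieves.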


I'm going to deal with this by adapting a technique presented by Oleg Kiselyov for efficiently implementing the Set monad. Instead of Set I'm going to use the Map type to represent probability distributions. These will store maps saying, for each element of a type, what the probability of that element is. So part of my code is going to be a direct translation of that code to use the Map type instead of the Set type.


> {-# LANGUAGE GADTs, FlexibleInstances #-}
> {-# LANGUAGE ViewPatterns #-}


> module Main where


> import Control.Monad
> import Control.Arrow
> import qualified Data.Map as M
> import qualified Data.List as L


The following code is very similar to Oleg's. But for first reading I should point out some differences that I want you to ignore. The type representing a probability distribution is P:


> data P p a where
>     POrd :: Ord a => p -> M.Map a p -> P p a
>     PAny :: p -> [(a, p)] -> P p a


But note how the constructors take two arguments - a number that is a probability, in addition to a weighted Map or list. For now pretend that first argument is zero and that the functions called trimXXX act similarly to the identity:


> instance (Ord p, Num p) => Functor (P p) where
>     fmap = liftM


> instance (Ord p, Num p) => Applicative (P p) where
>     pure = return
>     (<*>) = ap


> instance (Ord p, Num p) => Monad (P p) where
>     return x = PAny 0 [(x, 1)]
>     m >>= f =
>         let (e, pdf) = unP m
>         in trimAdd e $ collect $ map (f *** id) pdf


> returnP :: (Ord p, Num p, Ord a) => a -> P p a
> returnP a = POrd 0 $ M.singleton a 1


> unP :: P p a -> (p, [(a, p)])
> unP (POrd e pdf) = (e, M.toList pdf)
> unP (PAny e pdf) = (e, pdf)


> fromList :: (Num p, Ord a) => [(a, p)] -> M.Map a p
> fromList = M.fromListWith (+)


> union :: (Num p, Ord a) => M.Map a p -> M.Map a p -> M.Map a p
> union = M.unionWith (+)


> scaleList :: Num p => p -> [(a, p)] -> [(a, p)]
> scaleList weight = map (id *** (weight *))


> scaleMap :: (Num p, Ord a) => p -> M.Map a p -> M.Map a p
> scaleMap weight = fromList . scaleList weight . M.toList


This is a translation of Oleg's crucial function that allows us to take a weighted list of probability distributions and flatten them down to a single probability distribution:


> collect :: Num p => [(P p a, p)] -> P p a
> collect []  = PAny 0 []
> collect ((POrd e0 pdf0, weight) : rest) =
>     let wpdf0 = scaleMap weight pdf0
>     in case collect rest of
>       POrd e1 pdf1 -> POrd (weight*e0+e1) $ wpdf0 `union` pdf1
>       PAny e1 pdf1 -> POrd (weight*e0+e1) $ wpdf0 `union` fromList pdf1
> collect ((PAny e0 pdf0, weight) : rest) =
>     let wpdf0 = scaleList weight pdf0
>     in case collect rest of
>       POrd e1 pdf1 -> POrd (weight*e0+e1) $ fromList wpdf0 `union` pdf1
>       PAny e1 pdf1 -> PAny (weight*e0+e1) $ wpdf0 ++ pdf1


But now I really must explain what the first argument to POrd and PAny is and why I have all that "trimming".


Even though the collect function allows us to reduce the number of elements in our PDFs, we'd like to take advantage of concentration of probability to reduce the number even further. The trim function keeps only the top probabilities in a PDF, discarding the rest. To be honest, this is the only point worth taking away from what I've written here :-)


When we throw away elements of the PDF our probabilities no longer sum to 1. So I use the first argument of the constructors as a convenient place to store the amount of probability that I've thrown away. The trim function keeps the most likely outcomes and sums the probability of the remainder. I don't actually need to keep track of what has been discarded. In principle we could reconstruct this value by looking at how much the probabilities in our trimmed partial PDFs fall short of summing to 1. But confirming that our discarded probability and our partial PDF sum to 1 gives a nice safety check for our code and can give us some warning if numerical errors start creeping in. I'll call the total discarded probability the tail probability.


Here is the core function to keep the top \(n\) values. In this case \(n\) is given by a global constant called trimSize. (I'll talk about how to do this better later.)


> trimList :: (Ord p, Num p) => [(a, p)] -> (p, [(a, p)])
> trimList ps =
>     let (keep, discard) = L.splitAt trimSize (L.sortOn (negate . snd) ps)
>     in (sum (map snd discard), keep)
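As a quick standalone illustration (my example, with the trim size passed as a parameter instead of read from the global constant), trimming a three-element PDF down to its two most probable entries returns the discarded mass alongside the survivors:

```haskell
import qualified Data.List as L

-- Same logic as trimList above, but taking the trim size as an argument.
trimListN :: (Ord p, Num p) => Int -> [(a, p)] -> (p, [(a, p)])
trimListN n ps =
    let (keep, discard) = L.splitAt n (L.sortOn (negate . snd) ps)
    in (sum (map snd discard), keep)

main :: IO ()
main = print (trimListN 2 [('a', 0.5), ('b', 0.3), ('c', 0.2)])
-- prints (0.2,[('a',0.5),('b',0.3)])
```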


> trimAdd :: (Ord p, Num p) => p -> P p a -> P p a
> trimAdd e' (POrd e pdf) =
>     let (f, trimmedPdf) = trimList (M.toList pdf)
>     in POrd (e'+e+f) (M.fromList trimmedPdf)
> trimAdd e' (PAny e pdf) =
>     let (f, trimmedPdf) = trimList pdf
>     in PAny (e'+e+f) trimmedPdf


> runP :: (Num p, Ord a) => P p a -> (p, M.Map a p)
> runP (POrd e pdf) = (e, pdf)
> runP (PAny e pdf) = (e, fromList pdf)


And now some functions representing textbook probability distributions. First the uniform distribution on a finite set. Again this is very similar to Oleg's chooseOrd function apart from the fact that it assigns weights to each element:


> chooseP :: (Fractional p, Ord p, Ord a) =>
>            [a] -> P p a
> chooseP xs = let p = 1/fromIntegral (length xs)
>              in POrd 0 $ fromList $ map (flip (,) p) xs


And the Bernoulli distribution, i.e. tossing a Bool coin that comes up True with probability \(p\):


> bernoulliP :: (Fractional p, Ord p) =>
>               p -> P p Bool
> bernoulliP p = POrd 0 $ fromList $ [(False, 1-p), (True, p)]


Now we can try a random walk in one dimension. At each step we have a 50/50 chance of standing still or taking a step to the right:


> random_walk1 :: Int -> P Double Int
> random_walk1 0 = returnP 0
> random_walk1 n = do
>     a <- random_walk1 (n-1)
>     b <- chooseP [0, 1]
>     returnP $ a+b


Below in main we take 2048 steps but only track 512 probabilities. The tail probability in this case turns out to be vanishingly small. So only tracking 1/4 of the outcomes has had almost no impact on the numbers. This also illustrates why it is good to track the tail probabilities rather than inferring them from the missing probabilities in the bulk of the PDF - they can be so small they vanish compared to floating point errors. We can afford to track a lot fewer than 512 (out of 2049 possible) outcomes and still have a good representative PDF.

Now here's a two-dimensional random walk for 32 steps. The tail probability remains small, so we are getting a reasonably representative PDF. We have to run fewer steps than before, however, because the space of possible outcomes spans two dimensions, meaning that reduction doesn't help as much as it does in one dimension.


> random_walk2 :: Int -> (Int, Int) -> P Double (Int, Int)
> random_walk2 0 (x, y) = returnP (x, y)
> random_walk2 n (x, y) = do
>     (x',y') <- random_walk2 (n-1) (x, y)
>     dx <- chooseP [-1, 1]
>     dy <- chooseP [-1, 1]
>     returnP (x'+dx, y'+dy)


One last simulation. This is a queueing scenario. Tasks come in once every tick of the clock. There are four queues a task can be assigned to. A task is assigned to the shortest queue. Meanwhile each queue has a 1/4 probability of clearing one item at each tick of the clock. We build the PDF for the maximum length any queue has at any time.

The first argument to queue is the number of ticks of the clock. The second argument is the list of lengths of the queues. It returns a PDF, not just on the current queue size, but also on the longest queue it has seen.


> queue :: Int -> [Int] -> P Double (Int, [Int])
> queue 0 ls = returnP (maximum ls, ls)
> queue n ls = do
>     (longest, ls1) <- queue (n-1) ls
>     ls2 <- forM ls1 $ \l -> do
>         served <- bernoulliP (1/4)
>         returnP $ if served && l > 0 then l-1 else l
>     let ls3 = L.sort $ head ls2+1 : tail ls2
>     returnP (longest `max` maximum ls3, ls3)


For the queueing simulation the tail probability remains tiny despite the fact that we have discarded a vast set of possible outcomes.

It's a little ugly that trimSize is a global constant:


> trimSize = 512


The correct solution is probably to separate the probability "syntax" from its "semantics". In other words, we should implement a free monad supporting the language of probability with suitable constructors for bernoulliP and chooseP. We can then write a separate interpreter which takes a trimSize as argument. This has another advantage too: the Monad above isn't a true monad. It uses a greedy approach to discarding probabilities, and different rearrangements of the code, that ought to give identical results, may end up different. By using a free monad we ensure that our interface is a true monad and we can put the part of the code that breaks the monad laws into the interpreter. The catch is that my first attempt at writing a free monad resulted in code with poor performance. So I'll leave an efficient version as an exercise :-)


> main = do
>     print $ runP $ random_walk1 2048
>     print $ runP $ random_walk2 32 (0, 0)
>     print $ runP $ do
>         (r, _) <- queue 128 [0, 0, 0, 0]
>         returnP r

Friday, August 11, 2017

What is a photon?


Introduction

Popular science writing about quantum mechanics leaves many people full of questions about the status of photons. I want to answer some of these without using any tricky mathematics.


One of the challenges is that photons are very different to ordinary everyday objects like billiard balls. This is partly because photons are described by quantum mechanics whereas billiard balls are better modelled with classical Newtonian mechanics. Quantum mechanics defies many of our intuitions. But it's also because the word photon plays by different linguistic rules to billiard ball. I hope to explain why.


One of my goals is to avoid saying anything original. I'm largely going to remove the mathematics from material I first learnt from three or so courses I took at Cambridge University many years ago: Quantum Mechanics, Solid State Physics and Quantum Field Theory. I also learnt about some of this from David Miller at Stanford University who talked a little about what properties it is meaningful to apply to a photon. (I hope I haven't misrepresented him too badly.)



The simple harmonic oscillator


Here's a mass hanging on a spring:



Suppose it's initially sitting in equilibrium so that the net force acting on it is zero. Now we lift the mass a small distance and let it go. Because we lifted it, we shortened the spring, reducing its tension. This means the force due to gravity is now more than the spring tension and the mass falls. Eventually it falls below the equilibrium point, increasing the tension in the spring so there is a net force pulling it back up again. To a good approximation, the force restoring the mass to its equilibrium point is proportional to how far it has been displaced. When this happens we end up with oscillating motion where the mass bounces up and down. Here's what a graph of its displacement looks like over time:




It's actually a sine wave but that detail doesn't matter for us right now.


An oscillator where the restoring force is proportional to the displacement from the equilibrium point is called a simple harmonic oscillator and its oscillation is always described by a sine wave.


Note that I'm ignoring friction here. This is a reasonable approximation for many physical systems.


Masses on springs aren't all that important in themselves. But simple harmonic oscillators are very common. Another standard example is the pendulum swinging under the influence of gravity:




At a more fundamental level, an example might be an atom in a crystal being held in place by electrostatic forces from its neighbouring atoms.


If you have one of these systems, then in principle you can set it in motion with as little energy as you like. Pull a mass on a spring down a little bit and it will bounce back up, oscillating a certain amount. Pull the mass down half the amount and it'll bounce with oscillations half the size. In principle we could keep repeating this experiment, each time starting with the mass displaced half the amount we tried previously. In other words, a simple harmonic oscillator can have any energy we like. The spectrum of possible energies of one of these oscillators is continuous. (Note that the word spectrum here is merely physicist-speak for a set of possible values.) If we can set one in motion with 1 unit of energy then we can also set it oscillating with 0.5 units, or 0.01 units, or 0.000123 units of energy.



Quantum mechanics


Everything I've said above is assuming that classical Newtonian mechanics is valid. But we know that for very small systems, around the size of a few atoms or smaller, we need to use quantum mechanics. This is an enormous topic but I'm only going to extract one basic fact. According to quantum mechanics, a simple harmonic oscillator isn't free to oscillate with any energy you like. The possible energy levels, the spectrum of the system, is discrete. There is a lowest energy level, and then all of the energy levels above that are equally spaced like so, going up forever:




We usually call the lowest energy level the ground state or vacuum state and call the higher levels excited states.


The spacing of the energy levels depends on the stiffness of the system, which is just a measure of how much the restoring force increases with displacement from equilibrium. Stiffer systems will have a higher frequency of oscillation and a bigger spacing between the energy levels.
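In symbols (the standard quantum harmonic oscillator result, quoted here rather than derived): an oscillator with mass \(m\) and stiffness \(k\) has angular frequency \(\omega\) and energy levels

```latex
\omega = \sqrt{\frac{k}{m}}, \qquad
E_n = \hbar\omega\left(n + \tfrac{1}{2}\right), \quad n = 0, 1, 2, \ldots
```

so neighbouring levels are separated by the fixed gap \(\Delta E = \hbar\omega\): stiffer systems oscillate faster and have wider spacing, exactly as described above.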


(I'm deliberately not saying anything about why we get discrete energy levels in quantum mechanics. I just want to use this one fact so I can get on and talk about photons eventually.)


In practice the difference in energy between one level and the next is tiny. This means that if you're literally fiddling about with a mass on a spring you won't ever feel the discreteness. The amount your hand trembles is many orders of magnitude greater than the effect of this discreteness. Nonetheless, it is extremely important when modeling microscopic systems.



Quantum linguistics


Here are some English sentences we could say about the kinds of systems I've described so far:


  1. This system is in the ground state.
  2. That system is in its first excited state.
  3. This system is at an energy level higher than that system.
  4. After allowing these two similar oscillators to interact, the energy level of this oscillator went down and the energy level of that one went up by the same amount.


Now I want to introduce the (count) noun quantum, with plural quanta. The idea here is not that I'm telling you about a new entity. I want to present this as a new way to talk about things I've already introduced. So rather than give a definition of quantum I will instead show how you can rewrite the above sentences using the language of quanta:


  1. There are no quanta in this system.
  2. That system has one quantum of energy.
  3. This system has more quanta than that one.
  4. Some quanta were transferred from this system to that system.


Those sentences make it seem like I'm talking about a new kind of object - the quantum. But I'm not. They're just a manner of speaking about energy levels. I hope I've given you enough examples to get the idea.


Just in case you think it's weird to talk about energy levels in terms of quanta, I'd like to remind you that you already do this all the time with money. Dollar bills are actual objects that exist in the world. But money in your bank account isn't. Somewhere in some database is a representation of how much money you have. You might say "I have one hundred dollars in my savings account." But those dollars certainly don't exist as distinct entities. It doesn't really make sense to talk about the thirty-seventh dollar in your bank account. You can transfer dollars from one account to another, and yet what's really happening is that two totals are being adjusted. We treat these accounts a lot like they're containers holding individual objects called dollars. Certainly our language is set up like that. But we know that it's really just the totals that have any kind of representation. The same goes for quanta. It's just a manner of speaking about systems that can have different amounts of energy and where the spectrum of energy levels forms a ladder with equally spaced rungs. Because of your experience with money I probably don't need to give you any more examples.


One more bit of terminology: when the spectrum of energies is discrete it's said to be quantised.



Coupled systems


Let's return to classical physics with a slightly more complex system consisting of two masses connected to springs. We ignore gravity now:




We restrict ourselves to just considering back and forth motion constrained along a horizontal line. This is a coupled system. If the left mass moves to the right, not only does it experience a restoring force pushing it left, but the mass on the right will experience more of a force pushing it to the left. We can't treat the masses as independent and so we don't get the simple solution of each mass always oscillating with a sine wave.


For this particular problem though there's a trick to turn it into a pair of harmonic oscillators. The idea is to consider the pair of masses as a single entity. We can think of the centre of mass of the pair, the midpoint between them, as being one variable that describes this entity. Let's call its motion the external motion. We can also think of the distance between the two masses in the pair as being the system's internal motion. (I'm just using internal and external as convenient names. Don't read too much into them.) It turns out that when you analyse this using classical dynamics the internal motion and the external motion act like independent quantities. What's more, each one behaves exactly like it's simple harmonic. So we get one sine wave describing the overall motion of the pair, and another one that describes how the elements of the pair oscillate with respect to each other.


The frequencies of the internal and external motions are typically different. So you can end up with some quite complicated motions with two different frequencies beating against each other.
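For a concrete version of this trick (my notation, not from the post), suppose each mass \(m\) is anchored to a wall by a spring of stiffness \(k\) and the two masses are joined by a spring of stiffness \(\kappa\). Newton's second law gives

```latex
m\ddot{x}_1 = -k x_1 + \kappa (x_2 - x_1), \qquad
m\ddot{x}_2 = -k x_2 - \kappa (x_2 - x_1)
% adding and subtracting the two equations decouples them:
m\frac{d^2}{dt^2}(x_1 + x_2) = -k\,(x_1 + x_2), \qquad
m\frac{d^2}{dt^2}(x_1 - x_2) = -(k + 2\kappa)\,(x_1 - x_2)
```

so the external motion \((x_1+x_2)/2\) is simple harmonic with frequency \(\sqrt{k/m}\), while the internal motion \(x_1-x_2\) is simple harmonic with the higher frequency \(\sqrt{(k+2\kappa)/m}\).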


When we're able to find ways to split up the motion into independent quantities, each of which is simple harmonic, each kind of motion is said to be a normal mode.


When you have independent normal modes, you can treat them independently in quantum mechanics too. So what we get is that the spectrum of possible energy levels for this system is, in some sense, two-dimensional. We can put quanta into the internal oscillation and we can also put quanta into the external oscillation. Because these modes have different frequencies the quanta for each mode correspond to different amounts of energy.


(And a reminder: when talking about quantum mechanics I'm not talking literally about masses on springs. I'm talking about physical systems that have equations of motion that mean they behave like masses on springs. In this case it might be a pair of particles trapped in a microscopic well with a repulsive force between them.)



Solid state physics

Now I'm going to jump from just two masses to a large number of them. For example, the behavior of trillions of atoms in a solid crystal can be approximately modelled by a grid of masses and springs, of which the following diagram is just a tiny piece:




A real crystal would be arranged in a 3D lattice but I've drawn 2D here for convenience.


Think of the springs as both pushing apart atoms that get close, and pulling together atoms that move apart.


This is a highly coupled system. Ultimately every atom in our lattice is connected to every other one, either directly, or indirectly. Nonetheless, it is still possible to find normal modes. The normal modes all have the same basic form: they are all sinusoidal waves of displacement traveling in some direction with some speed and oscillation frequency. Each of these modes consists of waves that extend through the entire crystal, with fixed spacing between parallel planar wavefronts. This type of wave is known as a plane wave. If the system is perfectly harmonic, so the restoring force is precisely proportional to the displacement, then each direction and frequency of wave oscillates its way through the crystal completely independently of any other. Just as in the example with two masses any possible oscillation is a combination of internal and external motion, for a crystal lattice any motion is a combination of these plane waves. (Decomposing any oscillation as a combination of plane waves is known as computing its Fourier transform.)


Now we're ready to consider this situation quantum mechanically. Because each plane wave is a normal mode, we can treat each one as an independent simple harmonic oscillator. This means that the energy in each plane wave is quantised. So when we consider a crystal lattice quantum mechanically we find that its states consist of plane waves propagating through it, but where the amount of energy in each wave is given by a discrete spectrum. So again we can talk about how many quanta there are in each mode.


Linguistically it gets a bit more interesting now. Each plane wave is associated with a particular direction and speed so it makes sense to talk of these quanta as having a direction and speed. But note that statements involving quanta are still really just sentences about energy levels. So, for example, the statement "the mode of this system with this velocity and frequency is in its first excited state" is, by definition, exactly the same as "this system has precisely one quantum with this velocity and frequency". In particular, when we write sentences like these we aren't implying that there is some new kind of object, the quantum, that has suddenly attached itself to our crystal. The quanta are properties of the lattice. By the way, in the particular case of vibrating atoms in a lattice, the quanta are known by a special name: phonons.



Quantum field theory and photons

And now we're ready to move onto photons.


In classical physics, electromagnetism is described by Maxwell's equations. Maxwell's equations say that a varying magnetic field generates an electric field and a varying electric field generates a magnetic field. The result is that it is possible for an oscillating electric field to create an oscillating magnetic field, which in turn recreates the electric field, so that the fields can propagate through space on their own without the help of electric charges or electric currents or any other kind of `generator'. Because the oscillating electric and magnetic fields travel together, the whole thing is called an electromagnetic wave.


Just like displacements in a crystal lattice, an electromagnetic wave also has normal modes. The normal modes are plane waves, each traveling at the speed of light in a particular direction with a given frequency. You have personal experience of this. Visible light is electromagnetic radiation with a frequency of around 500 THz. Wifi uses signals at around 5 GHz. The radio might use signals at around 100 MHz. When you surf the web wirelessly while listening to the radio, the wifi signals don't interfere with your vision or the radio signal. (Actually, wifi might interfere with the radio signals, but not because of the 5 GHz signals. It might happen if badly manufactured hardware emits stray signals around the 100 MHz band.) That's because these waves pass through each other without being coupled to each other in any way. And at this point you might already be guessing what a photon is. For each choice of frequency and direction (and also polarisation, but that's just a detail) the amount of energy that can be in the corresponding mode is quantised. For the electromagnetic field the quanta are called photons.


And that's it!


Electromagnetic waves can be thought of as being made up of different oscillation modes. Because of quantum mechanics, each mode contains an amount of energy that is quantised to be a whole number multiple of some base amount. Although the thing that really matters is the total amount of energy in the modes, it can still be useful to talk about this total as if it's a collection of entities called photons.


One thing to notice is that the normal modes for an electromagnetic wave are plane waves that are extended in space. In principle they extend all the way across the universe, though for practical problems physicists often consider electromagnetic waves in a large but finite box. This means that adding a quantum to a system has an effect that extends across the entire system. That makes it problematic to talk about the location of a photon.



Caveat

Physicists sometimes use the word photon in slightly different but related ways. I've described what I think of as the core definition as presented in many courses on quantum field theory.



Acknowledgements

Thanks to @dmoore2718 for encouraging me to edit this document down to a better size.

Saturday, July 15, 2017

Self-referential logic via self-referential circuits


Introduction


TL;DR The behaviour of a certain kind of delay component has a formal similarity to Löb's theorem which gives a way to embed part of provability logic into electronic circuits.


Here's a famous paradoxical sentence:


This sentence is false


If it's false then it's true and if it's true then it's false.


Here's a paradoxical electronic circuit:


The component in the middle is an inverter. If the output of the circuit is high then its input is high and then its output must be low, and vice versa.


There's a similarity here. But with a bit of tweaking you can turn the similarity into an isomorphism of sorts.


In the first case we avoid paradox by noting that in the mathematical frameworks commonly used by mathematicians it's impossible, in general, for a statement to assert its own falsity. Instead, a statement can assert its own unprovability and then we get Gödel's incompleteness theorems and a statement that is apparently true and yet can't be proved.


In the second case we can't model the circuit straightforwardly as a digital circuit. In practice it might settle down to a voltage that lies between the official high and low voltages so we have to model it as an analogue circuit. Or instead we can introduce a clock and arrange that the feedback in the circuit is delayed. We then get an oscillator circuit that can be thought of as outputting a stream of bits.


The observation I want to make is that if the feedback delay is defined appropriately, these two scenarios are in some sense isomorphic. This means that we can model classic results about provability, like Gödel's incompleteness theorems, using electronic circuits. We can even use such circuits to investigate what happens when logicians or robots play games like Prisoner's Dilemma. I'll be making use of results found in Boolos' book on The Logic of Provability and some ideas I borrowed from Smoryński's paper on Fixed Point Algebras. I'll be assuming the reader has at least a slight acquaintance with the ideas behind provability logic.



Provability Logic

There are many descriptions of provability logic (aka GL) available online, so I'm not going to repeat it all here. However, I've put some background material in the appendix below and I'm going to give a very brief reminder now.


Start with (classical) propositional calculus which has a bunch of variables with names like \(a, b, c, d, \ldots\) and connectives like \(\wedge\) for AND, \(\vee\) for OR, \(\neg\) for NOT and \(\rightarrow\) for implication. (Note that \(a\rightarrow b = \neg a\vee b\).)


Provability logic extends propositional calculus by adding a unary operator \(\Box\). (I apologise, that's meant to be a □ but it's coming out like \(\Box\) in LaTeX formulae. I think it's a bug in Google's LaTeX renderer.) The idea is that \(\Box p\) asserts that \(p\) is provable in Peano Arithmetic, aka PA. In addition to the axioms of propositional calculus we have

\(\Box(p\rightarrow q)\rightarrow\Box p\rightarrow\Box q\)
and
\(\Box p\rightarrow\Box\Box p\)
as well as a rule that allows us to deduce \(\Box p\) from \(p\).


We also have this fixed point property:


Let \(F(p)\) be any predicate we can write in the language of GL involving the variable \(p\), and suppose that every appearance of \(p\) in \(F(p)\) is inside a \(\Box\), e.g. \(F(p)=\Box p\vee\Box(\neg p)\). Then there is a fixed point, i.e. a proposition \(q\) that makes no mention of \(p\) such that \(q\leftrightarrow F(q)\) is a theorem. In effect, for any such \(F\), \(q\) is a proposition that asserts \(F(q)\).


See the appendix for a brief mention of why we should expect this to be true.


From the fixed point property we can deduce Löb's theorem: \(\Box(\Box p\rightarrow p)\rightarrow\Box p\). There is a proof at wikipedia that starts from the fixed point property.


We can also deduce the fixed point property from Löb's theorem so it's more usual to take Löb's theorem as an axiom of GL and show that the fixed point property follows. You can think of Löb's theorem as a cunning way to encode the fixed point property. In fact you can argue that it's a sort of Y-combinator, the function that allows the formation of recursive fixed points in functional programming languages. (That's also, sort of, the role played by the loeb function I defined way back. But note that loeb isn't really a proof of Löb's theorem, it just has formal similarities.)



Back to electronic circuits

In order to make digital circuits with feedback loops well-behaved I could introduce a circuit element that results in a delay of one clock cycle. If you insert one of these into the inverter circuit I started with you'll end up with an oscillator that flips back and forth between 0 and 1 on each clock cycle. But I want to work with something slightly stricter. I'd like my circuits to eventually stop oscillating. (I have an ulterior motive for studying these.) Let me introduce this component:


It is intended to serve as a delayed latch and I'll always have the flow of data being from left to right. The idea is that when it is switched on it outputs 1. It keeps outputting 1 until it sees a 0 input. When that happens, then on the next clock cycle its output drops to 0 and never goes back up to 1 until reset.


Because the output of our delay-latch isn't a function of its current input, we can't simply describe its operation as a mathematical function from \(\{0,1\}\) to \(\{0,1\}\). Instead let's think of electronic components as binary operators on bitstreams, i.e. infinite streams of binary digits like ...00111010 with the digits emerging over time starting with the one written on the right and working leftwards. The ordinary logic gates perform bitwise operations which I'll represent using the operators in the C programming language. For example,

...001110 & ...101010 = ...001010
and
~...101 = ...010
and so on. Let's use □ to represent the effect of latch-delay on a bitstream. We have, for example,
□...000 = ...001
and
□...11101111 = ...00011111.
The operator □ takes the (possibly empty) contiguous sequence of 1's at the end of the bitstream, extends it by one 1, and sets everything further to the left to 0. If we restrict ourselves to bitstreams that eventually become all 0's or all 1's on the left, then bitstreams are in one-to-one correspondence with the integers using the two's complement representation. For example ...111111, all 1's, represents the number -1. I'll simply call the bitstreams that represent integers integers. With this restriction we can use a classic C hacker trick to write □p=p^(p+1) where ^ is the C XOR operator. The operator □ outputs the bits that get flipped when you add one.
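Here's a quick sketch of my own (not from the original article) checking the p^(p+1) trick in Haskell, using fixed-width Ints as stand-ins for integer bitstreams:

```haskell
import Data.Bits (xor)

-- Delay-latch as the classic bit trick: the XOR of p and p+1 is exactly
-- the set of bits that flip when you add one, i.e. the trailing run of
-- 1's extended by one more 1, with everything to the left zeroed.
box :: Int -> Int
box p = p `xor` (p + 1)

main :: IO ()
main = do
  print (box 0)     -- 1:  box ...000 = ...001
  print (box (-17)) -- 31: box ...11101111 = ...00011111
  print (box (-1))  -- -1: box ...111 = ...111
```

The two printed examples match the ones in the text: -17 is ...11101111 in two's complement and 31 is ...00011111.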


Let's use the symbol → so that a → b is shorthand for ~a|b. Here are some properties of □:


1. □(-1) = -1


2. □p → □□p = -1


3. □(p → q) → □p → □q = -1


In addition we have the fixed point property:


Let F(p) be any function of p we can write using □ and the bitwise logical operators, such that all occurrences of p occur inside □. Then there is a unique bitstream q such that q=F(q).


We can make this clearer if we return to circuits. F(p) can be thought of as a circuit that takes p as input and outputs some value. We build the circuit using only boolean logic gates and delay-latch. We allow feedback loops, but only ones that go through delay-latches. With these restrictions it's pretty clear that the circuit is well-behaved and deterministically outputs a bitstream.


We also have the Löb property:


4. □(□p → p) → □p = -1


We can see this by examining the definition of □. Intuitively it says something like "once □ has seen a 0 input then no amount of setting input bits to 1 later in the stream makes any difference to its output".
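Since □ acts on integers as p^(p+1), properties 1-4 can be spot-checked mechanically. A sketch of mine in Haskell, again using fixed-width Ints as stand-ins for eventually-constant bitstreams, with --> as my name for the ~a|b shorthand:

```haskell
import Data.Bits (complement, xor, (.|.))

-- Delay-latch on two's complement integers
box :: Int -> Int
box p = p `xor` (p + 1)

-- a --> b is shorthand for ~a | b
infixr 4 -->
(-->) :: Int -> Int -> Int
a --> b = complement a .|. b

-- Property 1: box (-1) == -1
prop1 :: Bool
prop1 = box (-1) == -1

-- Properties 2-4: each should evaluate to -1 (the all-1's bitstream)
props :: Int -> Int -> Bool
props p q = (box p --> box (box p)) == -1                -- property 2
         && (box (p --> q) --> box p --> box q) == -1    -- property 3
         && (box (box p --> p) --> box p) == -1          -- property 4 (Löb)

main :: IO ()
main = print (prop1 && and [props p q | p <- [-8..8], q <- [-8..8]])  -- prints True
```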


I hope you've noticed something curious. These properties are extremely close to the properties of \(\Box\) in GL. In fact, these electronic circuits form a model of the part of GL that doesn't involve variable names, i.e. what's known as letterless GL. We can formalise this:


1. Map \(\bot\) to a wire set to 0, which outputs ...000 = 0.


2. Map \(\top\) to a wire set to 1, which outputs ...111 = -1.


3. Map \(p \circ q\), where \(\circ\) is a binary connective, by creating a circuit that takes the outputs from the circuits for \(p\) and \(q\) and passes them into the corresponding boolean logic gate.


4. Map \(\Box p\) to the circuit for \(p\) piped through a delay-latch.


For example, let's convert \(\Box(\Box\bot\rightarrow\bot)\rightarrow\Box\bot\) into a circuit. I'm translating \(a\rightarrow b\) to the circuit for \(\neg a\vee b\).


I'm using red wires to mean wires carrying the value 1 rather than 0. I hope you can see that this circuit eventually settles into a state that outputs nothing but 1s.


We have this neat result:

Because delay-latch satisfies the same equations as \(\Box\) in provability logic, any theorem, translated into a circuit, will produce a bitstream of just 1s, i.e. -1.


But here's a more surprising result: the converse is true.

If the circuit corresponding to a letterless GL proposition produces a bitstream of just 1s then the proposition is actually a theorem of GL.
I'm not going to prove this. (It's actually a disguised form of lemma 7.4 on p.95 of Boolos' book.) In the pictured example we got ...1111, so the circuit represents a theorem. As it represents Löb's theorem for the special case \(p=\bot\) we should hope so. More generally, any bitstream that represents an integer can be converted back into a proposition that is equivalent to the original proposition. This means that bitstreams faithfully represent propositions of letterless GL. I'm not going to give the translation here but it's effectively given in Chapter 7 of Boolos. I'll use \(\psi(p)\) to represent the translation from propositions to bitstreams via circuits that I described above. Use \(\phi(b)\) to represent the translation of bitstream \(b\) back into propositions. We have \(p\leftrightarrow\phi(\psi(p))\). But I haven't given a full description of \(\phi\) and I haven't proved here that it has this property.



Circuits with feedback


In the previous section I considered letterless propositions of GL. When these are translated into circuits they don't have feedback loops. But we can also "solve equations" in GL using circuits with feedback. The GL fixed point theorem above says that we can "solve" the equation \(p\leftrightarrow F(p)\), with one letter \(p\), to produce a letterless proposition \(q\) such that \(q\leftrightarrow F(q)\). Note here that \(p\) is a letter in the language of GL. But I'm using \(q\) to represent a proposition in letterless GL. If we build a circuit to represent \(F\), and feed its output back into where \(p\) appears, then the output bitstream represents the fixed point. Here's a translation of the equation \(p \leftrightarrow \neg(\Box p \vee \Box\Box\Box p)\):

I'll let you try to convince yourself that such circuits always eventually output all 0's or all 1's. When we run the circuit we get the output ...1111000 = -8. As this is not -1 we know that the fixed point isn't a theorem. If I'd defined \(\phi\) above you could use it to turn the bitstream back into a proposition.
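To see this numerically we can sketch the circuit in Haskell (my code, using Ints as stand-ins for integer bitstreams): represent each latch by □p = p^(p+1) and iterate the whole circuit, feeding its output back in, until it stabilises. Because every occurrence of p sits behind a latch, the iteration settles:

```haskell
import Data.Bits (complement, xor, (.|.))

box :: Int -> Int
box p = p `xor` (p + 1)

-- The circuit for p <-> not (box p \/ box (box (box p)))
f :: Int -> Int
f p = complement (box p .|. box (box (box p)))

-- Feed the output back in until it stops changing
fixpoint :: (Int -> Int) -> Int -> Int
fixpoint g p = let p' = g p in if p' == p then p else fixpoint g p'

main :: IO ()
main = print (fixpoint f 0)  -- prints -8, i.e. ...1111000: not a theorem
```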



The same, syntactically (optional section)

I have a Haskell library on github for working with GL: provability. This uses a syntactic approach and checks propositions for theoremhood using a tableau method. We can use it to analyse the above example with feedback. I have implemented a function, currently called value', to perform the evaluation of the bitstream for a proposition. However, in this case the fixedpoint function computes the fixed point proposition first and then converts to a bitstream rather than computing the bitstream directly from the circuit for F:


> let f p = Neg (Box p \/ Box (Box (Box p)))
> let Just p = fixedpoint f
> p
Dia T /\ Dia (Dia T /\ Dia (Dia T /\ Dia T))
> value' p
-8


(Note that Dia p means \(\Diamond p = \neg\Box\neg p\).)


The function fixedpoint does a lot of work under the hood. (It uses a tableau method to carry out Craig interpolation.) The circuit approach requires far less work.



Applications

1. Programs that reason about themselves


In principle we can write a program that enumerates all theorems of PA. That means we can use a quine trick to write a computer program that searches for a proof, in PA, of its own termination. Does such a program terminate?


We can answer this with Löb's theorem. Let \(p =\) "The program terminates". The program terminates if it can prove its termination. Formally this means we assume \(\Box p\rightarrow p\). Using one of the derivation rules of GL we get \(\Box(\Box p\rightarrow p)\). Löb's theorem now gives us \(\Box p\). Feed that back into our original hypothesis and we get \(p\). In other words, we deduce that our program does in fact terminate. (Thanks to Sridhar Ramesh for pointing this out to me.)


But we can deduce this using a circuit. We want a solution to \(p\leftrightarrow \Box p\). Here's the corresponding circuit:

It starts by outputting 1's and doesn't stop. In other words, the fixed point is a theorem. And that tells us \(p\) is a theorem. And hence that the program terminates.
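In the integer-bitstream sketch from earlier (my code, with □p = p^(p+1) on Ints) this is a one-liner: -1, all 1's, solves p ↔ □p, matching the claim that the fixed point is a theorem:

```haskell
import Data.Bits (xor)

box :: Int -> Int
box p = p `xor` (p + 1)

main :: IO ()
main = print (box (-1) == -1)  -- prints True: ...111 solves p <-> box p
```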


2. Robots who reason about each others play in Prisoner's Dilemma


For the background to this problem see Robust Cooperation in the Prisoner's Dilemma at LessWrong. We have two robot participants \(A\) and \(B\) playing Prisoner's Dilemma. Each can examine the other's source code and can search for proofs that the opponent will cooperate. Suppose each robot is programmed to enumerate all proofs of PA and cooperate if it finds a proof that its opponent will cooperate. Here we have \(p =\) "A will cooperate" and \(q =\) "B will cooperate". Our assumptions about the behaviour of the robots are \(p \leftrightarrow \Box q\) and \(q \leftrightarrow \Box p\), and hence that \(p \leftrightarrow \Box\Box p\). This corresponds to the circuit:

This outputs ...1111 = -1 so we can conclude \(p\) and hence that these programs will cooperate. (Note that this doesn't work out nicely if robot B has a program that doesn't terminate but whose termination isn't provable in the formal system A is using. That means this approach is only good for robots that want to cooperate and want to confirm such cooperation. See the paper for more on this.)


At this point I really must emphasise that these applications are deceptively simple. I've shown how these simple circuits can answer some tricky problems about provability. But these aren't simply the usual translations from boolean algebra to logic gates. They work because circuits with delay-latch provide a model for letterless provability logic and that's only the case because of a lot of non-trivial theorem proving in Boolos that I haven't reproduced here. You're only allowed to use these simple circuits once you've seen the real proofs :-)



Things I didn't say above

1. I described the translation from propositions to circuits that I called \(\psi\) above. But I didn't tell you what \(\phi\) looks like. I'll leave this as an exercise. (Hint: consider the output from the translation of \(\Box^n\bot\) into a circuit.)


2. The integers, considered as bitstreams, with the bitwise operators, and the unary operator □p=p^(p+1), form an algebraic structure. For example, if we define ⋄p=~□~p we have a Magari algebra. Structures like these are intended to capture the essential parts of self-referential arguments in an algebraic way.


3. Because of the interpretation of □ as a delayed latch in a circuit you could view it as saying "my input was always true until a moment ago". This surely embeds provability logic in a temporal logic of some sort.


4. (Deleted speculations about tit-for-tat that need rethinking.)


5. For even the most complex letterless proposition in Boolos you could check its theoremhood with a pretty small circuit. You could even consider doing this with a steam powered pneumatic circuit. I had to say that to fulfil a prophecy and maintain the integrity of the timeline.



Appendix on provability

The modern notion of a proof is that it is a string of symbols generated from some initial strings called "axioms" and some derivation rules that make new strings from both axioms and strings you've derived previously. Usually we pick axioms that represent "self-evident" truths and we pick derivation rules that are "truth-preserving" so that every proof ends at a true proposition of which it is a proof. The derivation rules are mechanical in nature: things like "if you have this symbol here and that symbol there then you can replace this symbol with that string you derived earlier" etc.


You can represent strings of symbols using numbers, so-called Gödel numbers. Let's pick a minimal mathematical framework for working with numbers: Peano Arithmetic, aka PA. Let's assume we've made some choice of Gödel numbering scheme and when \(p\) is a proposition, write \([p]\) for the number representing \(p\). You can represent the mechanical derivation rules as operations on numbers. And that makes it possible to define a mathematical predicate \(Prov\) that is true if and only if its argument represents a provable proposition.


In other words, we can prove \(Prov([p])\) using PA if and only if \(p\) is a proposition provable in PA.


The predicate \(Prov\) has some useful properties:


1. If we can prove \(p\), then we can prove \(Prov([p])\).


We take the steps we used to prove \(p\), and convert everything to propositions about numbers. If \(Prov\) is defined correctly then we can convert that sequence of numbers into a sequence of propositions about those numbers that makes up a proof of \(Prov([p])\).


2. \(Prov([p\rightarrow q])\) and \(Prov([p])\) imply \(Prov([q])\)


A fundamental step in any proof is modus ponens, i.e. that \(p\rightarrow q\) and \(p\) imply \(q\). If \(Prov\) does its job correctly then it had better know about this.


3. \(Prov([p])\) implies \(Prov([Prov([p])])\)


One way to prove this is to use Löb's theorem.


4. \(Prov([\top])\)


The trivially true statement had better be provable or \(Prov\) is broken.


Constructing \(Prov\) is conceptually straightforward but hard work. I'm definitely not going to do it here.


And there's one last thing we need: self-reference. If \(p\) is a proposition, how can we possibly assert \(Prov([p])\) without squeezing a copy of \([p]\) inside \(p\)? I'm not going to do that here either - just mention that we can use a variation of quining to achieve this. That allows us to form a proposition \(p\) for which we can prove \(p\leftrightarrow Prov([p])\). In fact, we can go further. We can find propositions that solve \(p\leftrightarrow F(p)\) for any predicate \(F(p)\) built from the usual boolean operations and \(p\) as long as all of the occurrences of \(p\) are inside the appearances of \(Prov\). Even though we can't form a proposition that directly asserts its own falsity, we can form one that asserts that it is unprovable, or one that asserts that you can't prove that you can't prove that you can prove it, or anything along those lines.


Anyway, all that \([]\) and \(Prov\) business is a lot of hassle. Provability logic, also known as GL, is intended to capture specifically the parts of PA that relate to provability. GL is propositional calculus extended with the provability operator \(\Box\). The intention is that if \(p\) is a proposition, \(\Box p\) is a proposition in GL that represents \(Prov([p])\) in PA. The properties of \(Prov\) above become the axioms and derivation rules of GL in the main text.

Tuesday, June 06, 2017

A relaxation technique


Introduction

Sometimes you want to differentiate the expected value of something. I've written about some tools that can help with this. For example you can use Automatic Differentiation for the derivative part and probability monads for the expectation. But the probability monad I described in that article computes the complete probability distribution for your problem. Frequently this is intractably large. Instead people often use Monte Carlo methods. They'll compute the "something" many times, substituting pseudo-random numbers for the random variables, and then average the results. This provides an estimate of the expected value and is ubiquitous in many branches of computer science. For example it's the basis of ray-tracing and path-tracing algorithms in 3D rendering, and plays a major role in machine learning when used in the form of stochastic gradient descent.


But there's a catch. Suppose we want to compute \(E[X_1+\ldots+X_n]\) where each of the \(X_i\) belongs to the Bernoulli distribution \(B(p)\). I.e. each \(X_i\) has a probability \(p\) of being 1 and probability \(1-p\) of being 0. If we compute this using a Monte Carlo approach we'll repeatedly generate pseudo-random numbers for each of the \(X_i\). Each one will be 0 or 1. This means that our estimate depends on \(p\) via subexpressions that can't meaningfully be differentiated with respect to \(p\). So how can we use automatic differentiation with the Monte Carlo method? I'm proposing an approach that may or may not already be in the literature. Whether it is or not, I think it's fun to get there by combining many of the things I've previously talked about here, such as free monads, negative probabilities and automatic differentiation. I'm going to assume you're familiar with using dual numbers to compute derivatives as I've written about this before and wikipedia has the basics.



A probability monad


I want to play with a number of different approaches to using monads with probability theory. Rather than define lots of monads I think that the easiest thing is to simply work with one free monad and then provide different interpreters for it.


First some imports:


> import Control.Monad
> import qualified System.Random as R
> import qualified Data.Map.Strict as M


I'm going to use a minimal free monad that effectively gives us a DSL with a new function that allows us to talk about random Bernoulli variables:


> data Random p a = Pure a | Bernoulli p (Int -> Random p a)


The idea is that Pure a represents the value a and Bernoulli p f is used to say "if we had a random value x, f x is the value we're interested in". The Random type isn't going to do anything other than represent these kinds of expressions. There's no implication that we actually have a random value for x yet.


> instance Functor (Random p) where
>     fmap f (Pure a) = Pure (f a)
>     fmap f (Bernoulli p g) = Bernoulli p (fmap f . g)


> instance Applicative (Random p) where
>     pure = return
>     (<*>) = ap


> instance Monad (Random p) where
>     return = Pure
>     Pure a >>= f = f a
>     Bernoulli p g >>= f = Bernoulli p (\x -> g x >>= f)


We'll use bernoulli p to represent a random Bernoulli variable drawn from \(B(p)\).


> bernoulli :: p -> Random p Int
> bernoulli p = Bernoulli p return


So let's write our first random expression:


> test1 :: Random Float Float
> test1 = do
>     xs <- replicateM 4 (bernoulli 0.75)
>     return $ fromIntegral $ sum xs


It sums 4 Bernoulli random variables from \(B(0.75)\) and converts the result to a Float. The expected value is 3.


We don't yet have a way to do anything with this expression. So let's write an interpreter that can substitute pseudo-random values for each occurrence of bernoulli p:


It's essentially interpreting our free monad as a state monad where the state is the random number seed:


> interpret1 :: (Ord p, R.Random p, R.RandomGen g) => Random p a -> g -> (a, g)
> interpret1 (Pure a) seed = (a, seed)
> interpret1 (Bernoulli prob f) seed = 
>     let (r, seed') = R.random seed
>         b       = if r <= prob then 1 else 0
>     in interpret1 (f b) seed'


You can use the expression R.getStdRandom (interpret1 test1) if you want to generate some random samples for yourself.


We're interested in the expected value, so here's a function to compute that:


> expect1 :: (Fractional p, Ord p, R.Random p, R.RandomGen g) => Random p p -> Int -> g -> (p, g)
> expect1 r n g = 
>     let (x, g') = sum1 0 r n g
>     in (x/fromIntegral n, g')


> sum1 :: (Ord p, Num p, R.Random p, R.RandomGen g) => p -> Random p p -> Int -> g -> (p, g)
> sum1 t r 0 g = (t, g)
> sum1 t r n g =
>     let (a, g') = interpret1 r g
>     in sum1 (t+a) r (n-1) g'


You can test it out with R.getStdRandom (expect1 test1 1000). You should get values around 3.


We can try completely different semantics for Random. This time we compute the entire probability distribution:


> interpret2 :: (Num p) => Random p a -> [(a, p)]
> interpret2 (Pure a) = [(a, 1)]
> interpret2 (Bernoulli p f) =
>     scale p (interpret2 (f 1)) ++ scale (1-p) (interpret2 (f 0))


> scale :: Num p => p -> [(a, p)] -> [(a, p)]
> scale s = map (\(a, p) -> (a, s*p))


You can try it with interpret2 test1.


Unfortunately, as it stands it doesn't collect together multiple occurrences of the same value. We can do that with this function:


> collect :: (Ord a, Num b) => [(a, b)] -> [(a, b)]
> collect = M.toList . M.fromListWith (+)


And now you can use collect (interpret2 test1).


Let's compute some expected values:


> expect2 :: (Num p) => Random p p -> p
> expect2 r = sum $ map (uncurry (*)) (interpret2 r)


The value of expect2 test1 should be exactly 3. One nice thing about interpret2 is that it is differentiable with respect to the Bernoulli parameter when this is meaningful. Unfortunately it has one very big catch: the value of interpret2 can be a very long list. Even a small simulation can result in lists too big to store in the known universe. But interpret1 doesn't produce differentiable results. Is there something in-between these two interpreters?



Importance sampling

Frequently in Monte Carlo sampling it isn't convenient to sample from the distribution you want. For example it might be intractably hard to do so, or you might have proven that the resulting estimate has a high variance. So instead you can sample from a different, but possibly related distribution. This is known as importance sampling. Whenever you do this you must keep track of how "wrong" your probability was and patch up your expectation estimate at the end. For example, suppose a coin comes up heads 3/4 of the time. Instead of simulating a coin toss that comes up 3/4 of the time you could simulate one that comes up heads half of the time. Suppose at one point in the simulation it does come up heads. Then you used a probability of 1/2 when you should have used 3/4. So when you compute the expectation you need to scale the contribution from this sample by (3/4)/(1/2) = 3/2. You need to scale appropriately for every random variable used. A straightforward way to see this for the case of a single Bernoulli variable is to note that

\(p\cdot 1+(1-p)\cdot 0 = q\cdot\frac{p}{q}\cdot 1+(1-q)\cdot\frac{1-p}{1-q}\cdot 0\).

We've replaced probabilities \(p\) and \(1-p\) with \(q\) and \(1-q\) but we had to scale appropriately in each of the cases \(X=1\) and \(X=0\) to keep the final value the same. I'm going to call the scale value the importance. If we generate \(n\) random numbers in a row we need to multiply all \(n\) of the importance values that we generate. This is a perfect job for the Writer monad using the Product monoid. (See Eric Kidd's paper for some discussion about the connection between Writer and importance sampling.) However I'm just going to write an explicit interpreter for our free monad to make it clear what's going where.
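This reweighting is easy to sanity-check numerically: if we sample heads with probability \(q\) instead of \(p\), scaling heads by \(p/q\) and tails by \((1-p)/(1-q)\) leaves the expectation unchanged. A throwaway sketch of mine, not part of the interpreter code below:

```haskell
-- Exact expectation of a {0,1}-valued coin with P(heads) = p, computed
-- by sampling with the "wrong" probability q and reweighting each case.
reweighted :: Double -> Double -> Double
reweighted p q = q * (p/q) * 1 + (1-q) * ((1-p)/(1-q)) * 0

main :: IO ()
main = print (reweighted 0.75 0.5)  -- prints 0.75, the true expectation
```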


This interpreter is going to take an additional argument as input. It'll be a rule saying what probability we should sample with when handling a variable drawn from \(B(p)\). The probability should be a real number in the interval \([0,1]\).


> interpret3 :: (Fractional p, R.RandomGen g) =>
>               (p -> Float) -> Random p a -> g -> ((a, p), g)
> interpret3 rule (Pure a) g = ((a, 1), g)
> interpret3 rule (Bernoulli p f) g = 
>     let (r, g') = R.random g
>         prob = rule p
>         (b, i)  = if (r :: Float) <= prob
>           then (1, p/realToFrac prob)
>           else (0, (1-p)/realToFrac (1-prob))
>         ((a, i'), g'') = interpret3 rule (f b) g'
>     in ((a, i*i'), g'')


Here's the accompanying code for the expectation:


> expect3 :: (Fractional p, R.RandomGen g) =>
>            (p -> Float) -> Random p p -> Int -> g -> (p, g)
> expect3 rule r n g = 
>     let (x, g') = sum3 rule 0 r n g
>     in (x/fromIntegral n, g')


> sum3 :: (Fractional p, R.RandomGen g) =>
>         (p -> Float) -> p -> Random p p -> Int -> g -> (p, g)
> sum3 rule t r 0 g = (t, g)
> sum3 rule t r n g =
>     let ((a, imp), g') = interpret3 rule r g
>     in sum3 rule (t+a*imp) r (n-1) g'


For example, you can estimate the expectation of test1 using unbiased coin tosses by evaluating R.getStdRandom (expect3 (const 0.5) test1 1000).



Generalising probability

Did you notice I made my code slightly more general than seems to be needed? Although I use probabilities of type Float to generate my Bernoulli samples, the argument to the function bernoulli can be of a more general type. This means that we can use importance sampling to compute expected values for generalised measures that take values in a more general algebraic structure than the interval [0,1]. For example, we could use negative probabilities. An Operational Interpretation of Negative Probabilities and No-Signalling Models by Abramsky and Brandenburger gives a way to interpret expressions involving negative probabilities. We can implement it using interpret3 and the rule \p -> abs p/(abs p+abs (1-p)). Note that it is guaranteed to produce values in the range [0,1] (if you start with dual numbers with real parts that are ordinary probabilities) and reproduces the usual behaviour when given ordinary probabilities.


Here's a simple expression using a sample from "\(B(2)\)":


> test2 = do
>     a <- bernoulli 2
>     return $ if a==1 then 2.0 else 1.0


Its expected value is 3. We can get this exactly using expect2 test2. For a Monte Carlo estimate use


R.getStdRandom (expect3 (\p -> abs p/(abs p+abs (1-p))) test2 1000)


Note that estimates involving negative probabilities can have quite high variances so try a few times until you get something close to 3 :-)


We don't have to stick with real numbers. We can use this approach to estimate with complex probabilities (aka quantum mechanics) or other algebraic structures.



Discrete yet differentiable

And now comes the trick: automatic differentiation uses the algebra of dual numbers. It's not obvious at all what a probability like \(p+\epsilon\) means when \(\epsilon\) is infinitesimal. However, we can use interpret3 to give it meaningful semantics.


Let's define the duals in the usual way first:


> data Dual a = D { real :: a, infinitesimal :: a }


> instance (Ord a, Num a) => Num (Dual a) where
>     D a b + D a' b' = D (a+a') (b+b')
>     D a b * D a' b' = D (a*a') (a*b'+a'*b)
>     negate (D a b) = D (negate a) (negate b)
>     abs (D a b) = if a > 0 then D a b else D (-a) (-b)
>     signum (D a b) = D (signum a) 0
>     fromInteger a = D (fromInteger a) 0


> instance (Ord a, Fractional a) => Fractional (Dual a) where
>     fromRational a = D (fromRational a) 0
>     recip (D a b) = let ia = 1/a in D ia (-b*ia*ia)


> instance Show a => Show (Dual a) where
>     show (D a b) = show a ++ "[" ++ show b ++ "]"


Now we can use the rule real to extract a real-valued probability from a dual number. The function expect3 will push the infinitesimal part into the importance value so it doesn't get forgotten about. And now expect3 gives us an estimate that is differentiable despite the fact that our random variables are discrete.


Let's try an expression:


> test3 p = do
>     a <- bernoulli p
>     b <- bernoulli p
>     return $ if a == 1 && b == 1 then 1.0 else 0.0


The expected value is \(p^2\) and the derivative is \(2p\). We can evaluate these at \(p=0.5\) with expect2 (test3 (D 0.5 1)). And we can estimate it with


R.getStdRandom (expect3 real (test3 (D 0.5 1)) 1000)


What's neat is that we can parameterise our distributions in a more complex way and we can freely mix with conventional expressions in our parameter. Here's an example:


> test4 p = do
>     a <- bernoulli p
>     b <- bernoulli (p*p)
>     return $ p*fromIntegral a*fromIntegral b


Try evaluating expect2 (test4 (D 0.5 1)) and
R.getStdRandom (expect3 real (test4 (D 0.5 1)) 1000)


I've collected the above examples together here:


> main = do
>     print =<< R.getStdRandom (interpret1 test1)
>     print $ collect $ interpret2 test1
>     print =<< R.getStdRandom (expect1 test1 1000)
>     print (expect2 test1)
>     print =<< R.getStdRandom (expect3 id test1 1000)
>     print =<< R.getStdRandom (expect3 (const 0.5) test1 1000)
>     print "---"
>     print $ expect2 test2
>     print =<< R.getStdRandom (expect3 (\p -> abs p/(abs p+abs (1-p))) test2 1000)
>     print "---"
>     print $ expect2 (test3 (D 0.5 1))
>     print =<< R.getStdRandom (expect3 real (test3 (D 0.5 1)) 1000)
>     print "---"
>     print $ expect2 (test4 (D 0.5 1))
>     print =<< R.getStdRandom (expect3 real (test4 (D 0.5 1)) 1000)



What just happened?

You can think of a dual number as a real number that has been infinitesimally slightly deformed. To differentiate something we need to deform something. But we can't deform 0 or 1 and have them stay 0 or 1. So the trick is to embed probability sampling in something "bigger", namely importance sampling, where samples carry around an importance value. This bigger thing does allow infinitesimal deformations. And that allows differentiation. This process of turning something discrete into something continuously "deformable" is generally called relaxation.



Implementation details

I've made no attempt to make my code fast. However I don't think there's anything about this approach that's incompatible with performance. There's no need to use a monad. Instead you can track the importance value through your code by hand and implement everything in C. Additionally, I've previously written about the fact that for any trick involving forward mode AD there is another corresponding trick you can use with reverse mode AD. So this method is perfectly compatible with back-propagation. Note also that the dual number importances always have real part 1 which means you don't actually need to store them.


The bad news is that the derivative estimate can sometimes have a high variance. Nonetheless, I've used it successfully for some toy optimisation problems. I don't know if this approach is effective for industrial strength problems. Your mileage may vary :-)



Alternatives

Sometimes you may find that it is acceptable to deform the samples from your discrete distribution. In that case you can use the Concrete relaxation.



Continuous variables

The above method can be adapted to work with continuous variables. There is a non-trivial step which I'll leave as an exercise but I've tested it in some Python code. I think it reproduces a standard technique and it gives an alternative way to think about that trick. That article is also useful for ways to deal with the variance issues. Note also that importance sampling is normally used itself as a variance reduction technique. So there are probably helpful ways to modify the rule argument to interpret3 to simultaneously estimate derivatives and keep the variance low.



Personal note

I've thought about this problem a couple of times over the years. Each time I've ended up thinking "there's no easy way to extend AD to work with random variables so don't waste any more time thinking about it". So don't listen to anything I say. Also, I like that this method sort of comes "for free" once you combine methods I've described previously.



Acknowledgements

I think it was Eric Kidd's paper on building probability monads that first brought to my attention that there are many kinds of semantics you can use with probability theory - i.e. there are many interpreters you can write for the Random monad. I think there is an interesting design space worth exploring here.



Answer to exercise

I set the continuous case as an exercise above. Here is a solution.


Suppose you're sampling from a distribution parameterised by \(\theta\) with pdf \(f(\theta, x)\). To compute the derivative with respect to \(\theta\) you need to consider sampling from \(f(\theta+\epsilon, x)\) where \(\epsilon\) is an infinitesimal.

\(f(\theta+\epsilon, x) = f(\theta, x)+\epsilon\frac{\partial}{\partial\theta}f(\theta, x) = f(\theta, x)\left(1+\epsilon\frac{\partial}{\partial\theta}\log f(\theta, x)\right)\).

As we don't know how to sample from a pdf with infinitesimals in it, we instead sample using \(f(\theta, x)\) as usual, but use an importance of \(1+\epsilon\frac{\partial}{\partial\theta}\log f(\theta, x)\).

The coefficient of the \(\epsilon\) gives the derivative. So we need to compute the expectation, scaling each sample with this coefficient. In other words, to estimate \(\frac{\partial}{\partial\theta}E[g(X)]\) we use

\(\frac{1}{N}\sum_{i=1}^{N}g(x_i)\frac{\partial}{\partial\theta}\log f(\theta, x_i)\)

where the \(x_i\) are drawn from the original distribution. This is exactly what is described at Shakir Mohamed's blog.
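For a discrete distribution we can check this estimator exactly by summing over outcomes instead of sampling. Here's my own sketch for a Bernoulli variable, where \(f(\theta,1)=\theta\) and \(f(\theta,0)=1-\theta\), so the score \(\frac{\partial}{\partial\theta}\log f\) is \(1/\theta\) for a 1 and \(-1/(1-\theta)\) for a 0:

```haskell
-- E[g(X) * d/dtheta log f(theta, X)] computed exactly for Bernoulli(theta);
-- it should equal d/dtheta E[g(X)] = g 1 - g 0.
scoreEstimate :: Double -> (Double -> Double) -> Double
scoreEstimate theta g =
      theta     * g 1 * (1 / theta)          -- outcome 1, score 1/theta
  + (1 - theta) * g 0 * (-1 / (1 - theta))   -- outcome 0, score -1/(1-theta)

main :: IO ()
main = print (abs (scoreEstimate 0.75 id - 1.0) < 1e-12)
-- prints True: with g = id this recovers d/dtheta of E[X] = theta, i.e. 1
```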



Final word

I managed to find the method in the literature. It's part of the REINFORCE method. For example, see equation (5) there.
