Saturday, October 14, 2017

A tail we don't need to wag


I've been reading a little about concentration inequalities recently. I thought it would be nice to see if you can use the key idea, if not the actual theorems, to reduce the complexity of computing the probability distribution of the outcome of stochastic simulations. Examples might include random walks, or queues.

The key idea behind concentration inequalities is that very often most of the probability is owned by a small proportion of the possible outcomes. For example, if we toss a fair coin enough (say ) times we expect the number of heads to lie within of the mean about 99.99% of the time despite there being different total numbers possible. The probable outcomes tend to concentrate around the expectation. On the other hand, if we consider not the total number of heads, but the possible sequences of tosses, there are possibilities, all equally likely. In this case there is no concentration. So a key ingredient here is a reduction operation: in this case reducing a sequence of tosses to a count of the number that came up heads. This is something we can use in a computer program.

I (and many others) have written about the "vector space" monad that can be used to compute probability distributions of outcomes of simulations and I'll assume some familiarity with that. Essentially it is a "weighted list" monad which is similar to the list monad except that in addition to tracking all possible outcomes, it also propagates a probability along each path. Unfortunately it needs to follow through every possible path through a simulation. For example, in the case of simulating coin tosses it needs to track different possiblities, even though we're only interested in the possible sums. If, after each bind operation of the monad, we could collect together all paths that give the same total then we could make this code much more efficient. The catch is that to collect together elements of a type the elements need to be comparable, for example instances of Eq or Ord. This conflicts with the type of Monad which requires that we can use the >>= :: m a -> (a -> m b) -> m b and return :: a -> m a functions with any types a and b.

I'm going to deal with this by adapting a technique presented by Oleg Kiselyov for efficiently implementing the Set monad. Instead of Set I'm going to use the Map type to represent probability distributions. These will store maps saying, for each element of a type, what the probability of that element is. So part of my code is going to be a direct translation of that code to use the Map type instead of the Set type.

> {-# LANGUAGE GADTs, FlexibleInstances #-}
> {-# LANGUAGE ViewPatterns #-}

> module Main where

> import Control.Monad > import Control.Arrow > import qualified Data.Map as M > import qualified Data.List as L

The following code is very similar to Oleg's. But for first reading I should point out some differences that I want you to ignore. The type representing a probability distribution is P:

> data P p a where
>     POrd :: Ord a => p -> M.Map a p -> P p a
>     PAny :: p -> [(a, p)] -> P p a

But note how the constructors take two arguments - a number that is a probability, in addition to a weighted Map or list. For now pretend that first argument is zero and that the functions called trimXXX act similarly to the identity:

> instance (Ord p, Num p) => Functor (P p) where
>     fmap = liftM

> instance (Ord p, Num p) => Applicative (P p) where > pure = return > (<*>) = ap

> instance (Ord p, Num p) => Monad (P p) where > return x = PAny 0 [(x, 1)] > m >>= f = > let (e, pdf) = unP m > in trimAdd e $ collect $ map (f *** id) pdf

> returnP :: (Ord p, Num p, Ord a) => a -> P p a > returnP a = POrd 0 $ M.singleton a 1

> unP :: P p a -> (p, [(a, p)]) > unP (POrd e pdf) = (e, M.toList pdf) > unP (PAny e pdf) = (e, pdf)

> fromList :: (Num p, Ord a) => [(a, p)] -> M.Map a p > fromList = M.fromListWith (+)

> union :: (Num p, Ord a) => M.Map a p -> M.Map a p -> M.Map a p > union = M.unionWith (+)

> scaleList :: Num p => p -> [(a, p)] -> [(a, p)] > scaleList weight = map (id *** (weight *))

> scaleMap :: (Num p, Ord a) => p -> M.Map a p -> M.Map a p > scaleMap weight = fromList . scaleList weight . M.toList

This is a translation of Oleg's crucial function that allows us to take a weighted list of probability distributions and flatten them down to a single probability distribution:

> collect :: Num p => [(P p a, p)] -> P p a
> collect []  = PAny 0 []
> collect ((POrd e0 pdf0, weight) : rest) =
>     let wpdf0 = scaleMap weight pdf0
>     in case collect rest of
>       POrd e1 pdf1 -> POrd (weight*e0+e1) $ wpdf0 `union` pdf1
>       PAny e1 pdf1 -> POrd (weight*e0+e1) $ wpdf0 `union` fromList pdf1
> collect ((PAny e0 pdf0, weight) : rest) =
>     let wpdf0 = scaleList weight pdf0
>     in case collect rest of
>       POrd e1 pdf1 -> POrd (weight*e0+e1) $ fromList wpdf0 `union` pdf1
>       PAny e1 pdf1 -> PAny (weight*e0+e1) $ wpdf0 ++ pdf1

But now I really must explain what the first argument to POrd and PAny is and why I have all that "trimming".

Even though the collect function allows us to reduce the number of elements in our PDFs, we'd like to take advantage of concentration of probability to reduce the number even further. The trim function keeps only the top probabilities in a PDF, discarding the rest. To be honest, this is the only point worth taking away from what I've written here :-)

When we throw away elements of the PDF our probabilities no longer sum to 1. So I use the first argument of the constructors as a convenient place to store the amount of probability that I've thrown away. The trim function keeps the most likely outcomes and sums the probability of the remainder. I don't actually need to keep track of what has been discarded. In principle we could reconstruct this value by looking at how much the probabilities in our trimmed partial PDFs fall short of summing to 1. But confirming that our discarded probability and our partial PDF sums to 1 gives a nice safety check for our code and can give us some warning if numerical errors start creeping in. I'll call the total discarded probability the tail probability.

Here is the core function to keep the top values. In this case is given by a global constant called trimSize. (I'll talk about how to do this better later.)

> trimList :: (Ord p, Num p) => [(a, p)] -> (p, [(a, p)])
> trimList ps =
>     let (keep, discard) = L.splitAt trimSize (L.sortOn (negate . snd) ps)
>     in (sum (map snd discard), keep)

> trimAdd :: (Ord p, Num p) => p -> P p a -> P p a > trimAdd e' (POrd e pdf) = > let (f, trimmedPdf) = trimList (M.toList pdf) > in POrd (e'+e+f) (M.fromList trimmedPdf) > trimAdd e' (PAny e pdf) = > let (f, trimmedPdf) = trimList pdf > in PAny (e'+e+f) trimmedPdf

> runP :: (Num p, Ord a) => P p a -> (p, M.Map a p) > runP (POrd e pdf) = (e, pdf) > runP (PAny e pdf) = (e, fromList pdf)

And now some functions representing textbook probability distributions. First the uniform distribution on a finite set. Again this is very similar to Oleg's chooseOrd function apart from the fact that it assigns weights to each element:

> chooseP :: (Fractional p, Ord p, Ord a) =>
>            [a] -> P p a
> chooseP xs = let p = 1/fromIntegral (length xs)
>              in POrd 0 $ fromList $ map (flip (,) p) xs

And the Bernoulli distribution, i.e. tossing a Bool coin that comes up True with probability :

> bernoulliP :: (Fractional p, Ord p) =>
>               p -> P p Bool
> bernoulliP p = POrd 0 $ fromList $ [(False, 1-p), (True, p)]

Now we can try a random walk in one dimension. At each step we have a 50/50 chance of standing still or taking a step to the right:

> random_walk1 :: Int -> P Double Int
> random_walk1 0 = returnP 0
> random_walk1 n = do
>     a <- random_walk1 (n-1)
>     b <- chooseP [0, 1]
>     returnP $ a+b

Below in main we take 2048 steps but only track 512 probabilities. The tail probability in this case is about . So only tracking 1/4 of the outcomes has had almost no impact on the numbers. This also illustrates why it is good to track the tail probabilities rather than inferring them from the missing probabilities in the bulk of the PDF - they can be so small they vanish compared to floating poimnt errors. We can afford to track a lot fewer than 512 (out of 2049 possible) outcomes and still have a good representative PDF.

Now here's a two-dimensional random walk for 32 steps. The tail probability is about so we are getting a reasonably representative PDF. We have to run fewer steps than before, however, because the space of possible outcomes spans two dimensions, meaning that reduction doesn't help as much as it does in one dimension.

> random_walk2 :: Int -> (Int, Int) -> P Double (Int, Int)
> random_walk2 0 (x, y) = returnP (x, y)
> random_walk2 n (x, y) = do
>     (x',y') <- random_walk2 (n-1) (x, y)
>     dx <- chooseP [-1, 1]
>     dy <- chooseP [-1, 1]
>     returnP (x'+dx, y'+dy)

One last simulation. This is a queing scenario. Tasks come in once every tick of the clock. There are four queues a task can be assigned to. A task is assigned to the shortest queue. Meanwhile each queue as a 1/4 probability of clearing one item at each tick of the clock. We build the PDF for the maximum length any queue has at any time.

The first argument to queue is the number of ticks of the clock. The second argument is the list of lengths of the queues. It returns a PDF, not just on the current queue size, but also on the longest queue it has seen.

> queue :: Int -> [Int] -> P Double (Int, [Int])
> queue 0 ls = returnP (maximum ls, ls)
> queue n ls = do
>     (longest, ls1) <- queue (n-1) ls
>     ls2 <- forM ls1 $ \l -> do
>         served <- bernoulliP (1/4)
>         returnP $ if served && l > 0 then l-1 else l
>     let ls3 = L.sort $ head ls2+1 : tail ls2
>     returnP (longest `max` maximum ls3, ls3)

For the queing simulation the tail probability is around despite the fact that we have discarded a vast possible set of possible outcomes.

It's a little ugly that trimSize is a global constant:

> trimSize = 512

The correct solution is probably to separate the probability "syntax" from its "semantics". In other words, we should implement a free monad supporting the language of probability with suitable constructors for bernoulliP and choiceP. We can then write a separate interpreter which takes a trimSize as argument. This has another advantage too: the Monad above isn't a true monad. It uses a greedy approach to discarding probabilities and different rearrangements of the code, that ought to give identical results, may end up diferent. By using a free monad we ensure that our interface is a true monad and we can put the part of the code that breaks the monad laws into the interpreter. The catch is that my first attempt at writing a free monad resulted in code with poor performance. So I'll leave an efficient version as an exercise :-)

> main = do
>     print $ runP $ random_walk1 2048
>     print $ runP $ random_walk2 32 (0, 0)
>     print $ runP $ do
>         (r, _) <- queue 128 [0, 0, 0, 0]
>         returnP r

Friday, August 11, 2017

What is a photon?


Popular science writing about quantum mechanics leaves many people full of questions about the status of photons. I want to answer some of these without using any tricky mathematics.

One of the challenges is that photons are very different to ordinary everyday objects like billiard balls. This is partly because photons are described by quantum mechanics whereas billiard balls are better modelled with classical Newtonian mechanics. Quantum mechanics defies many of our intuitions. But it's also because the word photon plays by different linguistic rules to billiard ball. I hope to explain why.

One of my goals is to avoid saying anything original. I'm largely going remove the mathematics from material I first learnt from three or so courses I took at Cambridge University many years ago: Quantum Mechanics, Solid State Physics and Quantum Field Theory. I also learnt about some of this from David Miller at Stanford University who talked a little about what properties it is meaningful to apply to a photon. (I hope I haven't misrepresented him too badly.)

The simple harmonic oscillator

Here's a mass hanging on a spring:

Suppose it's initially sitting in equilibrium so that the net force acting on it is zero. Now we lift the mass a small distance and let it go. Because we lifted it, we shortened the spring, reducing its tension. This means the force due to gravity is now more than the spring tension and the mass falls. Eventually it falls below the equilibrium point, increasing the tension in the spring so there is a net force pulling it back up again. To a good approximation, the force restoring the mass to its equilibrium point is proportional to how far it has been displaced. When this happens we end up with oscillating motion where the mass bounces up and down. Here's what a graph of its displacement looks like over time:

It's actually a sine wave but that detail doesn't matter for us right now.

An oscillator where the restoring force is proportional to the displacement from the equilibrium point is called a simple harmonic oscillator and its oscillation is always described by a sine wave.

Note that I'm ignoring friction here. This is a reasonable approximation for many physical systems.

Masses on springs aren't all that important in themselves. But simple harmonic oscillators are very common. Another standard example is the pendulum swinging under the influence of gravity:

At a more fundamental level, an example might be an atom in a crystal being held in place by electrostatic forces from its neighbouring atoms.

If you have one of these systems, then in principle you can set it in motion with as little energy as you like. Pull a mass on a spring down a little bit and it will bounce back up, oscillating a certain amount. Pull the mass down half the amount and it'll bounce with oscillations half the size. In principle we could keep repeating this experiment, each time starting with the mass displaced half the amount we tried previously. In other words, a simple harmonic oscillator can have any energy we like. The spectrum of possible energies of one of these oscillators is continuous. (Note that the word spectrum here is merely physicist-speak for a set of possible values.) If we can set one in motion with 1 unit of energy then we can also set it oscillating with 0.5 units, or 0.01 units, or 0.000123 units of energy.

Quantum mechanics

Everything I've said above is assuming that classical Newtonian mechanics is valid. But we know that for very small systems, around the size of a few atoms or smaller, we need to use quantum mechanics. This is an enormous topic but I'm only going to extract one basic fact. According to quantum mechanics, a simple harmonic oscillator isn't free to oscillate with any energy you like. The possible energy levels, the spectrum of the system, is discrete. There is a lowest energy level, and then all of the energy levels above that are equally spaced like so, going up forever:

We usually call the lowest energy level the ground state or vacuum state and call the higher levels excited states.

The spacing of the energy levels depends on the stiffness of the system, which is just a measure of how much the restoring force increases with displacement from equilibrium. Stiffer systems will have a higher frequency of oscillation and a bigger spacing between the energy levels.

(I'm deliberately not saying anything about why we get discrete energy levels in quantum mechanics. I just want to use this one fact so I can get on and talk about photons eventually.)

In practice the difference in energy between one level and the next is tiny. This means that if you're literally fiddling about with a mass on a spring you won't ever feel the discreteness. The amount your hand trembles is many orders of magnitude greater than the effect of this discreteness. Nonetheless, it is extremely important when modeling microscopic systems.

Quantum linguistics

Here are some English sentences we could say about the kinds of systems I've described so far:

  1. This system is in the ground state.
  2. That system is in its first excited state
  3. This system is at an energy level higher than that system
  4. After allowing these two similar oscillators to interact, the energy level of this oscillator went down and the energy level of that one went up by the same amount.

Now I want to introduce the (count) noun quantum, with plural quanta. The idea here is not that I'm telling you about a new entity. I want to present this as a new way to talk about things I've already introduced. So rather than give a definition of quantum I will instead show how you can rewrite the above sentences using the language of quanta:

  1. There are no quanta in this system
  2. That system has one quantum of energy
  3. This system has more quanta than that one
  4. Some quanta were transferred from this system to that system.

Those sentences make it seem like I'm talking about a new kind of object - the quantum. But I'm not. They're just a manner of speaking about energy levels. I hope I've given you enough examples to get the idea.

Just in case you think it's weird to talk about energy levels in terms of quanta, I'd like to remind you that you already do this all the time with money. Dollar bills are actual objects that exist in the world. But money in your bank account isn't. Somewhere in some database is a representation of how much money you have. You might say "I have one hundred dollars in my savings account" But those dollars certainly don't exist as distinct entities. It doesn't really make sense to talk about the thirty-seventh dollar in your bank account. You can transfer dollars from one account to another, and yet what's really happening is that two totals are being adjusted. We treat these accounts a lot like they're containers holding individual objects called dollars. Certainly our language is set up like that. But we know that it's really just the totals that have any kind of representation. The same goes for quanta. It's just a manner of speaking about systems that can have different amounts of energy and where the spectrum of energy levels forms a ladder with equally spaced rungs. Because of your experience with money I probably don't need to give you any more examples.

One more bit of terminology: when the spectrum of energies is discrete it's said to be quantised.

Coupled systems

Let's return to classical physics with a slightly more complex system consisting of two masses connected to springs. We ignore gravity now:

We restrict ourselves to just considering back and forth motion constrained along a horizontal line. This is a coupled system. If the left mass moves to the right, not only does it experience a restoring force pushing it left, but the mass on the right will experience more of a force pushing it to the left. We can't treat the masses as independent and so we don't get the simple solution of each mass always oscillating with a sine wave.

For this particular problem though there's a trick to turn it into a pair of harmonic oscillators. The idea is to consider the pair of masses as a single entity. We can think of the motion centre of mass of the pair, the midpoint between them, as being one variable that describes this entity. Let's call its motion the external motion. We can also think of the distance between the two masses in the pair as being the system's internal motion. (I'm just using internal and external as convenient names. Don't read too much into them.) It turns out that when you analyse this using classical dynamics the internal motion and the external motion act like independent quantities. What's more, each one behaves exactly like it's simple harmonic. So we get one sine wave describing the overall motion of the pair, and another one that describes how the elements of the pair oscillate with respect to each other.

The frequencies of the internal and external motions are typically different. So you can end up with some quite complicated motions with two different frequencies beating against each other.

When we're able to find ways to split up the motion into independent quantities, each of which is simple harmonic, each kind of motion is said to be a normal mode.

When you have independent normal modes, you can treat them independently in quantum mechanics too. So what we get is that the spectrum of possible energy levels for this system is, in some sense, two-dimensional. We can put quanta into the internal oscillation and we can also put quanta into the external oscillation. Because these modes have different frequencies the quanta for each mode correspond to different amounts of energy.

(And a reminder: when talking about quantum mechanics I'm not talking literally about masses on springs. I'm talking about physical systems that have equations of motion that mean they behave like masses on springs. In this case it might be a pair of particles trapped in a microscopic well with a repulsive force between them.)

Solid state physics

Now I'm going to jump from just two masses to a large number of them. For example, the behavior of trillions of atoms in a solid crystal can be approximately modelled by a grid of masses and springs, of which the following diagram is just a tiny piece:

A real crystal would be arranged in a 3D lattice but I've drawn 2D here for convenience.

Think of the springs as both pushing apart atoms that get close, and pulling together atoms that move apart.

This is a highly coupled system. Ultimately every atom in our lattice is connected to every other one, either directly, or indirectly. Nonetheless, it is still possible to find normal modes. The normal modes all have the same basic form: they are all sinusoidal waves of displacement traveling in some direction with some speed and oscillation frequency. Each of these modes consists of waves that extend through the entire crystal, with fixed spacing between parallel planar wavefronts. This type of waves is known as a plane wave. If the system is perfectly harmonic, so the restoring force is precisely proportional to the displacement, then each direction and frequency of wave oscillates its way through the crystal completely independently of any other. Just as how in the example with two masses any possible oscillation is a combination of internal and external motion, for a crystal lattice any motion is a combination of these plane waves. (Decomposing any oscillation as a combination of plane waves is known as computing its Fourier transform.

Now we're ready to consider this situation quantum mechanically. Because each plane wave is a normal mode, we can treat each one as an independent simple harmonic oscillator. This means that the energy in each plane wave is quantised. So when we consider a crystal lattice quantum mechanically we find that its states consist of plane waves propagating through it, but where the amount of energy in each wave is given by a discrete spectrum. So again we can talk about how many quanta there are in each mode.

Linguistically it gets a bit more interesting now. Each plane wave is associated with a particular direction and speed so it makes sense to talk of these quanta as having a direction and speed. But note that statements involving quanta are still really just sentences about energy levels. So, for example, the statement "the mode of this system with this velocity and frequency is in its first excited state" is, by definition, exactly the same as "this system has precisely one quantum with this velocity and frequency". In particular, when we write sentences like these we aren't implying that there is some new kind of object, the quantum, that has suddenly attached itself to our crystal. The quanta are properties of the lattice. By the way, in the particular case of vibrating atoms in a lattice, the quanta are known by a special name: phonons.

Quantum field theory and photons

And now we're ready to move onto photons.

In classical physics, electromagnetism is described by Maxwell's equations. Maxwell's equations say that a varying magnetic field generates an electric field and a varying electric field generates a magnetic field. The result is that it is possible for an oscillating electric field to create an oscillating electric field so that an electric field can propagate through space on its own without the help of electric charges or electric currents or any other kind of `generator'. As these electric fields also produce magnetic fields that propagate with them, the whole thing is called an electromagnetic wave.

Just like displacements in a crystal lattice, an electromagnetic wave also has normal modes. The normal modes are plane waves traveling at the speed of light in a particular directions with a given frequency. You have personal experience of this. Visible light is electromagnetic radiation with a frequency of around 500 THz. Wifi uses signals at around 5 GHz. The radio might use signals at around 100 MHz. When you surf the web wirelessly while listening to the radio, the wifi signals don't interfere with your vision or the radio signal. (Actually, wifi might interfere with the radio signals, but not because of the 5 GHz signals. It might happen if badly manufactured hardware emits stray signals around the 100 MHz band.) That's because these waves pass through each other without being coupled to each other in any way. And at this point you might already be guessing what a photon is. For each choice of frequency and direction (and also polarisation, but that's just a detail) the amount of energy that can be in the corresponding mode is quantised. For the electromagnetic field the quanta are called photons.

And that's it!

Electromagnetic waves can be thought of as being made up of different oscillation modes. Because of quantum mechanics, each mode contains an amount of energy that is quantised to be a whole number multiple of some base amount. Although the thing that really matters is the total amount of energy in the modes, it can still be useful to talk about this total as if it's a collection of entities called photons.

One thing to notice is that the normal modes for an electromagnetic wave are plane waves that are extended in space. In principle all the way across the universe but for practical problems physicists often consider electromagnetic waves in a large but finite box. This means that adding a quantum to a system has an effect that extends across the entire system. That makes it problematic to talk about the location of a photon.


Physicists sometimes use the word photon in slightly different but related ways. I've described what I think of as the core definition as presented in many courses on quantum field theory.


Thanks to @dmoore2718 for encouraging me to edit this document down to a better size.

Saturday, July 15, 2017

Self-referential logic via self-referential circuits


TL;DR The behaviour of a certain kind of delay component has a formal similarity to Löb's theorem which gives a way to embed part of provability logic into electronic circuits.

Here's a famous paradoxical sentence:

This sentence is false

If it's false then it's true and if it's true then it's false.

Here's a paradoxical electronic circuit:

The component in the middle is an inverter. If the output of the circuit is high then its input is high and then it's output must be low, and vice versa.

There's a similarity here. But with a bit of tweaking you can turn the similarity into an isomorphism of sorts.

In the first case we avoid paradox by noting that in the mathematical frameworks commonly used by mathematicians it's impossible, in general, for a statement to assert it's own falsity. Instead, a statement can assert its own unprovability and then we get Gödel's incompleteness theorems and a statement that is apparently true and yet can't be proved.

In the second case we can't model the circuit straightforwardly as a digital circuit. In practice it might settle down to a voltage that lies between the official high and low voltages so we have to model it as an analogue circuit. Or instead we can introduce a clock and arrange that the feedback in the circuit is delayed. We then get an oscillator circuit that can be thought of as outputting a stream of bits.

The observation I want to make is that if the feedback delay is defined appropriately, these two scenarios are in some sense isomorphic. This means that we can model classic results about provability, like Gödel's incompleteness theorems, using electronic circuits. We can even use such circuits to investigate what happens when logicians or robots play games like Prisoner's Dilemma. I'll be making use of results found in Boolos' book on The Logic of Provability and some ideas I borrowed from Smoryński's paper on Fixed Point Algebras. I'll be assuming the reader has at least a slight acquaintance with ithe ideas behind provability logic.

Provability Logic

There are many descriptions of provability logic (aka GL) available online, so I'm not going to repeat it all here. However, I've put some background material in the appendix below and I'm going to give a very brief reminder now.

Start with (classical) propositional calculus which has a bunch of variables with names like and connectives like for AND, for OR, for NOT and for implication. (Note that .)

Provability logic extends propositional calculus by adding a unary operator . (I apologise, that's meant to be a □ but it's coming out like in LaTeX formulae. I think it's a bug in Google's LaTeX renderer.) The idea is that asserts that is provable in Peano Arithmetic, aka PA. In addition to the axioms of propositional calculus we have

as well as a rule that allows us to deduce from .

We also have this fixed point property:

Let be any predicate we can write in the language of GL involving the variable , and suppose that every appearance of in is inside a , e.g. . Then there is a fixed point, i.e. a proposition that makes no mention of such that is a theorem. In effect, for any such , is a proposition that asserts .

See the appendix for a brief mention of why we should expect this to be true.

From the fixed point property we can deduce Löb's theorem: . There is a proof at wikipedia that starts from the fixed point property.

We can also deduce the fixed point property from Löb's theorem so it's more usual to take Löb's theorem as an axiom of GL and show that the fixed point property follows. You can think of Löb's theorem as a cunning way to encode the fixed point property. In fact you can argue that it's a sort of Y-combinator, the function that allows the formation of recursive fixed points in functional programming languages. (That's also, sort of, the role played by the loeb function I defined way back. But note that loeb isn't really a proof of Löb's theorem, it just has formal similarities.)

Back to electronic circuits

In order to make digital circuits with feedback loops well-behaved I could introduce a circuit element that results in a delay of one clock cycle. If you insert one of these into the inverter circuit I started with you'll end up with an oscillator that flips back and forth between 0 and 1 on each clock cycle. But I want to work with something slightly stricter. I'd like my circuits to eventually stop oscillating. (I have an ulterior motive for studying these.) Let me introduce this component:

It is intended to serve as a delayed latch and I'll always have the flow of data being from left to right. The idea is that when it is switched on it outputs 1. It keeps outputting 1 until it sees a 0 input. When that happens, then on the next clock cycle its output drops to 0 and never goes back up to 1 until reset.

Because the output of our delay-latch isn't a function of its current input, we can't simply describe its operation as a mathematical function from to . Instead let's think of electronic components as binary operators on bitstreams, i.e. infinite streams of binary digits like ...00111010 with the digits emerging over time starting with the one written on the right and working leftwards. The ordinary logic gates perform bitwise operations which I'll represent using the operators in the C programming language. For example,

...001110 & ...101010 = ...001010
~...101 = ...010
and so on. Let's use □ to represent the effect of latch-delay on a bitstream. We have, for example,
□...000 = ...001
□...11101111 = ...00011111.
The operator □ takes the (possibly empty) contiguous sequence of 1's at the end of the bitstream, extends it by one 1, and sets everything further to the left to 0. If we restrict ourselves to bitstreams that eventually become all 0's or all 1's on the left, then bitstreams are in one-to-one correspondence with the integers using the twos complement representation. For example ...111111, all 1's, represents the number -1. I'll simply call the bistreams that represent integers integers. With this restriction we can use a classic C hacker trick to write □p=p^(p+1) where ^ is the C XOR operator. The operator □ outputs the bits that get flipped when you add one.

Let's use the symbol so that a → b is shorthand for ~a|b. Here are some properties of □:

1. □(-1) = -1

2. □p → □□p = -1

3. □(p → q) → □p → □q = -1

In addition we have the fixed point property:

Let F(p) be any function of p we can write using □ and the bitwise logical operators and such that all occurrences of p occur inside □. Then there is a unique bitstream q such that q=F(q).

We can make this clearer if we return to circuits. F(p) can be thought of as a circuit that takes p as input and outputs some value. We build the circuit using only boolean logic gates and delay-latch. We allow feedback loops, but only ones that go through delay-latches. With these restrictions it's pretty clear that the circuit is well-behaved and deterministically outputs a bitstream.

We also have the Löb property:

4. □(□p → p) → □p = -1

We can see this by examining the definition of □. Intuitively it says something like "once □ has seen a 0 input then no amount of setting input bits to 1 later in the stream make any different to its output".

I hope you've noticed something curious. These properties are extremely close to the properties of in GL. In fact, these electronic circuits form a model of the part of GL that doesn't involve variable names, i.e. what's known as letterless GL. We can formalise this:

1. Map to a wire set to 0, which outputs ...000 = 0.

2. Map to a wire set to 1, which outputs ...111 = -1.

3. Map , where is a binary connective, by creating a circuit that takes the outputs from the circuits for and and passes them into the corresponding boolean logic gate.

4. Map to the circuit for piped through a delay-latch.

For example, let's convert into a circuit. I'm translating to the circuit for .

I'm using red wires to mean wires carrying the value 1 rather than 0. I hope you can see that this circuit eventually settles into a state that outputs nothing but 1s.

We have this neat result:

Because delay-latch satisfies the same equations as in provability logic, any theorem, translated into a circuit, will produce a bistream of just 1s, i.e. -1.

But here's a more surprising result: the converse is true.

If the circuit corresponding to a letterless GL proposition produces a bistream of just 1s then the proposition is actually a theorem of GL.
I'm not going to prove this. (It's actually a disguised form of lemma 7.4 on p.95 of Boolos' book.) In the pictured example we got ...1111, so the circuit represents a theorem. As it represents Löb's theorem for the special case we should hope so. More generally, any bitstream that represents an integer can be converted back into a proposition that is equivalent to the original proposition. This means that bitstreams faithfully represent propositions of letterless GL. I'm not going to give the translation here but it's effectively given in Chapter 7 of Boolos. I'll use to represent the translation from propositions to bitstreams via circuits that I described above. Use to represent the translation of bitstream back into propositions. We have . But I haven't given a full description of and I haven't proved here that it has this property.

Circuits with feedback

In the previous section I considered letterless propositions of GL. When these are translated into circuits they don't have feedback loops. But we can also "solve equations" in GL using circuits with feedback. The GL fixed point theorem above says that we can "solve" the equation , with one letter , to produce a letterless proposition such that . Note here that is a letter in the language of GL. But I'm using to represent a proposition in letterless GL. If we build a circuit to represent , and feed its output back into where appears, then the output bitstream represents the fixed point. Here's a translation of the equation :

I'll let you try to convince yourself that such circuits always eventually output all 0's or all 1's. When we run the circuit we get the output ...1111000 = -8. As this is not -1 we know that the fixed point isn't a theorem. If I'd defined above you could use it to turn the bitstream back into a proposition.

The same, syntactically (optional section)

I have a Haskell library on github for working with GL: provability. This uses a syntactic approach and checks propositions for theoremhood using a tableau method. We can use it to analyse the above example with feedback. I have implemented a function, currently called value', to perform the evaluation of the bitstream for a proposition. However, in this case the fixedpoint function computes the fixed point proposition first and then converts to a bitstream rather than computing the bitstream directly from the circuit for F:

> let f p = Neg (Box p \/ Box (Box (Box p)))
> let Just p = fixedpoint f
> p
Dia T /\ Dia (Dia T /\ Dia (Dia T /\ Dia T))
> value' p

(Note that Dia p means .)

The function fixedpoint does a lot of work under the hood. (It uses a tableau method to carry out Craig interpolation.) The circuit approach requires far less work.


1. Programs that reason about themselves

In principle we can write a program that enumerates all theorems of PA. That means we can use a quine trick to write a computer program that searches for a proof, in PA, of its own termination. Does such a program terminate?

We can answer this with Löb's theorem. Let "The program terminates". The program terminates if it can prove its termination. Formally this means we assume . Using one of the derivation rules of GL we get . Löb's theorem now gives us . Feed that back into our original hypothesis and we get . In other words, we deduce that our program does in fact terminate. (Thanks to Sridhar Ramesh for pointing this out to me.)

But we can deduce this using a circuit. We want a solution to . Here's the corresponding circuit:

It starts by outputting 1's and doesn't stop. In other words, the fixed point is a theorem. And that tells us is a theorem. And hence that the program terminates.

2. Robots who reason about each others play in Prisoner's Dilemma

For the background to this problem see Robust Cooperation in the Prisoner's Dilemma at LessWrong. We have two robot participants and playing Prisoner's Dilemma. Each can examine the other's source code and can search for proofs that the opponent will cooperate. Suppose each robot is programmed to enumerate all proofs of PA and cooperate if it finds a proof that its opponent will cooperate. Here we have "A will cooperate" and "B will cooperate". Our assumptions about the behaviour of the robots are and , and hence that . This corresponds to the circuit:

This outputs ...1111 = -1 so we can conclude and hence that these programs will cooperate. (Note that this doesn't work out nicely if robot B has a program that doesn't terminate but whose termination isn't provable in the formal system A is using. That means this approach is only good for robots that want to cooperate and want to confirm such cooperation. See the paper for more on this.)

At this point I really must emphasise that these applications are deceptively simple. I've shown how these simple circuits can answer some tricky problems about provability. But these aren't simply the usual translations from boolean algebra to logic gates. They work because circuits with delay-latch provide a model for letterless provability logic and that's only the case because of a lot of non-trivial theorem proving in Boolos that I haven't reproduced here. You're only allowed to use these simple circuits once you've seen the real proofs :-)

Things I didn't say above

1. I described the translation from propositions to circuits that I called above. But I didn't tell you what looks like. I'll leave this as an exercise. (Hint: consider the output from the translation of into a circuit.)

2. The integers, considered as bistreams, with the bitwise operators, and the unary operator □p=p^(p+1), form an algebraic structure. For example, if we define ⋄p=~□~p we have a Magari algebra. Structures like these are intended to capture the essential parts of self-referential arguments in an algebraic way.

3. Because of the interpretation of □ as a delayed latch in a circuit you could view it as saying "my input was always true until a moment ago". This surely embeds provability logic in a temporal logic of some sort.

4. (Deleted speculations about tit-for-tat that need rethinking.)

5. For even the most complex letterless proposition in Boolos you could check its theoremhood with a pretty small circuit. You could even consider doing this with a steam powered pneumatic circuit. I had to say that to fulfil a prophecy and maintain the integrity of the timeline.

Appendix on provability

The modern notion of a proof is that it is a string of symbols generated from some initial strings called "axioms" and some derivation rules that make new strings from both axioms and strings you've derived previously. Usually we pick axioms that represent "self-evident" truths and we pick derivation rules that are "truth-preserving" so that every proof ends at a true proposition of which it is a proof. The derivation rules are mechanical in nature: things like "if you have this symbol here and that symbol there then you can replace this symbol with that string you derived earlier" etc.

You can represent strings of symbols using numbers, so-called Gödel numbers. Let's pick a minimal mathematical framework for working with numbers: Peano Arithmetic, aka PA. Let's assume we've made some choice of Gödel numbering scheme and when is a proposition, write for the number representing . You can represent the mechanical derivation rules as operations on numbers. And that makes it possible to define a mathematical predicate that is true if and only if its argument represents a provable proposition.

In other words, we can prove using PA if and only if is a proposition provable in PA.

The predicate has some useful properties:

1.If we can prove , then we can prove .

We take the steps we used to prove , and convert everything to propositions about numbers. If is defined correctly then we can convert that sequence of numbers into a sequence of propositions about those numbers that makes up a proof of .

2. and imply

A fundamental step in any proof is modus ponens, i.e. that and implies . If does its job correctly then it had better know about this.

3. implies

One way is to prove this is to use Löb's theorem.


The trivially true statement had better be provable or is broken.

Constructing is conceptually straightforward but hard work. I'm definitely not going to do it here.

And there's one last thing we need: self-reference. If is a proposition, how can we possibly assert without squeezing a copy of inside ? I'm not going to do that here either - just mention that we can use a variation of quining to achieve this. That allows us to form a proposition for which we can prove . In fact, we can go further. We can find propositions that solve for any predicate built from the usual boolean operations and as long as all of the occurrences of are inside the appearances of . Even though we can't form a proposition that directly asserts its own falsity, we can form one that asserts that it is unprovable, or one that asserts that you can't prove that you can't prove that you can prove it, or anything along those lines.

Anyway, all that and business is a lot of hassle. Provability logic, also known as GL, is intended to capture specifically the parts of PA that relate to provability. GL is propositional calculus extended with the provability operator . The intention is that if is a proposition, is a proposition in GL that represents in PA. The properties of above become the axioms and derivation rules of GL in the main text.