tag:blogger.com,1999:blog-112951322020-10-19T11:46:30.890-07:00A Neighborhood of Infinitysigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.comBlogger299125tag:blogger.com,1999:blog-11295132.post-27104240737773249232020-09-05T12:45:00.003-07:002020-09-05T12:56:49.698-07:00Some pointers to things not in this blog<h2 style="text-align: left;">Some pointers to things not in this blog</h2><p style="text-align: left;">One reason I haven't blogged much recently is that my tolerance for blogger.com has reached its limit and I've been too lazy to build my own platform supporting mathematics and code. (For example, I can't get previewing on blogger to work today so I'm just publishing this and hope the reformatting is acceptable.) But that doesn't mean I haven't posted stuff publicly. So here are some thematically related links to things I've written on github and colab.</p><span><a name='more'></a></span><h3 style="text-align: left;">Continuations, effects and runners</h3><p><a href=" https://dpiponi.github.io/cont.html">How to slice your code into continuations</a></p><p><a href="https://colab.research.google.com/drive/1HGs59anVC2AOsmt7C4v8yD6v8gZSJGm6">Handling Effects with Jax</a><br /></p><p><span face="" style="background-color: white; caret-color: rgb(36, 41, 46); color: #24292e; font-size: 14px;"><a href="https://gist.github.com/dpiponi/5ee0c535c6f60bf80b11ed63c4cce974">Are these runners?</a><br /></span>Just a little snippet of code to illustrate how Python's coroutines can be used to support composable runners. 
See <a href="http://math.andrej.com/2019/10/28/runners-in-action/">Runners in Action</a><br />(The answer is yes.)</p><span><!--more--></span><h3 style="text-align: left;">Parallel audio</h3><div><a href="https://colab.research.google.com/drive/1Ghi4r1tSlDFmPbslj7gQdFK-BBd4Ugmf">Parallel audio synthesis</a></div><div><br /></div><div><a href="https://colab.research.google.com/drive/1IU3czYs9rS_QnEo6qEb2-gR8ILM9auli">Simulating Plucked Strings with TensorFlow</a></div><div><br /></div><span><!--more--></span><div><br /></div><div>FWIW I think Colab might be my favourite place to share stuff publicly if it supported environments other than Python.</div>sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com0tag:blogger.com,1999:blog-11295132.post-74512521781408341902018-12-08T12:47:00.000-08:002018-12-08T12:47:06.793-08:00Why is nuclear fusion so hard?<h3>Why does water fall out of an inverted cup?</h3><div>Before considering nuclear fusion, let's consider something much more familiar. If you turn a cup full of water upside down, the water falls out. Why is this? It seems obvious: with nothing supporting the water from below, gravity pulls the water out of the cup. But let's look a little more closely at the left side of Figure 1. We have water in an inverted cup under the influence of gravity. But there is air pressure at the surface of the water. At sea level, atmospheric pressure is about the weight of 1kg per cm<sup>2</sup>, easily enough to keep the water in the cup. So now we still have our question: why does it come out?<br /><br />Consider the diagram on the right of Figure 1. I have drawn some ripples on the surface of the water. Air pressure provides a force perpendicular to the water surface which means that around the ripples we no longer have a force pointing straight up. The force points partly sideways and this serves to deform the shape of the water surface. 
But as the water surface becomes even more deformed the forces become even more distorted away from vertical causing a feedback loop. So we can expect even the tiniest of ripples to grow to the point where the water completely changes shape and it eventually deforms its way out of the glass.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-8iQPuUZ5dB0/XAwTKQTbTbI/AAAAAAAB8-s/oehmFlwXdQodM19WiEtdtfOlYEiw3OBMQCLcBGAs/s1600/cups-4.png" imageanchor="1"><img border="0" data-original-height="862" data-original-width="1396" height="393" src="https://4.bp.blogspot.com/-8iQPuUZ5dB0/XAwTKQTbTbI/AAAAAAAB8-s/oehmFlwXdQodM19WiEtdtfOlYEiw3OBMQCLcBGAs/s640/cups-4.png" width="640" /></a></div><div style="text-align: center;">Figure 1<br /><br /></div><div class="separator" style="clear: both; text-align: left;">This is an example of an instability - in this case it's called the <a href="https://en.wikipedia.org/wiki/Rayleigh%E2%80%93Taylor_instability">Rayleigh-Taylor instability</a>. And this suggests an approach to keeping water in an inverted cup - suppress the ripples. One way to do this is to fill the cup to the top and place a piece of paper over the surface. Here is an example on YouTube:</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div style="text-align: center;"><iframe width="320" height="266" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/65T4ReLkjCg/0.jpg" src="https://www.youtube.com/embed/65T4ReLkjCg?feature=player_embedded" frameborder="0" allowfullscreen></iframe></div><br /><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">Another approach is to apply thickening agents to the water to dampen the ripples. 
Here's an example showing how hard it is to get jello out of a bowl:</div><div class="separator" style="clear: both; text-align: center;"><br /></div><div style="text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/cDhONSaIYAg/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/cDhONSaIYAg?feature=player_embedded" width="320"></iframe></div><h3 style="text-align: left;">Nuclear fusion</h3><div class="separator" style="clear: both; text-align: left;">And now I can discuss one of the challenges facing controlled nuclear fusion. When two suitable atomic nuclei are made to fuse they release energy. But nuclei are positively charged and repel. So in order to cause nuclei to fuse we need to overcome this repulsion, typically by making them collide at very high velocities. To achieve this in useful numbers one way is to heat a suitable fuel to a high temperature in a small enough volume. Temperatures of at least 10,000,000K are required, and possibly orders of magnitude higher. We know how to do the heating, but we also have to confine the fuel - which at these temperatures takes the form of a plasma because its nuclei and electrons become separated. We can't simply use a solid container to hold the plasma as no container can survive such temperatures. Instead we need some other confinement method such as <a href="https://en.wikipedia.org/wiki/Magnetic_confinement_fusion">magnetic confinement,</a> where we use magnetic fields to control the plasma, or <a href="https://en.wikipedia.org/wiki/Inertial_confinement_fusion">inertial confinement</a> where we use explosions around the fuel to drive the fuel into a small volume. Both of these suffer from the same problem: we're using a force to control a fluid and this is subject to instabilities. Confining a plasma is like trying to keep water in an inverted cup by blowing on the water with fans. It's really hard. 
What's more, plasma has incredibly complex dynamics that is subject to many different kinds of instability. Around fifty <a href="https://en.wikipedia.org/wiki/Plasma_stability#List_of_plasma_instabilities">named instabilities</a> are listed on Wikipedia. Here is an example of one in a real plasma. It is a kink instability that causes a line of plasma with a current through it to start acquiring a corkscrew shape so that it eventually tears itself apart:</div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/Fvhg3mbcvAE/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/Fvhg3mbcvAE?feature=player_embedded" width="320"></iframe></div><div style="text-align: center;"><br /></div><div style="text-align: center;"><div style="text-align: left;">And that's one of the biggest challenges facing fusion energy today: we don't know how to keep a plasma stable for long enough.</div></div><h3 style="text-align: left;">Is there any hope for fusion?</h3><div style="text-align: center;"><div style="text-align: left;">The description above is qualitative. To determine how severe any particular instability is we need to run experiments or use some theory. Some kinds of instability are slow enough that we can control them. For example bicycles are unstable, but many of us eventually learn to ride them. In a thermonuclear weapon stability is only required for a short amount of time, something that was achievable back in the 50s. And sometimes the feedback loops in the physics are less troublesome than might be expected so that Tokamaks (and other configurations) have operating modes that turn out to be relatively stable, for example the "<a href="https://en.wikipedia.org/wiki/High-confinement_mode">H-mode</a>". 
So maybe we'll have working fusion reactors in 20 years, or 30 years, or is it 50?</div></div>sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com0tag:blogger.com,1999:blog-11295132.post-34988775472037936922018-10-20T16:22:00.000-07:002018-10-20T16:22:50.347-07:00Running from the past<BR><b>Preface</b><p>Functional programming encourages us to program without mutable state. Instead we compose functions that can be viewed as state transformers. It's a change of perspective that can have a big impact on how we reason about our code. But it's also a change of perspective that can be useful in mathematics and I'd like to give an example: a really beautiful technique that allows you to sample from the infinite limit of a probability distribution without needing an infinite number of operations. (Unless you're infinitely unlucky!) <P><BR><BR><b>Markov Chains</b><p>A <a href="https://en.wikipedia.org/wiki/Markov_chain">Markov chain</a> is a sequence of random states where each state is drawn from a random distribution that possibly depends on the previous state, but not on any earlier state. So it is a sequence <img src="https://chart.googleapis.com/chart?cht=tx&chl=X_0%2C%20X_1%2C%20X_2%2C%20%5Cldots" style="vertical-align:middle"> such that <img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28X_%7Bi%2B1%7D%3Dx%7CX_0%2CX_1%2C%5Cldots%2CX_i%29%20%3D%20P%28X_%7Bi%2B1%7D%3Dx%7CX_i%29" style="vertical-align:middle"> for all <img src="https://chart.googleapis.com/chart?cht=tx&chl=i%5Cge0" style="vertical-align:middle">. A basic example might be a model of the weather in which each day is either sunny or rainy but where it's more likely to be rainy (or sunny) if the previous day was rainy (or sunny). (And to be technically correct: having information about the weather two or more days ago doesn't help us if we know yesterday's weather.) <P><BR>Like imperative code, this description is stateful. 
The state at step <img src="https://chart.googleapis.com/chart?cht=tx&chl=i%2B1" style="vertical-align:middle"> depends on the state at step <img src="https://chart.googleapis.com/chart?cht=tx&chl=i" style="vertical-align:middle">. Probability is often easier to reason about when we work with independent identically drawn random variables and our <img src="https://chart.googleapis.com/chart?cht=tx&chl=X_i" style="vertical-align:middle"> aren't of this type. But we can eliminate the state from our description using the same method used by functional programmers. <P><BR>Let's choose a Markov chain to play with. I'll pick one with 3 states called <img src="https://chart.googleapis.com/chart?cht=tx&chl=A" style="vertical-align:middle">, <img src="https://chart.googleapis.com/chart?cht=tx&chl=B" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=C" style="vertical-align:middle"> and with transition probabilities given by <img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28X_%7Bi%2B1%7D%3Dy%7CX_i%3Dx%29%3DT_%7Bxy%7D" style="vertical-align:middle"> where <img src="https://chart.googleapis.com/chart?cht=tx&chl=T%3D%5Cbegin%7Bpmatrix%7D%0A%5Cfrac%7B1%7D%7B2%7D%26%20%5Cfrac%7B1%7D%7B2%7D%26%200%5C%5C%0A%5Cfrac%7B1%7D%7B3%7D%26%20%5Cfrac%7B1%7D%7B3%7D%26%20%5Cfrac%7B1%7D%7B3%7D%5C%5C%0A0%26%20%5Cfrac%7B1%7D%7B2%7D%26%20%5Cfrac%7B1%7D%7B2%7D%5C%5C%0A%5Cend%7Bpmatrix%7D%0A" style="vertical-align:middle"><P><BR>Here's a diagram illustrating our states: <pre><br /><a href="https://1.bp.blogspot.com/-0EhvUnytTgI/W8uq-vPdTKI/AAAAAAAB8d0/N_SfRTwn9iERwrC8aNE6BcSx7YUuCu-jgCLcBGAs/s1600/markov.gif" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-0EhvUnytTgI/W8uq-vPdTKI/AAAAAAAB8d0/N_SfRTwn9iERwrC8aNE6BcSx7YUuCu-jgCLcBGAs/s640/markov.gif" width="640" height="226" data-original-width="800" data-original-height="282" /></a><br /></pre><P><BR><BR><b>Implementation</b><p>First some imports: <P><BR><pre><br />> {-# LANGUAGE 
LambdaCase #-}<br />> {-# LANGUAGE TypeApplications #-}<br /><P><BR><br />> import Data.Sequence(replicateA)<br />> import System.Random<br />> import Control.Monad.State<br />> import Control.Monad<br />> import Data.List<br />> import Data.Array<br /><P><BR><br /></pre>And now the type of our random variable: <P><BR><pre><br />> data ABC = A | B | C deriving (Eq, Show, Ord, Enum, Bounded)<br /><P><BR><br /></pre>We are now in a position to simulate our Markov chain. First we need some random numbers drawn uniformly from [0, 1): <P><BR><pre><br />> uniform :: (RandomGen gen, MonadState gen m) => m Double<br />> uniform = state random<br /><P><BR><br /></pre>And now the code to take a single step in the Markov chain: <P><BR><pre><br />> step :: (RandomGen gen, MonadState gen m) => ABC -> m ABC<br />> step A = do<br />>     a <- uniform<br />>     if a < 0.5<br />>         then return A<br />>         else return B<br />> step B = do<br />>     a <- uniform<br />>     if a < 1/3.0<br />>         then return A<br />>         else if a < 2/3.0<br />>             then return B<br />>             else return C<br />> step C = do<br />>     a <- uniform<br />>     if a < 0.5<br />>         then return B<br />>         else return C<br /><P><BR><br /></pre>Notice how the <tt>step</tt> function generates a new state at random in a way that depends on the previous state. The <tt>m ABC</tt> in the type signature makes it clear that we are generating random states at each step. 
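<P><BR>As a rough illustration of what <tt>step</tt> is doing, here is a standalone sketch (not part of the literate program in this post) that reads the next state off the rows of the transition matrix. The names <tt>row</tt>, <tt>pick</tt> and <tt>stepWith</tt> are made up for this example, and the uniform draw is passed in as an explicit argument so that no random number library is needed: <P><BR><pre>
data ABC = A | B | C deriving (Eq, Show, Ord, Enum, Bounded)

-- One row of the transition matrix T per current state (hypothetical helper).
row :: ABC -> [(ABC, Double)]
row A = [(A, 1/2), (B, 1/2)]
row B = [(A, 1/3), (B, 1/3), (C, 1/3)]
row C = [(B, 1/2), (C, 1/2)]

-- Walk along a row, subtracting probabilities until the draw u falls
-- inside an entry. The last entry is returned unconditionally so that
-- rounding error can never run us off the end of the list.
pick :: Double -> [(a, Double)] -> a
pick _ [(x, _)] = x
pick u ((x, p) : rest)
  | u < p     = x
  | otherwise = pick (u - p) rest
pick _ [] = error "pick: empty distribution"

-- A single step of the chain, given a draw u assumed to lie in [0, 1).
stepWith :: Double -> ABC -> ABC
stepWith u s = pick u (row s)
</pre>Given the same draw, <tt>stepWith u</tt> should pick the same branch as the corresponding case of <tt>step</tt>, except possibly for draws within rounding error of a threshold. Writing the rows out as data makes it easier to compare the code against the matrix.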
<P><BR>We can simulate the effect of taking <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> steps with a function like this: <P><BR><pre><br />> steps :: (RandomGen gen, MonadState gen m) => Int -> ABC -> m ABC<br />> steps 0 i = return i<br />> steps n i = do<br />>     i <- steps (n-1) i<br />>     step i<br /><P><BR><br /></pre>We can run for 3 steps, starting with <img src="https://chart.googleapis.com/chart?cht=tx&chl=A" style="vertical-align:middle">, with a line like so: <P><BR><pre><br />*Main> evalState (steps 3 A) gen<br />B<br /><P><BR><br /></pre>The starting state of our random number generator is given by <tt>gen</tt>. <P><BR>Consider the distribution of states after taking <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> steps. For Markov chains of this type, we know that as <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> goes to infinity the distribution of the <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle">th state approaches a limiting "stationary" distribution. There are frequently times when we want to sample from this final distribution. For a Markov chain as simple as this example, you can solve exactly to find the limiting distribution. But for real world problems this can be intractable. Instead, a popular solution is to pick a large <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> and hope it's large enough. As <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> gets larger the distribution gets closer to the limiting distribution. And that's the problem I want to solve here - sampling from the limit. It turns out that by thinking about random functions instead of random states we can actually sample from the limiting distribution exactly. 
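<P><BR>For this three-state chain the exact solution is small enough to check directly. The following standalone sketch (the names <tt>t</tt>, <tt>stepDist</tt>, <tt>stationary</tt> and <tt>isStationary</tt> are made up for illustration) uses exact rational arithmetic to confirm that the distribution (2/7, 3/7, 2/7) over (A, B, C) is left unchanged by one step of the chain: <P><BR><pre>
import Data.List (transpose)
import Data.Ratio ((%))

-- Transition matrix T from the text; rows are indexed by the current state A, B, C.
t :: [[Rational]]
t = [ [1 % 2, 1 % 2, 0]
    , [1 % 3, 1 % 3, 1 % 3]
    , [0,     1 % 2, 1 % 2] ]

-- Push a distribution (a row vector v) through one step of the chain: v T.
stepDist :: [Rational] -> [Rational]
stepDist v = [ sum (zipWith (*) v col) | col <- transpose t ]

-- Candidate stationary distribution over (A, B, C).
stationary :: [Rational]
stationary = [2 % 7, 3 % 7, 2 % 7]

-- True when the candidate sums to 1 and is fixed by one step of the chain.
isStationary :: Bool
isStationary = and [sum stationary == 1, stepDist stationary == stationary]
</pre>This only verifies a guess rather than deriving it, and it relies on being able to write down the whole matrix, which is exactly what becomes impractical for large state spaces.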
<P><BR><BR><b>Some random functions</b><p><P><BR>Here is a new version of our random step function: <P><BR><pre><br />> step' :: (RandomGen gen, MonadState gen m) => m (ABC -> ABC)<br />> step' = do<br />>     a <- uniform<br />>     return $ \case<br />>         A -> if a < 0.5 then A else B<br />>         B -> if a < 1/3.0<br />>                 then A<br />>                 else if a < 2/3.0 then B else C<br />>         C -> if a < 0.5 then B else C<br /><P><BR><br /></pre>In many ways it's similar to the previous one. But there's one very big difference: the type signature <tt>m (ABC -> ABC)</tt> tells us that it's returning a random function, not a random state. We can simulate the result of taking 10 steps, say, by drawing 10 random functions, composing them, and applying the result to our initial state: <P><BR><pre><br />> steps' :: (RandomGen gen, MonadState gen m) => Int -> m (ABC -> ABC)<br />> steps' n = do<br />>     fs <- replicateA n step'<br />>     return $ foldr (flip (.)) id fs<br /><P><BR><br /></pre>Notice the use of <tt>flip</tt>. We want to compose functions <img src="https://chart.googleapis.com/chart?cht=tx&chl=f_%7Bn-1%7D%5Ccirc%20f_%7Bn-2%7D%5Ccirc%5Cldots%5Ccirc%20f_0" style="vertical-align:middle">, each time composing on the left by the new <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle">. This means that for a fixed seed <tt>gen</tt>, each time you increase <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> by 1 you get the next step in a single simulation: (BTW I used <tt>replicateA</tt> instead of <tt>replicateM</tt> to indicate that these are independent random draws. It may be well known that you can use <tt>Applicative</tt> instead of <tt>Monad</tt> to indicate independence but I haven't seen it written down.) <P><BR><pre><br />*Main> [f A | n <- [0..10], let f = evalState (steps' n) gen]<br />[A,A,A,B,C,B,A,B,A,B,C]<br /><P><BR><br /></pre>When I first implemented this I accidentally forgot the <tt>flip</tt>. 
So maybe you're wondering what effect removing the <tt>flip</tt> has? The effect is about as close to a miracle as I've seen in mathematics. It allows us to sample from the limiting distribution in a finite number of steps! <P><BR>Here's the code: <P><BR><pre><br />> steps_from_past :: (RandomGen gen, MonadState gen m) => Int -> m (ABC -> ABC)<br />> steps_from_past n = do<br />>     fs <- replicateA n step'<br />>     return $ foldr (.) id fs<br /><P><BR><br /></pre>We end up building <img src="https://chart.googleapis.com/chart?cht=tx&chl=f_%7B0%7D" style="vertical-align:middle"> <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ccirc" style="vertical-align:middle"> <img src="https://chart.googleapis.com/chart?cht=tx&chl=f_%7B1%7D" style="vertical-align:middle"> <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cldots" style="vertical-align:middle"> <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ccirc" style="vertical-align:middle"> <img src="https://chart.googleapis.com/chart?cht=tx&chl=f_%7Bn-1%7D" style="vertical-align:middle">. This is still a composition of <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> independent identically distributed functions and so it's still drawing from exactly the same distribution as <tt>steps'</tt>. Nonetheless, there is a difference: for a particular choice of seed, <tt>steps_from_past n</tt> no longer gives us a sequence of states from a Markov chain. Running with argument <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> draws a random composition of <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> functions. But if you increase <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> by 1 you don't add a new step at the end. Instead you effectively restart the Markov chain with a new first step generated by a new random seed. 
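<P><BR>The difference between the two composition orders can be seen without any randomness at all. Here is a standalone sketch in which the uniform draws are supplied as an explicit list. The names <tt>stepFn</tt>, <tt>forward</tt>, <tt>backward</tt> and <tt>images</tt> are made up for this illustration, and <tt>stepFn</tt> is intended to mirror <tt>step'</tt> with the draw made explicit: <P><BR><pre>
data ABC = A | B | C deriving (Eq, Show, Ord, Enum, Bounded)

-- The function that step' would return for a given uniform draw u.
stepFn :: Double -> ABC -> ABC
stepFn u A = if u < 0.5 then A else B
stepFn u B
  | u < 1/3   = A
  | u < 2/3   = B
  | otherwise = C
stepFn u C = if u < 0.5 then B else C

-- f_{n-1} . ... . f_0: the newest draw is applied last (the version with flip).
forward :: [Double] -> ABC -> ABC
forward us = foldr (flip (.)) id (map stepFn us)

-- f_0 . ... . f_{n-1}: the newest draw is applied first (the version without flip).
backward :: [Double] -> ABC -> ABC
backward us = foldr (.) id (map stepFn us)

-- The images of A, B and C under a function.
images :: (ABC -> ABC) -> [ABC]
images f = map f [minBound .. maxBound]
</pre>With the draws <tt>[0.1, 0.1]</tt> both compositions send every state to <tt>A</tt>. Appending a third draw of <tt>0.9</tt> leaves <tt>images (backward [0.1, 0.1, 0.9])</tt> at <tt>[A,A,A]</tt>, because the new function is applied before the ones that have already coalesced, whereas <tt>images (forward [0.1, 0.1, 0.9])</tt> moves on to <tt>[B,B,B]</tt>.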
<P><BR>Try it and see: <P><BR><pre><br />*Main> [f A | n <- [0..10], let f = evalState (steps_from_past n) gen]<br />[A,A,A,A,A,A,A,A,A,A,A]<br /><P><BR><br /></pre>Maybe that's surprising. It seems to get stuck in one state. In fact, we can try applying the resulting function to all three states. <P><BR><pre><br />*Main> [fmap f [A, B, C] | n <- [0..10], let f = evalState (steps_from_past n) gen]<br />[[A,B,C],[A,A,B],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A],[A,A,A]]<br /><P><BR><br /></pre>In other words, for <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> large enough we get the constant function. <P><BR>Think of it this way: If <tt>f</tt> isn't injective then it's possible that two states get collapsed to the same state. If you keep picking random <tt>f</tt>'s it's inevitable that you will eventually collapse down to the point where all arguments get mapped to the same state. Once this happens, we'll get the same result no matter how large we take <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle">. If we can detect this then we've found the limit of <img src="https://chart.googleapis.com/chart?cht=tx&chl=f_%7B0%7D" style="vertical-align:middle"> <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ccirc" style="vertical-align:middle"> <img src="https://chart.googleapis.com/chart?cht=tx&chl=f_%7B1%7D" style="vertical-align:middle"> <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cldots" style="vertical-align:middle"> <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ccirc" style="vertical-align:middle"> <img src="https://chart.googleapis.com/chart?cht=tx&chl=f_%7Bn-1%7D" style="vertical-align:middle"> as <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> goes to infinity. 
But because we know composing forwards and composing backwards lead to draws from the same distribution, the limiting backward composition must actually be a draw from the same distribution as the limiting forward composition. That <tt>flip</tt> can't change what probability distribution we're drawing from - just the dependence on the seed. So the value the constant function takes is actually a draw from the limiting stationary distribution. <P><BR>We can code this up: <P><BR><pre><br />> all_equal :: (Eq a) => [a] -> Bool<br />> all_equal [] = True<br />> all_equal [_] = True<br />> all_equal (a : as) = all (== a) as<br /><P><BR><br />> test_constant :: (Bounded a, Enum a, Eq a) => (a -> a) -> Bool<br />> test_constant f =<br />>     all_equal $ map f $ enumFromTo minBound maxBound<br /><P><BR><br /></pre>This technique is called coupling from the past. It's "coupling" because we've arranged that different starting points coalesce. And it's "from the past" because we're essentially answering the question of what the outcome of a simulation would be if we started infinitely far in the past. <P><BR><pre><br />> couple_from_past :: (RandomGen gen, MonadState gen m, Enum a, Bounded a, Eq a) =><br />>                     m (a -> a) -> (a -> a) -> m (a -> a)<br />> couple_from_past step f = do<br />>     if test_constant f<br />>         then return f<br />>         else do<br />>             f' <- step<br />>             couple_from_past step (f . 
f')<br /><P><BR><br /></pre>We can now sample from the limiting distribution a million times, say: <P><BR><pre><br />*Main> let samples = map ($ A) $ evalState (replicateA 1000000 (couple_from_past step' id)) gen<br /><P><BR><br /></pre>We can now count how often <tt>A</tt> appears: <P><BR><pre><br />*Main> fromIntegral (length $ filter (== A) samples)/1000000<br />0.285748<br /><P><BR><br /></pre>That's a pretty good approximation to <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cfrac%7B2%7D%7B7%7D" style="vertical-align:middle">, the exact answer that can be found by finding the eigenvector of the transition matrix corresponding to an eigenvalue of 1. <P><BR><pre><br />> gen = mkStdGen 669<br /><P><BR><br /></pre><BR><b>Notes</b><p>The technique of <a href="https://en.wikipedia.org/wiki/Coupling_from_the_past">coupling from the past</a> first appeared in a paper by Propp and Wilson. The paper <a href="https://www.stat.berkeley.edu/~aldous/205B/511.pdf">Iterated Random Functions</a> by Persi Diaconis gave me a lot of insight into it. Note that the code above is absolutely not how you'd implement this for real. I wrote the code that way so that I could switch algorithm with the simple removal of a <tt>flip</tt>. In fact, with some clever tricks you can make this method work with state spaces so large that you couldn't possibly hope to enumerate all starting states to detect if convergence has occurred. Or even with uncountably large state spaces. But I'll let you read the Propp-Wilson paper to find out how. sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com0tag:blogger.com,1999:blog-11295132.post-66112247982330166262017-10-14T13:42:00.000-07:002017-10-14T14:02:01.380-07:00A tail we don't need to wag<BR><b>Introduction</b><p>I've been reading a little about <a href="https://en.wikipedia.org/wiki/Concentration_inequality">concentration inequalities</a> recently. 
I thought it would be nice to see if you can use the key idea, if not the actual theorems, to reduce the complexity of computing the probability distribution of the outcome of stochastic simulations. Examples might include <a href="https://en.wikipedia.org/wiki/Random_walk">random walks</a>, or <a href="https://en.wikipedia.org/wiki/Queueing_theory">queues</a>. <P><BR>The key idea behind concentration inequalities is that very often most of the probability is owned by a small proportion of the possible outcomes. For example, if we toss a fair coin enough (say <img src="https://chart.googleapis.com/chart?cht=tx&chl=N" style="vertical-align:middle">) times we expect the number of heads to lie within <img src="https://chart.googleapis.com/chart?cht=tx&chl=2%5Csqrt%7BN%7D" style="vertical-align:middle"> of the mean about 99.99% of the time despite there being <img src="https://chart.googleapis.com/chart?cht=tx&chl=N%2B1" style="vertical-align:middle"> different total numbers possible. The probable outcomes tend to concentrate around the expectation. On the other hand, if we consider not the total number of heads, but the possible sequences of <img src="https://chart.googleapis.com/chart?cht=tx&chl=N" style="vertical-align:middle"> tosses, there are <img src="https://chart.googleapis.com/chart?cht=tx&chl=2%5EN" style="vertical-align:middle"> possibilities, all equally likely. In this case there is no concentration. So a key ingredient here is a reduction operation: in this case reducing a sequence of tosses to a count of the number that came up heads. This is something we can use in a computer program. <P><BR>I (and many others) have written about the <a href="http://blog.sigfpe.com/2007/03/monads-vector-spaces-and-quantum.html">"vector space" monad</a> that can be used to compute probability distributions of outcomes of simulations and I'll assume some familiarity with that. 
Essentially it is a "weighted list" monad which is similar to the list monad except that in addition to tracking all possible outcomes, it also propagates a probability along each path. Unfortunately it needs to follow through every possible path through a simulation. For example, in the case of simulating <img src="https://chart.googleapis.com/chart?cht=tx&chl=N" style="vertical-align:middle"> coin tosses it needs to track <img src="https://chart.googleapis.com/chart?cht=tx&chl=2%5EN" style="vertical-align:middle"> different possibilities, even though we're only interested in the <img src="https://chart.googleapis.com/chart?cht=tx&chl=N%2B1" style="vertical-align:middle"> possible sums. If, after each bind operation of the monad, we could collect together all paths that give the same total then we could make this code much more efficient. The catch is that to collect together elements of a type the elements need to be comparable, for example instances of <tt>Eq</tt> or <tt>Ord</tt>. This conflicts with the type of <tt>Monad</tt> which requires that we can use the <tt>(>>=) :: m a -> (a -> m b) -> m b</tt> and <tt>return :: a -> m a</tt> functions with any types <tt>a</tt> and <tt>b</tt>. <P><BR>I'm going to deal with this by adapting a technique presented by Oleg Kiselyov for <a href="http://okmij.org/ftp/Haskell/set-monad.html">efficiently implementing the Set monad</a>. Instead of <tt>Set</tt> I'm going to use the <tt>Map</tt> type to represent probability distributions. These will store maps saying, for each element of a type, what the probability of that element is. So part of my code is going to be a direct translation of that code to use the <tt>Map</tt> type instead of the <tt>Set</tt> type. 
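<P><BR>To see what collecting buys us before looking at the real implementation, here is a standalone sketch of the naive weighted list next to a version that merges equal outcomes after every toss. It is not the <tt>P</tt> type used in the rest of this post, and the names <tt>Dist</tt>, <tt>bindD</tt>, <tt>collectD</tt>, <tt>coin</tt>, <tt>naive</tt> and <tt>collected</tt> are made up for illustration: <P><BR><pre>
import qualified Data.Map as M

-- A naive "weighted list" distribution: outcomes paired with probabilities.
type Dist a = [(a, Double)]

-- Bind for the weighted list: follow every path and multiply the weights.
bindD :: Dist a -> (a -> Dist b) -> Dist b
bindD d f = [ (y, p * q) | (x, p) <- d, (y, q) <- f x ]

-- Merge paths that ended at the same outcome. This is the step that needs Ord.
collectD :: Ord a => Dist a -> Dist a
collectD = M.toList . M.fromListWith (+)

-- A fair coin scoring 0 or 1.
coin :: Dist Int
coin = [(0, 0.5), (1, 0.5)]

-- Total of n tosses, tracking every path separately: 2^n entries.
naive :: Int -> Dist Int
naive 0 = [(0, 1)]
naive n = naive (n-1) `bindD` \a -> coin `bindD` \b -> [(a + b, 1)]

-- The same total, collecting after every toss: n+1 entries.
collected :: Int -> Dist Int
collected 0 = [(0, 1)]
collected n =
  collectD (collected (n-1) `bindD` \a -> coin `bindD` \b -> [(a + b, 1)])
</pre>For ten tosses <tt>naive 10</tt> carries 1024 entries while <tt>collected 10</tt> carries 11. The difficulty addressed below is that <tt>collectD</tt> needs an <tt>Ord</tt> constraint, which an ordinary <tt>Monad</tt> instance has no room for.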
<P><BR><pre><br />> {-# LANGUAGE GADTs, FlexibleInstances #-}<br />> {-# LANGUAGE ViewPatterns #-}<br /><P><BR><br />> module Main where<br /><P><BR><br />> import Control.Monad<br />> import Control.Arrow<br />> import qualified Data.Map as M<br />> import qualified Data.List as L<br /><P><BR><br /></pre>The following code is very similar to Oleg's. But for first reading I should point out some differences that I want you to ignore. The type representing a probability distribution is <tt>P</tt>: <P><BR><pre><br />> data P p a where<br />>     POrd :: Ord a => p -> M.Map a p -> P p a<br />>     PAny :: p -> [(a, p)] -> P p a<br /><P><BR><br /></pre>But note how the constructors take two arguments - a number that is a probability, in addition to a weighted <tt>Map</tt> or list. For now pretend that first argument is zero and that the functions called <tt>trimXXX</tt> act similarly to the identity: <P><BR><pre><br />> instance (Ord p, Num p) => Functor (P p) where<br />>     fmap = liftM<br /><P><BR><br />> instance (Ord p, Num p) => Applicative (P p) where<br />>     pure = return<br />>     (<*>) = ap<br /><P><BR><br />> instance (Ord p, Num p) => Monad (P p) where<br />>     return x = PAny 0 [(x, 1)]<br />>     m >>= f = <br />>         let (e, pdf) = unP m<br />>         in trimAdd e $ collect $ map (f *** id) pdf<br /><P><BR><br />> returnP :: (Ord p, Num p, Ord a) => a -> P p a<br />> returnP a = POrd 0 $ M.singleton a 1<br /><P><BR><br />> unP :: P p a -> (p, [(a, p)])<br />> unP (POrd e pdf) = (e, M.toList pdf)<br />> unP (PAny e pdf) = (e, pdf)<br /><P><BR><br />> fromList :: (Num p, Ord a) => [(a, p)] -> M.Map a p<br />> fromList = M.fromListWith (+)<br /><P><BR><br />> union :: (Num p, Ord a) => M.Map a p -> M.Map a p -> M.Map a p<br />> union = M.unionWith (+)<br /><P><BR><br />> scaleList :: Num p => p -> [(a, p)] -> [(a, p)]<br />> scaleList weight = map (id *** (weight *))<br /><P><BR><br />> scaleMap :: (Num p, Ord a) => p -> M.Map a p -> M.Map a p<br />> scaleMap weight = fromList . 
scaleList weight . M.toList<br /><P><BR><br /></pre>This is a translation of Oleg's crucial function that allows us to take a weighted list of probability distributions and flatten them down to a single probability distribution: <P><BR><pre><br />> collect :: Num p => [(P p a, p)] -> P p a<br />> collect [] = PAny 0 []<br />> collect ((POrd e0 pdf0, weight) : rest) =<br />>     let wpdf0 = scaleMap weight pdf0<br />>     in case collect rest of<br />>         POrd e1 pdf1 -> POrd (weight*e0+e1) $ wpdf0 `union` pdf1<br />>         PAny e1 pdf1 -> POrd (weight*e0+e1) $ wpdf0 `union` fromList pdf1<br />> collect ((PAny e0 pdf0, weight) : rest) =<br />>     let wpdf0 = scaleList weight pdf0<br />>     in case collect rest of<br />>         POrd e1 pdf1 -> POrd (weight*e0+e1) $ fromList wpdf0 `union` pdf1<br />>         PAny e1 pdf1 -> PAny (weight*e0+e1) $ wpdf0 ++ pdf1<br /><P><BR><br /></pre>But now I really must explain what the first argument to <tt>POrd</tt> and <tt>PAny</tt> is and why I have all that "trimming". <P><BR>Even though the <tt>collect</tt> function allows us to reduce the number of elements in our PDFs, we'd like to take advantage of concentration of probability to reduce the number even further. The <tt>trim</tt> function keeps only the top <img src="https://chart.googleapis.com/chart?cht=tx&chl=N" style="vertical-align:middle"> probabilities in a PDF, discarding the rest. To be honest, this is the only point worth taking away from what I've written here :-) <P><BR>When we throw away elements of the PDF our probabilities no longer sum to 1. So I use the first argument of the constructors as a convenient place to store the amount of probability that I've thrown away. The <tt>trim</tt> function keeps the most likely <img src="https://chart.googleapis.com/chart?cht=tx&chl=N" style="vertical-align:middle"> outcomes and sums the probability of the remainder. I don't actually need to keep track of what has been discarded. 
In principle we could reconstruct this value by looking at how much the probabilities in our trimmed partial PDFs fall short of summing to 1. But confirming that our discarded probability and our partial PDF sum to 1 gives a nice safety check for our code and can give us some warning if numerical errors start creeping in. I'll call the total discarded probability the <i>tail</i> probability. <P><BR>Here is the core function to keep the top <img src="https://chart.googleapis.com/chart?cht=tx&chl=N" style="vertical-align:middle"> values. In this case <img src="https://chart.googleapis.com/chart?cht=tx&chl=N" style="vertical-align:middle"> is given by a global constant called <tt>trimSize</tt>. (I'll talk about how to do this better later.) <P><BR><pre><br />> trimList :: (Ord p, Num p) => [(a, p)] -> (p, [(a, p)])<br />> trimList ps =<br />>     let (keep, discard) = L.splitAt trimSize (L.sortOn (negate . snd) ps)<br />>     in (sum (map snd discard), keep)<br /><P><BR><br />> trimAdd :: (Ord p, Num p) => p -> P p a -> P p a<br />> trimAdd e' (POrd e pdf) =<br />>     let (f, trimmedPdf) = trimList (M.toList pdf)<br />>     in POrd (e'+e+f) (M.fromList trimmedPdf)<br />> trimAdd e' (PAny e pdf) =<br />>     let (f, trimmedPdf) = trimList pdf<br />>     in PAny (e'+e+f) trimmedPdf<br /><P><BR><br />> runP :: (Num p, Ord a) => P p a -> (p, M.Map a p)<br />> runP (POrd e pdf) = (e, pdf)<br />> runP (PAny e pdf) = (e, fromList pdf)<br /><P><BR><br /></pre>And now some functions representing textbook probability distributions. First the uniform distribution on a finite set. Again this is very similar to Oleg's <tt>chooseOrd</tt> function apart from the fact that it assigns weights to each element: <P><BR><pre><br />> chooseP :: (Fractional p, Ord p, Ord a) =><br />>            [a] -> P p a<br />> chooseP xs = let p = 1/fromIntegral (length xs)<br />>              in POrd 0 $ fromList $ map (flip (,) p) xs<br /><P><BR><br /></pre>And the Bernoulli distribution, i.e. 
tossing a <tt>Bool</tt> coin that comes up <tt>True</tt> with probability <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle">: <P><BR><pre><br />> bernoulliP :: (Fractional p, Ord p) =><br />>               p -> P p Bool<br />> bernoulliP p = POrd 0 $ fromList $ [(False, 1-p), (True, p)]<br /><P><BR><br /></pre>Now we can try a random walk in one dimension. At each step we have a 50/50 chance of standing still or taking a step to the right: <P><BR><pre><br />> random_walk1 :: Int -> P Double Int<br />> random_walk1 0 = returnP 0<br />> random_walk1 n = do<br />>     a <- random_walk1 (n-1)<br />>     b <- chooseP [0, 1]<br />>     returnP $ a+b<br /><P><BR><br /></pre>Below in <tt>main</tt> we take 2048 steps but only track 512 probabilities. The tail probability in this case is about <img src="https://chart.googleapis.com/chart?cht=tx&chl=10%5E%7B-29%7D" style="vertical-align:middle">. So only tracking 1/4 of the outcomes has had almost no impact on the numbers. This also illustrates why it is good to track the tail probabilities rather than inferring them from the missing probabilities in the bulk of the PDF - they can be so small they vanish compared to floating point errors. We can afford to track a lot fewer than 512 (out of 2049 possible) outcomes and still have a good representative PDF. <blockquote><pre><br /><a href="https://2.bp.blogspot.com/-5g2q61iC3nw/Wd-XI9KZq9I/AAAAAAAA2Ao/6Sva9yVlIcsO4ZoETHA2MWvbc_nfxNaggCLcBGAs/s1600/figure_1.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-5g2q61iC3nw/Wd-XI9KZq9I/AAAAAAAA2Ao/6Sva9yVlIcsO4ZoETHA2MWvbc_nfxNaggCLcBGAs/s400/figure_1.png" width="400" height="300" data-original-width="800" data-original-height="600" /></a><br /></pre></blockquote>Now here's a two-dimensional random walk for 32 steps. 
The tail probability is about <img src="https://chart.googleapis.com/chart?cht=tx&chl=2%5Ctimes%2010%5E%7B-5%7D" style="vertical-align:middle"> so we are getting a reasonably representative PDF. We have to run fewer steps than before, however, because the space of possible outcomes spans two dimensions, meaning that reduction doesn't help as much as it does in one dimension. <P><BR><pre><br />> random_walk2 :: Int -> (Int, Int) -> P Double (Int, Int)<br />> random_walk2 0 (x, y) = returnP (x, y)<br />> random_walk2 n (x, y) = do<br />>     (x',y') <- random_walk2 (n-1) (x, y)<br />>     dx <- chooseP [-1, 1]<br />>     dy <- chooseP [-1, 1]<br />>     returnP (x'+dx, y'+dy)<br /><P><BR><br /></pre>One last simulation. This is a queueing scenario. Tasks come in once every tick of the clock. There are four queues a task can be assigned to. A task is assigned to the shortest queue. Meanwhile each queue has a 1/4 probability of clearing one item at each tick of the clock. We build the PDF for the maximum length any queue has at any time. <blockquote><pre><br /><a href="https://1.bp.blogspot.com/-FOEzOGZKNuA/Wd-kGpIp_II/AAAAAAAA2BE/O494hiM3KhUXdv0OUL45itF3TaU9eKNfwCLcBGAs/s1600/Untitled.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-FOEzOGZKNuA/Wd-kGpIp_II/AAAAAAAA2BE/O494hiM3KhUXdv0OUL45itF3TaU9eKNfwCLcBGAs/s400/Untitled.png" width="400" height="276" data-original-width="735" data-original-height="508" /></a><br /></pre></blockquote>The first argument to <tt>queue</tt> is the number of ticks of the clock. The second argument is the list of lengths of the queues. It returns a PDF, not just on the current queue size, but also on the longest queue it has seen. 
<P><BR><pre><br />> queue :: Int -> [Int] -> P Double (Int, [Int])<br />> queue 0 ls = returnP (maximum ls, ls)<br />> queue n ls = do<br />>     (longest, ls1) <- queue (n-1) ls<br />>     ls2 <- forM ls1 $ \l -> do<br />>         served <- bernoulliP (1/4)<br />>         returnP $ if served && l > 0 then l-1 else l<br />>     let ls3 = L.sort $ head ls2+1 : tail ls2<br />>     returnP (longest `max` maximum ls3, ls3)<br /><P><BR><br /></pre>For the queueing simulation the tail probability is around <img src="https://chart.googleapis.com/chart?cht=tx&chl=5%5Ctimes%2010%5E%7B-4%7D" style="vertical-align:middle"> despite the fact that we have discarded a vast set of possible outcomes. <blockquote><pre><br /><a href="https://1.bp.blogspot.com/-MomMNzoE4co/Wd-gtJI4kKI/AAAAAAAA2A4/RkRPJJ_RmHI6LLJVl-Wsh7DGiPQ1fwblwCLcBGAs/s1600/figure_2.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-MomMNzoE4co/Wd-gtJI4kKI/AAAAAAAA2A4/RkRPJJ_RmHI6LLJVl-Wsh7DGiPQ1fwblwCLcBGAs/s400/figure_2.png" width="400" height="300" data-original-width="800" data-original-height="600" /></a><br /></pre></blockquote>It's a little ugly that <tt>trimSize</tt> is a global constant: <P><BR><pre><br />> trimSize = 512<br /><P><BR><br /></pre>The correct solution is probably to separate the probability "syntax" from its "semantics". In other words, we should implement a free monad supporting the language of probability with suitable constructors for <tt>bernoulliP</tt> and <tt>chooseP</tt>. We can then write a separate interpreter which takes a <tt>trimSize</tt> as argument. This has another advantage too: the <tt>Monad</tt> above isn't a true monad. It uses a greedy approach to discarding probabilities, and different rearrangements of the code that ought to give identical results may end up giving different ones. By using a free monad we ensure that our interface is a true monad and we can put the part of the code that breaks the monad laws into the interpreter. 
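<P><BR>To make the syntax/semantics split concrete, here is a minimal sketch under heavy simplifying assumptions. The names <tt>Prob</tt>, <tt>Done</tt>, <tt>Flip</tt>, <tt>flipP</tt> and <tt>interpret</tt> are hypothetical and not part of the code above. Only a Bernoulli primitive is supported, and the interpreter enumerates every outcome before trimming once at the end, so it takes exponential time and is in no sense the efficient version:

```haskell
import qualified Data.Map as M
import Data.List (sortOn)

-- Syntax: a tiny probability language. All names here are hypothetical.
data Prob a
    = Done a                        -- a finished computation
    | Flip Double (Bool -> Prob a)  -- toss a coin that is True with probability p

instance Functor Prob where
    fmap f (Done a)   = Done (f a)
    fmap f (Flip p k) = Flip p (fmap f . k)

instance Applicative Prob where
    pure = Done
    mf <*> mx = mf >>= \f -> fmap f mx

instance Monad Prob where
    Done a   >>= f = f a
    Flip p k >>= f = Flip p (\b -> k b >>= f)

flipP :: Double -> Prob Bool
flipP p = Flip p Done

-- Semantics: enumerate every outcome, merge duplicates, then keep only
-- the n most likely. Returns (discarded tail probability, trimmed PDF).
interpret :: Ord a => Int -> Prob a -> (Double, M.Map a Double)
interpret n m =
    let full            = M.toList (M.fromListWith (+) (go 1 m))
        (keep, discard) = splitAt n (sortOn (negate . snd) full)
    in (sum (map snd discard), M.fromList keep)
  where
    go w (Done a)   = [(a, w)]
    go w (Flip p k) = go (w * (1 - p)) (k False) ++ go (w * p) (k True)

main :: IO ()
main = print $ interpret 2 $ do
    a <- flipP 0.5
    b <- flipP 0.5
    return (fromEnum a + fromEnum b)
```

Because <tt>Prob</tt> only records syntax, it obeys the monad laws, and the lossy trimming lives entirely in <tt>interpret</tt>, which takes the trim size as an ordinary argument. <P><BR>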
The catch is that my first attempt at writing a free monad resulted in code with poor performance. So I'll leave an efficient version as an exercise :-) <P><BR><pre><br />> main = do<br />>     print $ runP $ random_walk1 2048<br />>     print $ runP $ random_walk2 32 (0, 0)<br />>     print $ runP $ do<br />>         (r, _) <- queue 128 [0, 0, 0, 0]<br />>         returnP r<br /></pre>sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com0tag:blogger.com,1999:blog-11295132.post-83054615681596828402017-08-11T20:22:00.000-07:002017-08-11T20:22:12.403-07:00What is a photon?<BR><b>Introduction</b><p>Popular science writing about quantum mechanics leaves many people full of questions about the status of photons. I want to answer some of these without using any tricky mathematics. <P><BR>One of the challenges is that photons are very different to ordinary everyday objects like billiard balls. This is partly because photons are described by quantum mechanics whereas billiard balls are better modelled with classical Newtonian mechanics. Quantum mechanics defies many of our intuitions. But it's also because the word <i>photon</i> plays by different linguistic rules to <i>billiard ball</i>. I hope to explain why. <P><BR>One of my goals is to avoid saying anything original. I'm largely going to remove the mathematics from material I first learnt from three or so courses I took at Cambridge University many years ago: Quantum Mechanics, Solid State Physics and Quantum Field Theory. I also learnt about some of this from David Miller at Stanford University who talked a little about what properties it is meaningful to apply to a photon. (I hope I haven't misrepresented him too badly.) 
<P><BR><BR><b>The simple harmonic oscillator</b><p><P><BR>Here's a mass hanging on a spring: <blockquote><pre><br /><a href="https://3.bp.blogspot.com/-bUIzej5OnU0/WY5xmrDT-hI/AAAAAAAA1Gw/SNXMGNjT9q8PrPDh8GRoQkgpilAtPGpPQCLcBGAs/s1600/figure0.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-bUIzej5OnU0/WY5xmrDT-hI/AAAAAAAA1Gw/SNXMGNjT9q8PrPDh8GRoQkgpilAtPGpPQCLcBGAs/s400/figure0.png" width="178" height="400" data-original-width="160" data-original-height="359" /></a><br /></pre></blockquote><P><BR>Suppose it's initially sitting in equilibrium so that the net force acting on it is zero. Now we lift the mass a small distance and let it go. Because we lifted it, we shortened the spring, reducing its tension. This means the force due to gravity is now more than the spring tension and the mass falls. Eventually it falls below the equilibrium point, increasing the tension in the spring so there is a net force pulling it back up again. To a good approximation, the force restoring the mass to its equilibrium point is proportional to how far it has been displaced. When this happens we end up with oscillating motion where the mass bounces up and down. Here's what a graph of its displacement looks like over time: <P><BR><blockquote><pre><br /><a href="https://1.bp.blogspot.com/-b-GOeLU7VB8/WY5xmupFp7I/AAAAAAAA1G0/Nnun2sPEMAcwqZHI--0K4x_LYB1zZF_JQCLcBGAs/s1600/figure1.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-b-GOeLU7VB8/WY5xmupFp7I/AAAAAAAA1G0/Nnun2sPEMAcwqZHI--0K4x_LYB1zZF_JQCLcBGAs/s400/figure1.png" width="400" height="334" data-original-width="542" data-original-height="452" /></a><br /></pre></blockquote><P><BR>It's actually a sine wave but that detail doesn't matter for us right now. 
<P><BR>An oscillator where the restoring force is proportional to the displacement from the equilibrium point is called a <a href="https://en.m.wikipedia.org/wiki/Simple_harmonic_motion">simple harmonic oscillator</a> and its oscillation is always described by a sine wave. <P><BR>Note that I'm ignoring friction here. This is a reasonable approximation for many physical systems. <P><BR>Masses on springs aren't all that important in themselves. But simple harmonic oscillators are very common. Another standard example is the pendulum swinging under the influence of gravity: <P><BR><blockquote><pre><br /><a href="https://4.bp.blogspot.com/-nigJFEnaBSk/WY5xmnQZPUI/AAAAAAAA1Gs/iGHkuKA5v1YFENB8K--MH8MtL8Cis3zVwCLcBGAs/s1600/figure2.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-nigJFEnaBSk/WY5xmnQZPUI/AAAAAAAA1Gs/iGHkuKA5v1YFENB8K--MH8MtL8Cis3zVwCLcBGAs/s400/figure2.png" width="253" height="400" data-original-width="227" data-original-height="359" /></a><br /></pre></blockquote><P><BR>At a more fundamental level, an example might be an atom in a crystal being held in place by electrostatic forces from its neighbouring atoms. <P><BR>If you have one of these systems, then in principle you can set it in motion with as little energy as you like. Pull a mass on a spring down a little bit and it will bounce back up, oscillating a certain amount. Pull the mass down half the amount and it'll bounce with oscillations half the size. In principle we could keep repeating this experiment, each time starting with the mass displaced half the amount we tried previously. In other words, a simple harmonic oscillator can have any energy we like. The <i>spectrum</i> of possible energies of one of these oscillators is continuous. (Note that the word <i>spectrum</i> here is merely physicist-speak for a set of possible values.) If we can set one in motion with 1 unit of energy then we can also set it oscillating with 0.5 units, or 0.01 units, or 0.000123 units of energy. 
<P><BR><BR><b>Quantum mechanics</b><p><P><BR>Everything I've said above is assuming that classical Newtonian mechanics is valid. But we know that for very small systems, around the size of a few atoms or smaller, we need to use quantum mechanics. This is an enormous topic but I'm only going to extract one basic fact. According to quantum mechanics, a simple harmonic oscillator isn't free to oscillate with any energy you like. The possible energy levels, the spectrum of the system, is discrete. There is a lowest energy level, and then all of the energy levels above that are equally spaced like so, going up forever: <P><BR><blockquote><pre><br /><a href="https://1.bp.blogspot.com/-X0ggcaLnmf8/WY5xnJHkSfI/AAAAAAAA1G4/HXn4-M_LwkALNWq7LcSCOrH0JhEaZc55ACLcBGAs/s1600/figure3.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-X0ggcaLnmf8/WY5xnJHkSfI/AAAAAAAA1G4/HXn4-M_LwkALNWq7LcSCOrH0JhEaZc55ACLcBGAs/s400/figure3.png" width="400" height="216" data-original-width="899" data-original-height="485" /></a><br /></pre></blockquote><P><BR>We usually call the lowest energy level the <a href="https://en.m.wikipedia.org/wiki/Vacuum_state">ground state</a> or <i>vacuum state</i> and call the higher levels <i>excited</i> states. <P><BR>The spacing of the energy levels depends on the <i>stiffness</i> of the system, which is just a measure of how much the restoring force increases with displacement from equilibrium. Stiffer systems will have a higher frequency of oscillation and a bigger spacing between the energy levels. <P><BR>(I'm deliberately not saying anything about why we get discrete energy levels in quantum mechanics. I just want to use this one fact so I can get on and talk about photons eventually.) <P><BR>In practice the difference in energy between one level and the next is tiny. This means that if you're literally fiddling about with a mass on a spring you won't ever feel the discreteness. 
The amount your hand trembles is many orders of magnitude greater than the effect of this discreteness. Nonetheless, it is extremely important when modeling microscopic systems. <P><BR><BR><b>Quantum linguistics</b><p><P><BR>Here are some English sentences we could say about the kinds of systems I've described so far: <P><BR><ol><li>This system is in the ground state.</li><li>That system is in its first excited state.</li><li>This system is at a higher energy level than that system.</li><li>After allowing these two similar oscillators to interact, the energy level of this oscillator went down and the energy level of that one went up by the same amount.</li></ol><P><BR>Now I want to introduce the (count) noun <a href="https://en.m.wikipedia.org/wiki/Quantum">quantum</a>, with plural <i>quanta</i>. The idea here is not that I'm telling you about a new entity. I want to present this as a new way to talk about things I've already introduced. So rather than give a definition of <i>quantum</i> I will instead show how you can rewrite the above sentences using the language of quanta: <P><BR><ol><li>There are no quanta in this system.</li><li>That system has one quantum of energy.</li><li>This system has more quanta than that one.</li><li>Some quanta were transferred from this system to that system.</li></ol><P><BR>Those sentences make it seem like I'm talking about a new kind of object - the quantum. But I'm not. They're just a manner of speaking about energy levels. I hope I've given you enough examples to get the idea. <P><BR>Just in case you think it's weird to talk about energy levels in terms of quanta, I'd like to remind you that you already do this all the time with money. Dollar bills are actual objects that exist in the world. But money in your bank account isn't. Somewhere in some database is a representation of how much money you have. You might say "I have one hundred dollars in my savings account." But those dollars certainly don't exist as distinct entities. 
It doesn't really make sense to talk about the thirty-seventh dollar in your bank account. You can transfer dollars from one account to another, and yet what's really happening is that two totals are being adjusted. We treat these accounts a lot like they're containers holding individual objects called dollars. Certainly our language is set up like that. But we know that it's really just the totals that have any kind of representation. The same goes for quanta. It's just a manner of speaking about systems that can have different amounts of energy and where the spectrum of energy levels forms a ladder with equally spaced rungs. Because of your experience with money I probably don't need to give you any more examples. <P><BR>One more bit of terminology: when the spectrum of energies is discrete it's said to be <i>quantised</i>. <P><BR><BR><b>Coupled systems</b><p><P><BR>Let's return to classical physics with a slightly more complex system consisting of two masses connected to springs. We ignore gravity now: <P><BR><blockquote><pre><br /><a href="https://1.bp.blogspot.com/-6XWjuvCMHsk/WY5xnP9DWbI/AAAAAAAA1HA/2fmPSeYZZw4s4SYX6L403EC9p6AxEShKQCLcBGAs/s1600/figure4.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-6XWjuvCMHsk/WY5xnP9DWbI/AAAAAAAA1HA/2fmPSeYZZw4s4SYX6L403EC9p6AxEShKQCLcBGAs/s400/figure4.png" width="400" height="78" data-original-width="819" data-original-height="160" /></a><br /></pre></blockquote><P><BR>We restrict ourselves to just considering back and forth motion constrained along a horizontal line. This is a coupled system. If the left mass moves to the right, not only does it experience a restoring force pushing it left, but the mass on the right will experience more of a force pushing it to the left. We can't treat the masses as independent and so we don't get the simple solution of each mass always oscillating with a sine wave. 
<P><BR>For this particular problem though there's a trick to turn it into a pair of harmonic oscillators. The idea is to consider the pair of masses as a single entity. We can think of the motion of the centre of mass of the pair, the midpoint between them, as being one variable that describes this entity. Let's call its motion the <i>external</i> motion. We can also think of the distance between the two masses in the pair as being the system's <i>internal</i> motion. (I'm just using <i>internal</i> and <i>external</i> as convenient names. Don't read too much into them.) It turns out that when you analyse this using classical dynamics the internal motion and the external motion act like independent quantities. What's more, each one behaves exactly like it's simple harmonic. So we get one sine wave describing the overall motion of the pair, and another one that describes how the elements of the pair oscillate with respect to each other. <P><BR>The frequencies of the internal and external motions are typically different. So you can end up with some quite complicated motions with two different frequencies beating against each other. <P><BR>When we're able to find ways to split up the motion into independent quantities, each of which is simple harmonic, each kind of motion is said to be a <a href="https://en.m.wikipedia.org/wiki/Normal_mode">normal mode</a>. <P><BR>When you have independent normal modes, you can treat them independently in quantum mechanics too. So what we get is that the spectrum of possible energy levels for this system is, in some sense, two-dimensional. We can put quanta into the internal oscillation and we can also put quanta into the external oscillation. Because these modes have different frequencies the quanta for each mode correspond to different amounts of energy. <P><BR>(And a reminder: when talking about quantum mechanics I'm not talking literally about masses on springs. 
I'm talking about physical systems that have equations of motion that mean they behave like masses on springs. In this case it might be a pair of particles trapped in a microscopic well with a repulsive force between them.) <P><BR><BR><b>Solid state physics</b><p>Now I'm going to jump from just two masses to a large number of them. For example, the behavior of trillions of atoms in a solid crystal can be approximately modelled by a grid of masses and springs, of which the following diagram is just a tiny piece: <P><BR><blockquote><pre><br /><a href="https://1.bp.blogspot.com/-dtL7TaPy6QQ/WY5xnPLd99I/AAAAAAAA1G8/qL58GOG1nMwtltmSmM-R0ZLJC9Ar_RTwwCLcBGAs/s1600/figure5.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-dtL7TaPy6QQ/WY5xnPLd99I/AAAAAAAA1G8/qL58GOG1nMwtltmSmM-R0ZLJC9Ar_RTwwCLcBGAs/s400/figure5.png" width="400" height="400" data-original-width="789" data-original-height="789" /></a><br /></pre></blockquote><P><BR>A real crystal would be arranged in a 3D lattice but I've drawn 2D here for convenience. <P><BR>Think of the springs as both pushing apart atoms that get close, and pulling together atoms that move apart. <P><BR>This is a highly coupled system. Ultimately every atom in our lattice is connected to every other one, either directly, or indirectly. Nonetheless, it is still possible to find normal modes. The normal modes all have the same basic form: they are all sinusoidal waves of displacement traveling in some direction with some speed and oscillation frequency. Each of these modes consists of waves that extend through the entire crystal, with fixed spacing between parallel planar wavefronts. This type of wave is known as a plane wave. If the system is perfectly harmonic, so the restoring force is precisely proportional to the displacement, then each direction and frequency of wave oscillates its way through the crystal completely independently of any other. 
Just as in the example with two masses any possible oscillation is a combination of internal and external motion, for a crystal lattice any motion is a combination of these plane waves. (Decomposing any oscillation as a combination of plane waves is known as computing its <a href="https://en.m.wikipedia.org/wiki/Fourier_transform">Fourier transform</a>.) <P><BR>Now we're ready to consider this situation quantum mechanically. Because each plane wave is a normal mode, we can treat each one as an independent simple harmonic oscillator. This means that the energy in each plane wave is quantised. So when we consider a crystal lattice quantum mechanically we find that its states consist of plane waves propagating through it, but where the amount of energy in each wave is given by a discrete spectrum. So again we can talk about how many quanta there are in each mode. <P><BR>Linguistically it gets a bit more interesting now. Each plane wave is associated with a particular direction and speed so it makes sense to talk of these quanta as having a direction and speed. But note that statements involving quanta are still really just sentences about energy levels. So, for example, the statement "the mode of this system with this velocity and frequency is in its first excited state" is, by definition, exactly the same as "this system has precisely one quantum with this velocity and frequency". In particular, when we write sentences like these we aren't implying that there is some new kind of object, the quantum, that has suddenly attached itself to our crystal. The quanta are properties of the lattice. By the way, in the particular case of vibrating atoms in a lattice, the quanta are known by a special name: <a href="https://en.m.wikipedia.org/wiki/Phonon">phonons</a>. <P><BR><BR><b>Quantum field theory and photons</b><p>And now we're ready to move onto photons. <P><BR>In classical physics, electromagnetism is described by Maxwell's equations. 
Maxwell's equations say that a varying magnetic field generates an electric field and a varying electric field generates a magnetic field. The result is that it is possible for an oscillating electric field to create an oscillating magnetic field, which in turn creates an oscillating electric field, so that an electric field can propagate through space on its own without the help of electric charges or electric currents or any other kind of `generator'. As these electric fields also produce magnetic fields that propagate with them, the whole thing is called an electromagnetic wave. <P><BR>Just like displacements in a crystal lattice, an electromagnetic wave also has normal modes. The normal modes are plane waves traveling at the speed of light in a particular direction with a given frequency. You have personal experience of this. Visible light is electromagnetic radiation with a frequency of around 500 THz. Wifi uses signals at around 5 GHz. The radio might use signals at around 100 MHz. When you surf the web wirelessly while listening to the radio, the wifi signals don't interfere with your vision or the radio signal. (Actually, wifi might interfere with the radio signals, but not because of the 5 GHz signals. It might happen if badly manufactured hardware emits stray signals around the 100 MHz band.) That's because these waves pass through each other without being coupled to each other in any way. And at this point you might already be guessing what a <i>photon</i> is. For each choice of frequency and direction (and also polarisation, but that's just a detail) the amount of energy that can be in the corresponding mode is quantised. For the electromagnetic field the quanta are called <a href="https://en.m.wikipedia.org/wiki/Photon">photons</a>. <P><BR>And that's it! <P><BR>Electromagnetic waves can be thought of as being made up of different oscillation modes. Because of quantum mechanics, each mode contains an amount of energy that is quantised to be a whole number multiple of some base amount. 
Although the thing that really matters is the total amount of energy in the modes, it can still be useful to talk about this total as if it's a collection of entities called photons. <P><BR>One thing to notice is that the normal modes for an electromagnetic wave are plane waves that are extended in space. In principle they extend all the way across the universe, but for practical problems physicists often consider electromagnetic waves in a large but finite box. This means that adding a quantum to a system has an effect that extends across the entire system. That makes it problematic to talk about the location of a photon. <P><BR><BR><b>Caveat</b><p>Physicists sometimes use the word <i>photon</i> in slightly different but related ways. I've described what I think of as the core definition as presented in many courses on quantum field theory. <P><BR><BR><b>Acknowledgements</b><p>Thanks to <tt>@dmoore2718</tt> for encouraging me to edit this document down to a better size. sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com15tag:blogger.com,1999:blog-11295132.post-10130731775091196692017-07-15T09:09:00.000-07:002017-07-18T06:57:37.046-07:00Self-referential logic via self-referential circuits<BR><b>Introduction</b><p><P><BR><b>TL;DR</b> The behaviour of a certain kind of delay component has a formal similarity to Löb's theorem which gives a way to embed part of provability logic into electronic circuits. <P><BR>Here's a <a href="https://en.wikipedia.org/wiki/Liar_paradox">famous</a> paradoxical sentence: <P><BR><blockquote><i>This sentence is false</i></blockquote><P><BR>If it's false then it's true and if it's true then it's false. 
<P><BR>Here's a paradoxical electronic circuit: <div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-eddM4bJTB1c/WWV2GUuZHdI/AAAAAAAA0sE/smasg7Ey1uMKjJmhttqcPsWHvydbwTzbACLcBGAs/s1600/not.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-eddM4bJTB1c/WWV2GUuZHdI/AAAAAAAA0sE/smasg7Ey1uMKjJmhttqcPsWHvydbwTzbACLcBGAs/s320/not.png" width="320" height="168" data-original-width="256" data-original-height="134" /></a></div><BR> The component in the middle is an inverter. If the output of the circuit is high then its input is high and then its output must be low, and vice versa. <P><BR>There's a similarity here. But with a bit of tweaking you can turn the similarity into an isomorphism of sorts. <P><BR>In the first case we avoid paradox by noting that in the mathematical frameworks commonly used by mathematicians it's impossible, in general, for a statement to assert its own falsity. Instead, a statement can assert its own <i>unprovability</i> and then we get Gödel's incompleteness theorems and a statement that is apparently true and yet can't be proved. <P><BR>In the second case we can't model the circuit straightforwardly as a digital circuit. In practice it might settle down to a voltage that lies between the official high and low voltages so we have to model it as an analogue circuit. Or instead we can introduce a clock and arrange that the feedback in the circuit is delayed. We then get an <a href="https://electronics.stackexchange.com/questions/206907/how-to-delay-a-not-gate-oscillator-to-make-it-run-at-a-desired-frequency">oscillator circuit</a> that can be thought of as outputting a stream of bits. <P><BR>The observation I want to make is that if the feedback delay is defined appropriately, these two scenarios are in some sense isomorphic. 
This means that we can model classic results about provability, like Gödel's incompleteness theorems, using electronic circuits. We can even use such circuits to investigate what happens when <a href="https://intelligence.org/files/ProgramEquilibrium.pdf">logicians or robots play games like Prisoner's Dilemma</a>. I'll be making use of results found in Boolos' book on <a href="http://www.cambridge.org/catalogue/catalogue.asp?isbn=0521483255">The Logic of Provability</a> and some ideas I borrowed from Smoryński's <a href="https://projecteuclid.org/euclid.bams/1183548781">paper</a> on Fixed Point Algebras. I'll be assuming the reader has at least a slight acquaintance with the ideas behind provability logic. <P><BR><BR><b>Provability Logic</b><p>There are many descriptions of <a href="https://en.wikipedia.org/wiki/Provability_logic">provability logic</a> (aka GL) available online, so I'm not going to repeat it all here. However, I've put some background material in the <a href="#appendix">appendix</a> below and I'm going to give a very brief reminder now. <P><BR>Start with (classical) propositional calculus which has a bunch of variables with names like <img src="https://chart.googleapis.com/chart?cht=tx&chl=a%2C%20b%2C%20c%2C%20d%2C%20%5Cldots" style="vertical-align:middle"> and connectives like <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cwedge" style="vertical-align:middle"> for AND, <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cvee" style="vertical-align:middle"> for OR, <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cneg" style="vertical-align:middle"> for NOT and <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Crightarrow" style="vertical-align:middle"> for implication. (Note that <img src="https://chart.googleapis.com/chart?cht=tx&chl=a%5Crightarrow%20b%20%3D%20%5Cneg%20a%5Cvee%20b" style="vertical-align:middle">.) 
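<P><BR>As a small illustration of the propositional part only, here is a sketch of a formula type with a brute-force tautology checker. The names <tt>Prop</tt>, <tt>eval</tt>, <tt>vars</tt>, <tt>tautology</tt> and <tt>iff</tt> are my own hypothetical choices, and nothing here models provability. Implication is given its own truth-table clause so that checking it against NOT and OR is a genuine test:

```haskell
import Data.List (nub)

-- A hypothetical syntax tree for propositional formulae.
data Prop
    = Var Char
    | Not Prop
    | And Prop Prop
    | Or Prop Prop
    | Imp Prop Prop

-- Evaluate a formula under an assignment of truth values to variables.
-- Variables missing from the assignment are treated as False.
eval :: [(Char, Bool)] -> Prop -> Bool
eval env (Var v)   = maybe False id (lookup v env)
eval env (Not p)   = not (eval env p)
eval env (And p q) = eval env p && eval env q
eval env (Or p q)  = eval env p || eval env q
eval env (Imp p q) = if eval env p then eval env q else True

-- All variable names appearing in a formula (possibly with repeats).
vars :: Prop -> [Char]
vars (Var v)   = [v]
vars (Not p)   = vars p
vars (And p q) = vars p ++ vars q
vars (Or p q)  = vars p ++ vars q
vars (Imp p q) = vars p ++ vars q

-- A tautology is true under every assignment to its variables.
tautology :: Prop -> Bool
tautology p = and [eval (zip vs bs) p | bs <- mapM (const [False, True]) vs]
  where vs = nub (vars p)

iff :: Prop -> Prop -> Prop
iff p q = And (Imp p q) (Imp q p)

main :: IO ()
main = do
    let a = Var 'a'
        b = Var 'b'
    print (tautology (Imp a b `iff` Or (Not a) b))  -- True
    print (tautology (Imp a b))                     -- False
```

The first line printed confirms the identity noted above; the second shows that implication on its own is not a tautology.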
<P><BR>Provability logic extends propositional calculus by adding a unary operator <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox" style="vertical-align:middle">. (I apologise, that's meant to be a □ but it's coming out like <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox" style="vertical-align:middle"> in LaTeX formulae. I think it's a bug in Google's LaTeX renderer.) The idea is that <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox%20p" style="vertical-align:middle"> asserts that <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> is provable in Peano Arithmetic, aka PA. In addition to the axioms of propositional calculus we have <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox%28p%5Crightarrow%20q%29%5Crightarrow%5Cbox%20p%5Crightarrow%5Cbox%20q" style="vertical-align:middle"></blockquote>and <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox%20p%5Crightarrow%5Cbox%5Cbox%20p" style="vertical-align:middle"></blockquote>as well as a rule that allows us to deduce <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox%20p" style="vertical-align:middle"> from <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle">. <P><BR>We also have this fixed point property: <P><BR><blockquote>Let <img src="https://chart.googleapis.com/chart?cht=tx&chl=F%28p%29" style="vertical-align:middle"> be any predicate we can write in the language of GL involving the variable <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle">, and suppose that every appearance of <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> in <img src="https://chart.googleapis.com/chart?cht=tx&chl=F%28p%29" style="vertical-align:middle"> is inside a <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox" style="vertical-align:middle">, e.g. 
<img src="https://chart.googleapis.com/chart?cht=tx&chl=F%28p%29%3D%5Cbox%20p%5Cvee%5Cbox%28%5Cneg%20p%29" style="vertical-align:middle">. Then there is a fixed point, i.e. a proposition <img src="https://chart.googleapis.com/chart?cht=tx&chl=q" style="vertical-align:middle"> that makes no mention of <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> such that <img src="https://chart.googleapis.com/chart?cht=tx&chl=q%5Cleftrightarrow%20F%28q%29" style="vertical-align:middle"> is a theorem. In effect, for any such <img src="https://chart.googleapis.com/chart?cht=tx&chl=F" style="vertical-align:middle">, <img src="https://chart.googleapis.com/chart?cht=tx&chl=q" style="vertical-align:middle"> is a proposition that asserts <img src="https://chart.googleapis.com/chart?cht=tx&chl=F%28q%29" style="vertical-align:middle">. </blockquote><P><BR>See the <a href="#appendix">appendix</a> for a brief mention of why we should expect this to be true. <P><BR>From the fixed point property we can deduce Löb's theorem: <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox%28%5Cbox%20p%5Crightarrow%20p%29%5Crightarrow%5Cbox%20p" style="vertical-align:middle">. There is a <a href="https://en.wikipedia.org/wiki/Löb%27s_theorem">proof</a> at wikipedia that starts from the fixed point property. <P><BR>We can also deduce the fixed point property from Löb's theorem so it's more usual to take Löb's theorem as an axiom of GL and show that the fixed point property follows. You can think of Löb's theorem as a cunning way to encode the fixed point property. In fact <a href="http://semantic-domain.blogspot.com/2016/05/lobs-theorem-is-almost-y-combinator.html">you can argue</a> that it's a sort of Y-combinator, the function that allows the formation of recursive fixed points in functional programming languages. 
(That's also, sort of, the role played by the <tt>loeb</tt> function I defined <a href="http://blog.sigfpe.com/2006/11/from-l-theorem-to-spreadsheet.html">way back</a>. But note that <tt>loeb</tt> isn't really a proof of Löb's theorem, it just has formal similarities.) <P><BR><BR><b>Back to electronic circuits</b><p>In order to make digital circuits with feedback loops well-behaved I could introduce a circuit element that results in a delay of one clock cycle. If you insert one of these into the inverter circuit I started with you'll end up with an oscillator that flips back and forth between 0 and 1 on each clock cycle. But I want to work with something slightly stricter. I'd like my circuits to eventually stop oscillating. (I have an ulterior motive for studying these.) Let me introduce this component: <div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-qSHnBEcazFw/WWZAsml7evI/AAAAAAAA0tE/txIeWd8WljYf7pRC5guVsBkE9Sc0YUa8QCLcBGAs/s1600/delay.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/-qSHnBEcazFw/WWZAsml7evI/AAAAAAAA0tE/txIeWd8WljYf7pRC5guVsBkE9Sc0YUa8QCLcBGAs/s320/delay.png" width="320" height="137" data-original-width="196" data-original-height="84" /></a></div><BR> It is intended to serve as a delayed latch and I'll always have the flow of data being from left to right. The idea is that when it is switched on it outputs 1. It keeps outputting 1 until it sees a 0 input. When that happens, then on the next clock cycle its output drops to 0 and never goes back up to 1 until reset. <P><BR>Because the output of our delay-latch isn't a function of its current input, we can't simply describe its operation as a mathematical function from <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5C%7B0%2C1%5C%7D" style="vertical-align:middle"> to <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5C%7B0%2C1%5C%7D" style="vertical-align:middle">. 
Instead let's think of electronic components as operators on bitstreams, i.e. infinite streams of binary digits like <tt>...00111010</tt> with the digits emerging over time starting with the one written on the right and working leftwards. The ordinary logic gates perform bitwise operations which I'll represent using the operators in the C programming language. For example, <blockquote><tt>...001110 & ...101010 = ...001010</tt></blockquote>and <blockquote><tt>~...101 = ...010</tt></blockquote>and so on. Let's use □ to represent the effect of delay-latch on a bitstream. We have, for example, <blockquote><tt>□...000 = ...001</tt></blockquote>and <blockquote><tt>□...11101111 = ...00011111</tt>. </blockquote>The operator □ takes the (possibly empty) contiguous sequence of 1's at the end of the bitstream, extends it by one 1, and sets everything further to the left to 0. If we restrict ourselves to bitstreams that eventually become all 0's or all 1's on the left, then bitstreams are in one-to-one correspondence with the integers using the two's complement representation. For example <tt>...111111</tt>, all 1's, represents the number -1. I'll simply call the bitstreams that represent integers "integers". With this restriction we can use a classic C hacker trick to write <tt>□p=p^(p+1)</tt> where <tt>^</tt> is the C XOR operator. The operator □ outputs the bits that get flipped when you add one. <P><BR>Let's use the symbol <tt>→</tt> so that <tt>a → b</tt> is shorthand for <tt>~a|b</tt>. Here are some properties of □: <P><BR>1. <tt>□(-1) = -1</tt><P><BR>2. <tt>□p → □□p = -1</tt><P><BR>3. <tt>□(p → q) → □p → □q = -1</tt><P><BR>In addition we have the fixed point property: <P><BR><blockquote>Let F(p) be any function of p we can write using □ and the bitwise logical operators and such that all occurrences of p occur inside □. Then there is a unique bitstream q such that q=F(q). </blockquote><P><BR>We can make this clearer if we return to circuits. 
F(p) can be thought of as a circuit that takes p as input and outputs some value. We build the circuit using only boolean logic gates and delay-latch. We allow feedback loops, but only ones that go through delay-latches. With these restrictions it's pretty clear that the circuit is well-behaved and deterministically outputs a bitstream. <P><BR>We also have the Löb property: <P><BR>4. <tt>□(□p → p) → □p = -1</tt> <P><BR>We can see this by examining the definition of □. Intuitively it says something like "once □ has seen a 0 input then no amount of setting input bits to 1 later in the stream makes any difference to its output". <P><BR>I hope you've noticed something curious. These properties are extremely close to the properties of <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox" style="vertical-align:middle"> in GL. In fact, these electronic circuits form a model of the part of GL that doesn't involve variable names, i.e. what's known as letterless GL. We can formalise this: <P><BR>1. Map <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbot" style="vertical-align:middle"> to a wire set to 0, which outputs <tt>...000 = 0</tt>. <P><BR>2. Map <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctop" style="vertical-align:middle"> to a wire set to 1, which outputs <tt>...111 = -1</tt>. <P><BR>3. Map <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%20%5Ccirc%20q" style="vertical-align:middle">, where <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ccirc" style="vertical-align:middle"> is a binary connective, by creating a circuit that takes the outputs from the circuits for <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=q" style="vertical-align:middle"> and passes them into the corresponding boolean logic gate. <P><BR>4. 
Map <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox%20p" style="vertical-align:middle"> to the circuit for <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> piped through a delay-latch. <P><BR>For example, let's convert <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox%28%5Cbox%5Cbot%5Crightarrow%5Cbot%29%5Crightarrow%5Cbox%5Cbot" style="vertical-align:middle"> into a circuit. I'm translating <img src="https://chart.googleapis.com/chart?cht=tx&chl=a%5Crightarrow%20b" style="vertical-align:middle"> to the circuit for <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cneg%20a%5Cvee%20b" style="vertical-align:middle">. <div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-ktaB3QyciLQ/WWbFbBtW4QI/AAAAAAAA0tw/Zl2EjHqVuwEHTilSJtLoZdduN0m5OKFuACLcBGAs/s1600/a.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-ktaB3QyciLQ/WWbFbBtW4QI/AAAAAAAA0tw/Zl2EjHqVuwEHTilSJtLoZdduN0m5OKFuACLcBGAs/s640/a.gif" width="640" height="103" data-original-width="1248" data-original-height="200" /></a></div> <P><BR>I'm using red wires to mean wires carrying the value 1 rather than 0. I hope you can see that this circuit eventually settles into a state that outputs nothing but 1s. <P><BR>We have this neat result: <blockquote>Because delay-latch satisfies the same equations as <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox" style="vertical-align:middle"> in provability logic, any theorem, translated into a circuit, will produce a bitstream of just 1s, i.e. -1. </blockquote><P><BR>But here's a more surprising result: the converse is true. <blockquote>If the circuit corresponding to a letterless GL proposition produces a bitstream of just 1s then the proposition is actually a theorem of GL. </blockquote>I'm not going to prove this. (It's actually a disguised form of lemma 7.4 on p.95 of Boolos' book.) 
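<P><BR>The four translation rules above can also be sketched in software, with the integer trick <tt>□p=p^(p+1)</tt> standing in for the delay-latch. This is only a minimal sketch under stated assumptions: the type <tt>Prop</tt>, its constructors, and the names <tt>value</tt> and <tt>lobBot</tt> are hypothetical and made up for illustration (they are not taken from any library), and bitstreams are assumed to be represented by Haskell's arbitrary-precision <tt>Integer</tt> with the two's complement bitwise operations from <tt>Data.Bits</tt>.

```haskell
import Data.Bits (complement, xor, (.|.), (.&.))

-- Hypothetical syntax tree for letterless GL propositions (illustration only).
data Prop
  = Bot              -- falsity, a wire set to 0
  | Top              -- truth, a wire set to 1
  | Neg Prop
  | And Prop Prop
  | Or  Prop Prop
  | Imp Prop Prop    -- a -> b, translated as (not a) or b
  | Box Prop         -- the circuit for p piped through a delay-latch

-- The delay-latch on integers: the bits that flip when you add one.
box :: Integer -> Integer
box p = p `xor` (p + 1)

-- Translate a proposition into the integer its circuit outputs.
value :: Prop -> Integer
value Bot       = 0
value Top       = -1
value (Neg a)   = complement (value a)
value (And a b) = value a .&. value b
value (Or a b)  = value a .|. value b
value (Imp a b) = complement (value a) .|. value b
value (Box a)   = box (value a)

-- Loeb's theorem specialised to p = Bot, the example pictured above.
lobBot :: Prop
lobBot = Imp (Box (Imp (Box Bot) Bot)) (Box Bot)
```

Working through <tt>value lobBot</tt> by hand gives <tt>box 0 = 1</tt>, then <tt>~1|0 = -2</tt>, then <tt>box (-2) = 1</tt>, and finally <tt>~1|1 = -1</tt>, which agrees with the circuit settling to all 1s. <P><BR>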
In the pictured example we got <tt>...1111</tt>, so the circuit represents a theorem. As it represents Löb's theorem for the special case <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%3D%5Cbot" style="vertical-align:middle"> we should hope so. More generally, any bitstream that represents an integer can be converted back into a proposition that is equivalent to the original proposition. This means that bitstreams faithfully represent propositions of letterless GL. I'm not going to give the translation here but it's effectively given in Chapter 7 of Boolos. I'll use <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cpsi%28p%29" style="vertical-align:middle"> to represent the translation from propositions to bitstreams via circuits that I described above. Use <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cphi%28b%29" style="vertical-align:middle"> to represent the translation of bitstream <img src="https://chart.googleapis.com/chart?cht=tx&chl=b" style="vertical-align:middle"> back into propositions. We have <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%5Cleftrightarrow%5Cphi%28%5Cpsi%28p%29%29" style="vertical-align:middle">. But I haven't given a full description of <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cphi" style="vertical-align:middle"> and I haven't proved here that it has this property. <P><BR><BR><b>Circuits with feedback</b><p><P><BR>In the previous section I considered letterless propositions of GL. When these are translated into circuits they don't have feedback loops. But we can also "solve equations" in GL using circuits <i>with</i> feedback. 
The GL fixed point theorem above says that we can "solve" the equation <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%5Cleftrightarrow%20F%28p%29" style="vertical-align:middle">, with one letter <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle">, to produce a letterless proposition <img src="https://chart.googleapis.com/chart?cht=tx&chl=q" style="vertical-align:middle"> such that <img src="https://chart.googleapis.com/chart?cht=tx&chl=q%5Cleftrightarrow%20F%28q%29" style="vertical-align:middle">. Note here that <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> is a letter in the language of GL. But I'm using <img src="https://chart.googleapis.com/chart?cht=tx&chl=q" style="vertical-align:middle"> to represent a proposition in letterless GL. If we build a circuit to represent <img src="https://chart.googleapis.com/chart?cht=tx&chl=F" style="vertical-align:middle">, and feed its output back into where <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> appears, then the output bitstream represents the fixed point. Here's a translation of the equation <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%20%5Cleftrightarrow%20%5Cneg%28%5Cbox%20p%20%5Cvee%20%5Cbox%5Cbox%5Cbox%20p%29" style="vertical-align:middle">: <div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-Ldwk2xdxuzI/WWeKAbF8pdI/AAAAAAAA0uM/bjcJSFtTiBYexX8Tzzt8yVWZ16M6BIuAwCLcBGAs/s1600/animated.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-Ldwk2xdxuzI/WWeKAbF8pdI/AAAAAAAA0uM/bjcJSFtTiBYexX8Tzzt8yVWZ16M6BIuAwCLcBGAs/s640/animated.gif" width="640" height="210" data-original-width="1012" data-original-height="332" /></a></div> I'll let you try to convince yourself that such circuits always eventually output all 0's or all 1's. 
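<P><BR>One way to experiment with a feedback circuit like this without any hardware is to model bitstreams as lazy lists and tie the knot. The following is a minimal sketch, assuming a bitstream is a Haskell list of <tt>Bool</tt> with the earliest (rightmost) bit first; the names <tt>Stream</tt>, <tt>box</tt>, <tt>notS</tt>, <tt>orS</tt>, <tt>f</tt>, <tt>solution</tt> and <tt>firstBits</tt> are hypothetical. It relies on the fact that each output bit of the delay-latch depends only on strictly earlier input bits, which is what should make the recursive definition productive.

```haskell
import Data.Function (fix)

-- A bitstream, earliest bit first (hypothetical representation).
type Stream = [Bool]

-- Delay-latch: outputs 1 on the first tick, and afterwards outputs 1
-- only while every input bit seen on earlier ticks was 1.
box :: Stream -> Stream
box xs = True : scanl1 (&&) xs

-- Bitwise NOT and OR gates acting on whole streams.
notS :: Stream -> Stream
notS = map not

orS :: Stream -> Stream -> Stream
orS = zipWith (||)

-- The circuit for F(p) = not (box p or box (box (box p))).
f :: Stream -> Stream
f p = notS (box p `orS` box (box (box p)))

-- Feed the output of the circuit back into its own input.
solution :: Stream
solution = fix f

-- Show the first n bits in the usual right-to-left notation.
firstBits :: Int -> Stream -> String
firstBits n = reverse . map (\b -> if b then '1' else '0') . take n
```

With these definitions <tt>firstBits 8 solution</tt> should come out as <tt>11111000</tt>: the first three bits are 0 because at least one of the two latches is still outputting 1, and after that both latches have dropped to 0 for good. <P><BR>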
When we run the circuit we get the output <tt>...1111000 = -8</tt>. As this is not -1 we know that the fixed point isn't a theorem. If I'd defined <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cphi" style="vertical-align:middle"> above you could use it to turn the bitstream back into a proposition. <P><BR><BR><b>The same, syntactically (optional section)</b><p>I have a Haskell library on github for working with GL: <a href="https://github.com/dpiponi/provability">provability</a>. This uses a syntactic approach and checks propositions for theoremhood using a <a href="https://en.wikipedia.org/wiki/Method_of_analytic_tableaux">tableau method</a>. We can use it to analyse the above example with feedback. I have implemented a function, currently called <tt>value'</tt>, to perform the evaluation of the bitstream for a proposition. However, in this case the <tt>fixedpoint</tt> function computes the fixed point proposition first and then converts to a bitstream rather than computing the bitstream directly from the circuit for F: <P><BR><pre><br />> let f p = Neg (Box p \/ Box (Box (Box p)))<br />> let Just p = fixedpoint f<br />> p<br />Dia T /\ Dia (Dia T /\ Dia (Dia T /\ Dia T))<br />> value' p<br />-8<br /><P><BR><br /></pre>(Note that <tt>Dia p</tt> means <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cdia%20p%20%3D%20%5Cneg%5Cbox%5Cneg%20p" style="vertical-align:middle">.) <P><BR>The function <tt>fixedpoint</tt> does a lot of work under the hood. (It uses a tableau method to carry out <a href="https://en.wikipedia.org/wiki/Craig_interpolation">Craig interpolation</a>.) The circuit approach requires far less work. <P><BR><BR><b>Applications</b><p><i>1. Programs that reason about themselves</i><P><BR>In principle we can write a program that enumerates all theorems of PA. That means we can use a quine trick to write a computer program that searches for a proof, in PA, of its own termination. Does such a program terminate? 
<P><BR>We can answer this with Löb's theorem. Let <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%20%3D" style="vertical-align:middle"> "The program terminates". The program terminates if it can prove its termination. Formally this means we assume <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox%20p%5Crightarrow%20p" style="vertical-align:middle">. Using one of the derivation rules of GL we get <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox%28%5Cbox%20p%5Crightarrow%20p%29" style="vertical-align:middle">. Löb's theorem now gives us <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox%20p" style="vertical-align:middle">. Feed that back into our original hypothesis and we get <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle">. In other words, we deduce that our program does in fact terminate. (Thanks to Sridhar Ramesh for pointing this out to me.) <P><BR>But we can deduce this using a circuit. We want a solution to <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%5Cleftrightarrow%20%5Cbox%20p" style="vertical-align:middle">. Here's the corresponding circuit: <div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-EcVE5Qpyc-o/WWeSmMu8dLI/AAAAAAAA0uc/G8uTCSTDK8sRS90j54_x6Yh2H1KNW38rACLcBGAs/s1600/loop1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/-EcVE5Qpyc-o/WWeSmMu8dLI/AAAAAAAA0uc/G8uTCSTDK8sRS90j54_x6Yh2H1KNW38rACLcBGAs/s320/loop1.png" width="320" height="120" data-original-width="342" data-original-height="128" /></a></div> It starts by outputting 1's and doesn't stop. In other words, the fixed point is a theorem. And that tells us <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> is a theorem. And hence that the program terminates. <P><BR><i>2. 
Robots who reason about each other's play in Prisoner's Dilemma</i><P><BR>For the background to this problem see <a href="http://lesswrong.com/lw/hmw/robust_cooperation_in_the_prisoners_dilemma/">Robust Cooperation in the Prisoner's Dilemma</a> at LessWrong. We have two robot participants <img src="https://chart.googleapis.com/chart?cht=tx&chl=A" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=B" style="vertical-align:middle"> playing Prisoner's Dilemma. Each can examine the other's source code and can search for proofs that the opponent will cooperate. Suppose each robot is programmed to enumerate all proofs of PA and cooperate if it finds a proof that its opponent will cooperate. Here we have <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%20%3D" style="vertical-align:middle"> "A will cooperate" and <img src="https://chart.googleapis.com/chart?cht=tx&chl=q%20%3D" style="vertical-align:middle"> "B will cooperate". Our assumptions about the behaviour of the robots are <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%20%5Cleftrightarrow%20%5Cbox%20q" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=q%20%5Cleftrightarrow%20%5Cbox%20p" style="vertical-align:middle">, and hence that <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%20%5Cleftrightarrow%20%5Cbox%5Cbox%20p" style="vertical-align:middle">. 
This corresponds to the circuit: <div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-bHPAL4Vvlpk/WWeU3IR-M8I/AAAAAAAA0uk/nXJTLjCFLS0fKihAolPtzHJtb70OQyROwCLcBGAs/s1600/loop2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-bHPAL4Vvlpk/WWeU3IR-M8I/AAAAAAAA0uk/nXJTLjCFLS0fKihAolPtzHJtb70OQyROwCLcBGAs/s320/loop2.png" width="320" height="83" data-original-width="492" data-original-height="128" /></a></div> This outputs <tt>...1111 = -1</tt> so we can conclude <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> and hence that these programs will cooperate. (Note that this doesn't work out nicely if robot B has a program that doesn't terminate but whose termination isn't provable in the formal system A is using. That means this approach is only good for robots that want to cooperate and want to confirm such cooperation. See the <a href="https://arxiv.org/abs/1401.5577">paper</a> for more on this.) <P><BR>At this point I really must emphasise that these applications are deceptively simple. I've shown how these simple circuits can answer some tricky problems about provability. But these aren't simply the usual translations from boolean algebra to logic gates. They work because circuits with delay-latch provide a model for letterless provability logic and that's only the case because of a lot of non-trivial theorem proving in Boolos that I haven't reproduced here. You're only allowed to use these simple circuits once you've seen the real proofs :-) <P><BR><BR><b>Things I didn't say above</b><p>1. I described the translation from propositions to circuits that I called <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cpsi" style="vertical-align:middle"> above. But I didn't tell you what <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cphi" style="vertical-align:middle"> looks like. 
I'll leave this as an exercise. (Hint: consider the output from the translation of <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox%5En%5Cbot" style="vertical-align:middle"> into a circuit.) <P><BR>2. The integers, considered as bitstreams, with the bitwise operators, and the unary operator <tt>□p=p^(p+1)</tt>, form an algebraic structure. For example, if we define <tt>⋄p=~□~p</tt> we have a <a href="https://ncatlab.org/nlab/show/Magari+algebra">Magari algebra</a>. Structures like these are intended to capture the essential parts of self-referential arguments in an algebraic way. <P><BR>3. Because of the interpretation of □ as a delayed latch in a circuit you could view it as saying "my input was always true until a moment ago". This surely embeds provability logic in a <a href="https://en.wikipedia.org/wiki/Temporal_logic">temporal logic</a> of some sort. <P><BR>4. (Deleted speculations about tit-for-tat that need rethinking.) <P><BR>5. For even the most complex letterless proposition in Boolos you could check its theoremhood with a pretty small circuit. You could even consider doing this with a steam-powered <a href="https://en.wikipedia.org/wiki/Pneumatic_circuit">pneumatic circuit</a>. I had to say that to fulfil a prophecy and maintain the integrity of the timeline. <P><BR> <a id="appendix"></a> <BR><b>Appendix on provability</b><p>The modern notion of a proof is that it is a string of symbols generated from some initial strings called "axioms" and some derivation rules that make new strings from both axioms and strings you've derived previously. Usually we pick axioms that represent "self-evident" truths and we pick derivation rules that are "truth-preserving" so that every proof ends at a true proposition of which it is a proof. The derivation rules are mechanical in nature: things like "if you have this symbol here and that symbol there then you can replace this symbol with that string you derived earlier" etc. 
<P><BR>You can represent strings of symbols using numbers, so-called Gödel numbers. Let's pick a minimal mathematical framework for working with numbers: <a href="https://en.wikipedia.org/wiki/Peano_axioms#First-order_theory_of_arithmetic">Peano Arithmetic</a>, aka PA. Let's assume we've made some choice of Gödel numbering scheme and when <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> is a proposition, write <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Bp%5D" style="vertical-align:middle"> for the number representing <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle">. You can represent the mechanical derivation rules as operations on numbers. And that makes it possible to define a mathematical predicate <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov" style="vertical-align:middle"> that is true if and only if its argument represents a provable proposition. <P><BR>In other words, we can prove <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov%28%5Bp%5D%29" style="vertical-align:middle"> using PA if and only if <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> is a proposition provable in PA. <P><BR>The predicate <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov" style="vertical-align:middle"> has some useful properties: <P><BR>1.<i>If we can prove <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle">, then we can prove <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov%28%5Bp%5D%29" style="vertical-align:middle">.</i><P><BR>We take the steps we used to prove <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle">, and convert everything to propositions about numbers. 
If <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov" style="vertical-align:middle"> is defined correctly then we can convert that sequence of numbers into a sequence of propositions about those numbers that makes up a proof of <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov%28%5Bp%5D%29" style="vertical-align:middle">. <P><BR>2.<i><img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov%28%5Bp%5Crightarrow%20q%5D%29" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov%28%5Bp%5D%29" style="vertical-align:middle"> imply <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov%28%5Bq%5D%29" style="vertical-align:middle"></i><P><BR>A fundamental step in any proof is <i>modus ponens</i>, i.e. that <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%5Crightarrow%20q" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> imply <img src="https://chart.googleapis.com/chart?cht=tx&chl=q" style="vertical-align:middle">. If <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov" style="vertical-align:middle"> does its job correctly then it had better know about this. <P><BR>3.<i><img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov%28%5Bp%5D%29" style="vertical-align:middle"> implies <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov%28%5BProv%28%5Bp%5D%29%5D%29" style="vertical-align:middle"></i><P><BR>One way to prove this is to use Löb's theorem. <P><BR>4. <i><img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov%28%5B%5Ctop%5D%29" style="vertical-align:middle"></i><P><BR>The trivially true statement had better be provable or <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov" style="vertical-align:middle"> is broken. <P><BR>Constructing <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov" style="vertical-align:middle"> is conceptually straightforward but hard work. 
I'm definitely not going to do it here. <P><BR>And there's one last thing we need: self-reference. If <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> is a proposition, how can we possibly assert <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov%28%5Bp%5D%29" style="vertical-align:middle"> without squeezing a copy of <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Bp%5D" style="vertical-align:middle"> inside <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle">? I'm not going to do that here either - just mention that we can use a variation of <a href="https://en.wikipedia.org/wiki/Quine_%28computing%29">quining</a> to achieve this. That allows us to form a proposition <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> for which we can prove <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%5Cleftrightarrow%20Prov%28%5Bp%5D%29" style="vertical-align:middle">. In fact, we can go further. We can find propositions that solve <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%5Cleftrightarrow%20F%28p%29" style="vertical-align:middle"> for any predicate <img src="https://chart.googleapis.com/chart?cht=tx&chl=F%28p%29" style="vertical-align:middle"> built from the usual boolean operations and <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> as long as all of the occurrences of <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> are inside the appearances of <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov" style="vertical-align:middle">. Even though we can't form a proposition that directly asserts its own falsity, we can form one that asserts that it is unprovable, or one that asserts that you can't prove that you can't prove that you can prove it, or anything along those lines. 
<P><BR>Anyway, all that <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5B%5D" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov" style="vertical-align:middle"> business is a lot of hassle. Provability logic, also known as GL, is intended to capture specifically the parts of PA that relate to provability. GL is propositional calculus extended with the provability operator <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox" style="vertical-align:middle">. The intention is that if <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> is a proposition, <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbox%20p" style="vertical-align:middle"> is a proposition in GL that represents <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov%28%5Bp%5D%29" style="vertical-align:middle"> in PA. The properties of <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov" style="vertical-align:middle"> above become the axioms and derivation rules of GL in the main text. <P><BR><BR><b>A relaxation technique</b> (posted by sigfpe, 2017-06-06)<P><BR><b>Introduction</b><p>Sometimes you want to differentiate the expected value of something. I've written about some tools that can help with this. For example you can use <a href="http://blog.sigfpe.com/2005/07/automatic-differentiation.html">Automatic Differentiation</a> for the derivative part and <a href="http://blog.sigfpe.com/2007/03/monads-vector-spaces-and-quantum.html">probability monads</a> for the expectation. But the probability monad I described in that article computes the complete probability distribution for your problem. Frequently this is intractably large. Instead people often use Monte Carlo methods. 
They'll compute the "something" many times, substituting pseudo-random numbers for the random variables, and then average the results. This provides an estimate of the expected value and is ubiquitous in many branches of computer science. For example it's the basis of ray-tracing and path-tracing algorithms in 3D rendering, and plays a major role in machine learning when used in the form of stochastic gradient descent. <P><BR>But there's a catch. Suppose we want to compute <img src="https://chart.googleapis.com/chart?cht=tx&chl=E%28f%28X_1%2CX_2%2C%5Cldots%2CX_n%29%29" style="vertical-align:middle"> where each of the <img src="https://chart.googleapis.com/chart?cht=tx&chl=X_i" style="vertical-align:middle"> is drawn from the Bernoulli distribution <img src="https://chart.googleapis.com/chart?cht=tx&chl=B%28p%29" style="vertical-align:middle">. I.e. each <img src="https://chart.googleapis.com/chart?cht=tx&chl=X_i" style="vertical-align:middle"> has a probability <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> of being 1 and probability <img src="https://chart.googleapis.com/chart?cht=tx&chl=1-p" style="vertical-align:middle"> of being 0. If we compute this using a Monte Carlo approach we'll repeatedly generate pseudo-random numbers for each of the <img src="https://chart.googleapis.com/chart?cht=tx&chl=X_i" style="vertical-align:middle">. Each one will be 0 or 1. This means that our estimate depends on <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> via subexpressions that can't meaningfully be differentiated with respect to <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle">. So how can we use automatic differentiation with the Monte Carlo method? I'm proposing an approach that may or may not already be in the literature. 
Whether it is or not, I think it's fun to get there by combining many of the things I've previously talked about here, such as free monads, <a href="https://en.wikipedia.org/wiki/Negative_probability">negative probabilities</a> and automatic differentiation. I'm going to assume you're familiar with using dual numbers to compute derivatives as I've written about this before and <a href="https://en.wikipedia.org/wiki/Dual_number">wikipedia</a> has the basics. <P><BR><BR><b>A probability monad</b><p><P><BR>I want to play with a number of different approaches to using monads with probability theory. Rather than define lots of monads I think that the easiest thing is to simply work with one free monad and then provide different interpreters for it. <P><BR>First some imports: <P><BR><pre><br />> import Control.Monad<br />> import qualified System.Random as R<br />> import qualified Data.Map.Strict as M<br /><P><BR><br /></pre>I'm going to use a minimal free monad that effectively gives us a DSL with a new function that allows us to talk about random Bernoulli variables: <P><BR><pre><br />> data Random p a = Pure a | Bernoulli p (Int -> Random p a)<br /><P><BR><br /></pre>The idea is that <tt>Pure a</tt> represents the value <tt>a</tt> and <tt>Bernoulli p f</tt> is used to say "if we had a random value <tt>x</tt>, <tt>f x</tt> is the value we're interested in". The <tt>Random</tt> type isn't going to do anything other than represent these kinds of expressions. There's no implication that we actually have a random value for <tt>x</tt> yet. <P><BR><pre><br />> instance Functor (Random p) where<br />>     fmap f (Pure a) = Pure (f a)<br />>     fmap f (Bernoulli p g) = Bernoulli p (fmap f . 
g)<br /><P><BR><br />> instance Applicative (Random p) where<br />>     pure = return<br />>     (<*>) = ap<br /><P><BR><br />> instance Monad (Random p) where<br />>     return = Pure<br />>     Pure a >>= f = f a<br />>     Bernoulli p g >>= f = Bernoulli p (\x -> g x >>= f)<br /><P><BR><br /></pre>We'll use <tt>bernoulli p</tt> to represent a random Bernoulli variable drawn from <img src="https://chart.googleapis.com/chart?cht=tx&chl=B%28p%29" style="vertical-align:middle">. <P><BR><pre><br />> bernoulli :: p -> Random p Int<br />> bernoulli p = Bernoulli p return<br /><P><BR><br /></pre>So let's write our first random expression: <P><BR><pre><br />> test1 :: Random Float Float<br />> test1 = do<br />>     xs <- replicateM 4 (bernoulli 0.75)<br />>     return $ fromIntegral $ sum xs<br /><P><BR><br /></pre>It sums 4 Bernoulli random variables from <img src="https://chart.googleapis.com/chart?cht=tx&chl=B%28%5Cfrac%7B3%7D%7B4%7D%29" style="vertical-align:middle"> and converts the result to a <tt>Float</tt>. The expected value is 3. <P><BR>We don't yet have a way to do anything with this expression. So let's write an interpreter that can substitute pseudo-random values for each occurrence of <tt>bernoulli p</tt>: <P><BR>It's essentially interpreting our free monad as a state monad where the state is the random number seed: <P><BR><pre><br />> interpret1 :: (Ord p, R.Random p, R.RandomGen g) => Random p a -> g -> (a, g)<br />> interpret1 (Pure a) seed = (a, seed)<br />> interpret1 (Bernoulli prob f) seed = <br />>     let (r, seed') = R.random seed<br />>         b = if r <= prob then 1 else 0<br />>     in interpret1 (f b) seed'<br /><P><BR><br /></pre>You can use the expression <tt>R.getStdRandom (interpret1 test1)</tt> if you want to generate some random samples for yourself. 
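<P><BR>If you'd like to see the state-threading idea without depending on the <tt>random</tt> package, here is a minimal self-contained sketch. It is not the code used in this post: it swaps <tt>System.Random</tt> for a toy linear congruential generator, and the names <tt>Gen</tt>, <tt>nextUnit</tt>, <tt>sample</tt> and <tt>draws</tt> are hypothetical, introduced only for illustration.

```haskell
import Control.Monad (ap, replicateM)

-- The same free monad as in the post, repeated so this sketch stands alone.
data Random p a = Pure a | Bernoulli p (Int -> Random p a)

instance Functor (Random p) where
  fmap f (Pure a)        = Pure (f a)
  fmap f (Bernoulli p g) = Bernoulli p (fmap f . g)

instance Applicative (Random p) where
  pure  = Pure
  (<*>) = ap

instance Monad (Random p) where
  Pure a        >>= f = f a
  Bernoulli p g >>= f = Bernoulli p (\x -> g x >>= f)

bernoulli :: p -> Random p Int
bernoulli p = Bernoulli p Pure

-- Hypothetical toy generator: a linear congruential generator.
-- Assumption: good enough to illustrate the plumbing, not for real simulations.
newtype Gen = Gen Int

-- Advance the generator and return a number in [0,1).
nextUnit :: Gen -> (Double, Gen)
nextUnit (Gen s) =
  let s' = (1103515245 * s + 12345) `mod` 2147483648
  in (fromIntegral s' / 2147483648, Gen s')

-- Interpret the expression by threading the generator through it,
-- replacing each Bernoulli node with a concrete 0 or 1.
sample :: Random Double a -> Gen -> (a, Gen)
sample (Pure a) g = (a, g)
sample (Bernoulli p f) g =
  let (r, g') = nextUnit g
  in sample (f (if r <= p then 1 else 0)) g'

-- Draw n samples in a row, passing the updated generator along each time.
draws :: Int -> Random Double a -> Gen -> [a]
draws 0 _ _ = []
draws n r g = let (a, g') = sample r g in a : draws (n - 1) r g'
```

For instance, <tt>draws 5 (replicateM 4 (bernoulli 0.75)) (Gen 42)</tt> gives five lists, each holding four values that are 0 or 1.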
<P><BR>We're interested in the expected value, so here's a function to compute that: <P><BR><pre><br />> expect1 :: (Fractional p, Ord p, R.Random p, R.RandomGen g) => Random p p -> Int -> g -> (p, g)<br />> expect1 r n g = <br />>     let (x, g') = sum1 0 r n g<br />>     in (x/fromIntegral n, g')<br /><P><BR><br />> sum1 :: (Ord p, Num p, R.Random p, R.RandomGen g) => p -> Random p p -> Int -> g -> (p, g)<br />> sum1 t r 0 g = (t, g)<br />> sum1 t r n g =<br />>     let (a, g') = interpret1 r g<br />>     in sum1 (t+a) r (n-1) g'<br /><P><BR><br /></pre>You can test it out with <tt>R.getStdRandom (expect1 test1 1000)</tt>. You should get values around 3. <P><BR>We can try completely different semantics for <tt>Random</tt>. This time we compute the entire probability distribution: <P><BR><pre><br />> interpret2 :: (Num p) => Random p a -> [(a, p)]<br />> interpret2 (Pure a) = [(a, 1)]<br />> interpret2 (Bernoulli p f) =<br />>     scale p (interpret2 (f 1)) ++ scale (1-p) (interpret2 (f 0))<br /><P><BR><br />> scale :: Num p => p -> [(a, p)] -> [(a, p)]<br />> scale s = map (\(a, p) -> (a, s*p))<br /><P><BR><br /></pre>You can try it with <tt>interpret2 test1</tt>. <P><BR>Unfortunately, as it stands it doesn't collect together multiple occurrences of the same value. We can do that with this function: <P><BR><pre><br />> collect :: (Ord a, Num b) => [(a, b)] -> [(a, b)]<br />> collect = M.toList . M.fromListWith (+)<br /><P><BR><br /></pre>And now you can use <tt>collect (interpret2 test1)</tt>. <P><BR>Let's compute some expected values: <P><BR><pre><br />> expect2 :: (Num p) => Random p p -> p<br />> expect2 r = sum $ map (uncurry (*)) (interpret2 r)<br /><P><BR><br /></pre>The value of <tt>expect2 test1</tt> should be exactly 3. One nice thing about <tt>interpret2</tt> is that it is differentiable with respect to the Bernoulli parameter when this is meaningful. Unfortunately it has one very big catch: the value of <tt>interpret2</tt> can be a very long list. 
Even a small simulation can result in lists too big to store in the known universe. But <tt>interpret1</tt> doesn't produce differentiable results. Is there something in-between these two interpreters? <P><BR><BR><b>Importance sampling</b><p>Frequently in Monte Carlo sampling it isn't convenient to sample from the distribution you want. For example it might be intractably hard to do so, or you might have proven that the resulting estimate has a high variance. So instead you can sample from a different, but possibly related distribution. This is known as <a href="https://en.wikipedia.org/wiki/Importance_sampling">importance sampling</a>. Whenever you do this you must keep track of how "wrong" your probability was and patch up your expectation estimate at the end. For example, suppose a coin comes up heads 3/4 of the time. Instead of simulating a coin toss that comes up heads 3/4 of the time you could simulate one that comes up heads half of the time. Suppose at one point in the simulation it does come up heads. Then you used a probability of 1/2 when you should have used 3/4. So when you compute the expectation you need to scale the contribution from this sample by (3/4)/(1/2) = 3/2. You need to scale appropriately for every random variable used. A straightforward way to see this for the case of a single Bernoulli variable is to note that <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=E%28f%28X%29%29%20%3D%20pf%281%29%2B%281-p%29f%280%29%20%3D%20q%5Cfrac%7Bp%7D%7Bq%7Df%281%29%2B%281-q%29%5Cfrac%7B1-p%7D%7B1-q%7Df%280%29" style="vertical-align:middle">. 
</blockquote>We've replaced probabilities <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=1-p" style="vertical-align:middle"> with <img src="https://chart.googleapis.com/chart?cht=tx&chl=q" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=1-q" style="vertical-align:middle"> but we had to scale <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle"> appropriately in each of the cases <img src="https://chart.googleapis.com/chart?cht=tx&chl=f%280%29" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=f%281%29" style="vertical-align:middle"> to keep the final value the same. I'm going to call the scale value the <i>importance</i>. If we generate <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> random numbers in a row we need to multiply all of the <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> importance values that we generate. This is a perfect job for the <tt>Writer</tt> monad using the <tt>Product</tt> monoid. (See Eric Kidd's <a href="http://www.randomhacks.net/files/build-your-own-probability-monads.pdf">paper</a> for some discussion about the connection between <tt>Writer</tt> and importance sampling.) However I'm just going to write an explicit interpreter for our free monad to make it clear what's going where. <P><BR>This interpreter is going to take an additional argument as input. It'll be a rule saying what probability we should sample with when handling a variable drawn from <img src="https://chart.googleapis.com/chart?cht=tx&chl=B%28p%29" style="vertical-align:middle">. The probability should be a real number in the interval <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5B0%2C1%5D" style="vertical-align:middle">. 
<P><BR><pre><br />> interpret3 :: (Fractional p, R.RandomGen g) =><br />>     (p -> Float) -> Random p a -> g -> ((a, p), g)<br />> interpret3 rule (Pure a) g = ((a, 1), g)<br />> interpret3 rule (Bernoulli p f) g = <br />>     let (r, g') = R.random g<br />>         prob = rule p<br />>         (b, i) = if (r :: Float) <= prob<br />>             then (1, p/realToFrac prob)<br />>             else (0, (1-p)/realToFrac (1-prob))<br />>         ((a, i'), g'') = interpret3 rule (f b) g'<br />>     in ((a, i*i'), g'')<br /><P><BR><br /></pre>Here's the accompanying code for the expectation: <P><BR><pre><br />> expect3 :: (Fractional p, R.RandomGen g) =><br />>     (p -> Float) -> Random p p -> Int -> g -> (p, g)<br />> expect3 rule r n g = <br />>     let (x, g') = sum3 rule 0 r n g<br />>     in (x/fromIntegral n, g')<br /><P><BR><br />> sum3 :: (Fractional p, R.RandomGen g) =><br />>     (p -> Float) -> p -> Random p p -> Int -> g -> (p, g)<br />> sum3 rule t r 0 g = (t, g)<br />> sum3 rule t r n g =<br />>     let ((a, imp), g') = interpret3 rule r g<br />>     in sum3 rule (t+a*imp) r (n-1) g'<br /><P><BR><br /></pre>For example, you can estimate the expectation of <tt>test1</tt> using unbiased coin tosses by evaluating <tt>R.getStdRandom (expect3 (const 0.5) test1 1000)</tt>. <P><BR><BR><b>Generalising probability</b><p>Did you notice I made my code slightly more general than seems to be needed? Although I use probabilities of type <tt>Float</tt> to generate my Bernoulli samples, the argument to the function <tt>bernoulli</tt> can be of a more general type. This means that we can use importance sampling to compute expected values for generalised measures that take values in a more general algebraic structure than the interval [0,1]. For example, we could use negative probabilities. <a href="https://arxiv.org/abs/1401.2561">An Operational Interpretation of Negative Probabilities and No-Signalling Models</a> by Abramsky and Brandenburger gives a way to interpret expressions involving negative probabilities. 
We can implement it using <tt>interpret3</tt> and the rule <tt>\p -> abs p/(abs p+abs (1-p))</tt>. Note that it is guaranteed to produce values in the range [0,1] (if you start with dual numbers with real parts that are ordinary probabilities) and reproduces the usual behaviour when given ordinary probabilities. <P><BR>Here's a simple expression using a sample from "<img src="https://chart.googleapis.com/chart?cht=tx&chl=B%282%29" style="vertical-align:middle">": <P><BR><pre><br />> test2 = do<br />>     a <- bernoulli 2<br />>     return $ if a==1 then 2.0 else 1.0<br /><P><BR><br /></pre>Its expected value is 3. We can get this exactly using <tt>expect2 test2</tt>. For a Monte Carlo estimate use <P><BR><pre><br />R.getStdRandom (expect3 (\p -> abs p/(abs p+abs (1-p))) test2 1000)<br /><P><BR><br /></pre>Note that estimates involving negative probabilities can have quite high variances so try a few times until you get something close to 3 :-) <P><BR>We don't have to stick with real numbers. We can use this approach to estimate with complex probabilities (aka quantum mechanics) or other algebraic structures. <P><BR><BR><b>Discrete yet differentiable</b><p>And now comes the trick: automatic differentiation uses the algebra of dual numbers. It's not obvious at all what a probability like <img src="https://chart.googleapis.com/chart?cht=tx&chl=0.5%2B%5Cepsilon" style="vertical-align:middle"> means when <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cepsilon" style="vertical-align:middle"> is infinitesimal. However, we can use <tt>interpret3</tt> to give it meaningful semantics. 
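<P><BR>As a worked sketch of what this means for a single variable, take the importance sampling identity from earlier with true probability 0.5+ε and sampling probability q = 0.5, and use ε² = 0. This is only a hand calculation under those assumptions, but it shows where the derivative ends up:

```latex
\begin{align*}
E(f(X)) &= \tfrac12\cdot\frac{0.5+\epsilon}{0.5}\,f(1)
         + \tfrac12\cdot\frac{0.5-\epsilon}{0.5}\,f(0) \\
        &= \tfrac12(1+2\epsilon)\,f(1) + \tfrac12(1-2\epsilon)\,f(0) \\
        &= \tfrac12\bigl(f(1)+f(0)\bigr) + \epsilon\,\bigl(f(1)-f(0)\bigr).
\end{align*}
```

The real part is the ordinary expectation at p = 0.5, and the coefficient of ε, namely f(1)-f(0), is exactly the derivative of pf(1)+(1-p)f(0) with respect to p. The samples themselves are still plain 0s and 1s; only the importances carry the ε.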
<P><BR>Let's define the duals in the usual way first: <P><BR><pre><br />> data Dual a = D { real :: a, infinitesimal :: a }<br /><P><BR><br />> instance (Ord a, Num a) => Num (Dual a) where<br />>     D a b + D a' b' = D (a+a') (b+b')<br />>     D a b * D a' b' = D (a*a') (a*b'+a'*b)<br />>     negate (D a b) = D (negate a) (negate b)<br />>     abs (D a b) = if a > 0 then D a b else D (-a) (-b)<br />>     signum (D a b) = D (signum a) 0<br />>     fromInteger a = D (fromInteger a) 0<br /><P><BR><br />> instance (Ord a, Fractional a) => Fractional (Dual a) where<br />>     fromRational a = D (fromRational a) 0<br />>     recip (D a b) = let ia = 1/a in D ia (-b*ia*ia)<br /><P><BR><br />> instance Show a => Show (Dual a) where<br />>     show (D a b) = show a ++ "[" ++ show b ++ "]"<br /><P><BR><br /></pre>Now we can use the rule <tt>real</tt> to give us a real-valued probability from a dual number. The function <tt>expect3</tt> will push the infinitesimal part into the importance value so it doesn't get forgotten about. And now <tt>expect3</tt> gives us an estimate that is differentiable despite the fact that our random variables are discrete. <P><BR>Let's try an expression: <P><BR><pre><br />> test3 p = do<br />>     a <- bernoulli p<br />>     b <- bernoulli p<br />>     return $ if a == 1 && b == 1 then 1.0 else 0.0<br /><P><BR><br /></pre>The expected value is <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%5E2" style="vertical-align:middle"> and the derivative is <img src="https://chart.googleapis.com/chart?cht=tx&chl=2p" style="vertical-align:middle">. We can evaluate at <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%3D0.5" style="vertical-align:middle"> with <tt>expect2 (test3 (D 0.5 1))</tt>. And we can estimate it with <P><BR><pre><br />R.getStdRandom (expect3 real (test3 (D 0.5 1)) 1000)<br /><P><BR><br /></pre>What's neat is that we can parameterise our distributions in a more complex way and we can freely mix with conventional expressions in our parameter. 
Here's an example: <P><BR><pre><br />> test4 p = do<br />>     a <- bernoulli p<br />>     b <- bernoulli (p*p)<br />>     return $ p*fromIntegral a*fromIntegral b<br /><P><BR><br /></pre>Try evaluating <tt>expect2 (test4 (D 0.5 1))</tt> and <pre><br />R.getStdRandom (expect3 real (test4 (D 0.5 1)) 1000)<br /><P><BR><br /></pre>I've collected the above examples together here: <P><BR><pre><br />> main = do<br />>     print =<< R.getStdRandom (interpret1 test1)<br />>     print $ collect $ interpret2 test1<br />>     print =<< R.getStdRandom (expect1 test1 1000)<br />>     print (expect2 test1)<br />>     print =<< R.getStdRandom (expect3 id test1 1000)<br />>     print =<< R.getStdRandom (expect3 (const 0.5) test1 1000)<br />>     print "---"<br />>     print $ expect2 test2<br />>     print =<< R.getStdRandom (expect3 (\p -> abs p/(abs p+abs (1-p))) test2 1000)<br />>     print "---"<br />>     print $ expect2 (test3 (D 0.5 1))<br />>     print =<< R.getStdRandom (expect3 real (test3 (D 0.5 1)) 1000)<br />>     print "---"<br />>     print $ expect2 (test4 (D 0.5 1))<br />>     print =<< R.getStdRandom (expect3 real (test4 (D 0.5 1)) 1000)<br /><P><BR><br /></pre><BR><b>What just happened?</b><p>You can think of a dual number as a real number that has been infinitesimally slightly <a href="https://en.wikipedia.org/wiki/Deformation_theory">deformed</a>. To differentiate something we need to deform something. But we can't deform 0 or 1 and have them stay 0 or 1. So the trick is to embed probability sampling in something "bigger", namely importance sampling, where samples carry around an importance value. This bigger thing does allow infinitesimal deformations. And that allows differentiation. This process of turning something discrete into something continuously "deformable" is generally called <a href="https://en.wikipedia.org/wiki/Relaxation_%28approximation%29">relaxation</a>. <P><BR><BR><b>Implementation details</b><p>I've made no attempt to make my code fast. 
However I don't think there's anything about this approach that's incompatible with performance. There's no need to use a monad. Instead you can track the importance value through your code by hand and implement everything in C. Additionally, I've previously <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.159.4990&rep=rep1&type=pdf">written</a> about the fact that for any trick involving forward mode AD there is another corresponding trick you can use with reverse mode AD. So this method is perfectly compatible with back-propagation. Note also that the dual number importances always have real part 1 which means you don't actually need to store them. <P><BR>The bad news is that the derivative estimate can sometimes have a high variance. Nonetheless, I've used it successfully for some toy optimisation problems. I don't know if this approach is effective for industrial strength problems. Your mileage may vary :-) <P><BR><BR><b>Alternatives</b><p>Sometimes you may find that it is acceptable to deform the samples from your discrete distribution. In that case you can use the <a href="https://arxiv.org/abs/1611.00712">concrete relaxation</a>. <P><BR><BR><b>Continuous variables</b><p>The above method can be adapted to work with continuous variables. There is a non-trivial step which I'll leave as an exercise but I've tested it in some Python code. I think it reproduces a <a href="http://blog.shakirm.com/2015/11/machine-learning-trick-of-the-day-5-log-derivative-trick/">standard technique</a> and it gives an alternative way to think about that trick. That article is also useful for ways to deal with the variance issues. Note also that importance sampling is normally used itself as a variance reduction technique. So there are probably helpful ways to modify the <tt>rule</tt> argument to <tt>interpret3</tt> to simultaneously estimate derivatives and keep the variance low. 
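<P><BR>To get a feel for how the choice of sampling probability affects the variance, here is a small exact calculation for a single Bernoulli variable. It's only a sketch: the function <tt>estimatorMoments</tt> is a hypothetical helper, not part of the code above, and it assumes ordinary probabilities with 0 &lt; q &lt; 1 so that the divisions make sense.

```haskell
import Data.Ratio ((%))

-- Hypothetical helper: exact mean and variance of the one-sample importance
-- estimator for E(f(X)) with X ~ B(p), when we actually sample b ~ B(q) and
-- weight the result by p/q (if b is 1) or (1-p)/(1-q) (if b is 0).
-- Assumes 0 < q < 1.
estimatorMoments :: Rational -> Rational -> (Int -> Rational) -> (Rational, Rational)
estimatorMoments p q f =
  let x1     = (p / q) * f 1               -- weighted value when b is 1
      x0     = ((1 - p) / (1 - q)) * f 0   -- weighted value when b is 0
      mean   = q * x1 + (1 - q) * x0       -- equals p*f 1 + (1-p)*f 0 for any q
      second = q * x1 * x1 + (1 - q) * x0 * x0
  in (mean, second - mean * mean)
```

With p = 3/4 and f b = fromIntegral b, sampling at q = 3/4 gives mean 3/4 and variance 3/16, while sampling with a fair coin (q = 1/2) gives the same mean but variance 9/16. The mean never depends on q, but the variance does, which is the room a cleverer <tt>rule</tt> has to play with.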
<P><BR><BR><b>Personal note</b><p>I've thought about this problem a couple of times over the years. Each time I've ended up thinking "there's no easy way to extend AD to work with random variables so don't waste any more time thinking about it". So don't listen to anything I say. Also, I like that this method sort of comes "for free" once you combine methods I've described previously. <P><BR><BR><b>Acknowledgements</b><p>I think it was Eric Kidd's paper on <a href="http://www.randomhacks.net/files/build-your-own-probability-monads.pdf">building probability monads</a> that first brought to my attention that there are many kinds of semantics you can use with probability theory - i.e. there are many interpreters you can write for the <tt>Random</tt> monad. I think there is an interesting design space worth exploring here. <P><BR><BR><b>Answer to exercise</b><p>I set the continuous case as an exercise above. Here is a solution. <P><BR>Suppose you're sampling from a distribution parameterised by <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctheta" style="vertical-align:middle"> with pdf <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%28%5Ctheta%2C%20%5Ccdot%29" style="vertical-align:middle">. To compute the derivative with respect to <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctheta" style="vertical-align:middle"> you need to consider sampling from <img src="https://chart.googleapis.com/chart?cht=tx&chl=p%28%5Ctheta%2B%5Cepsilon%2C%20%5Ccdot%29" style="vertical-align:middle"> where <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cepsilon" style="vertical-align:middle"> is an infinitesimal. <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=p%28%5Ctheta%2B%5Cepsilon%2C%20x%29%20%3D%20p%28%5Ctheta%2C%20x%29%2B%5Cepsilon%5Cfrac%7B%5Cpartial%20p%28%5Ctheta%2C%20x%29%7D%7B%5Cpartial%5Ctheta%7D" style="vertical-align:middle">. 
</blockquote>As we don't know how to sample from a pdf with infinitesimals in it, we instead sample using <img src="https://chart.googleapis.com/chart?cht=tx&chl=p" style="vertical-align:middle"> as usual, but use an importance of <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%28p%28%5Ctheta%2C%20x%29%2B%5Cepsilon%5Cfrac%7B%5Cpartial%20p%28%5Ctheta%2C%20x%29%7D%7B%5Cpartial%5Ctheta%7D%29%2Fp%28%5Ctheta%2C%20x%29%20%3D%201%2B%5Cepsilon%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%5Ctheta%7D%5Clog%20p%28%5Ctheta%2C%20x%29" style="vertical-align:middle"></blockquote>The coefficient of the <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cepsilon" style="vertical-align:middle"> gives the derivative. So we need to compute the expectation, scaling each sample with this coefficient. In other words, to estimate the derivative of <img src="https://chart.googleapis.com/chart?cht=tx&chl=E%28f%28X%29%29" style="vertical-align:middle"> with respect to <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctheta" style="vertical-align:middle"> we use <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cfrac%7B1%7D%7BN%7D%5Csum_%7Bi%3D1%7D%5EN%20f%28x_i%29%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%5Ctheta%7D%5Clog%20p%28%5Ctheta%2C%20x_i%29" style="vertical-align:middle"> </blockquote>where the <img src="https://chart.googleapis.com/chart?cht=tx&chl=x_i" style="vertical-align:middle"> are drawn from the original distribution. This is exactly what is described at <a href="http://blog.shakirm.com/2015/11/machine-learning-trick-of-the-day-5-log-derivative-trick/">Shakir Mohamed's blog</a>. <P><BR><BR><b>Final word</b><p>I managed to find the method in the literature. It's part of the <a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">REINFORCE method</a>. For example, see equation (5) there. 
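<P><BR>As a quick sanity check on the estimator from the answer to the exercise, here is a worked case where the expectation can be done by hand. The setup is my own choice for illustration: X is normally distributed with mean θ and fixed variance σ², and f(x) = x, so E(f(X)) = θ and the derivative we hope to recover is 1.

```latex
\begin{align*}
\frac{\partial}{\partial\theta}\log p(\theta, x)
  &= \frac{\partial}{\partial\theta}\left(-\frac{(x-\theta)^2}{2\sigma^2}\right)
   = \frac{x-\theta}{\sigma^2}, \\
E\!\left(X\,\frac{X-\theta}{\sigma^2}\right)
  &= \frac{E(X^2)-\theta E(X)}{\sigma^2}
   = \frac{(\sigma^2+\theta^2)-\theta^2}{\sigma^2}
   = 1 .
\end{align*}
```

So in this case the expectation of the weighted samples is exactly the derivative of E(X) = θ with respect to θ, as expected.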
sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com7tag:blogger.com,1999:blog-11295132.post-82660760361962124902017-02-05T10:30:00.000-08:002017-02-16T07:10:25.685-08:00Logarithms and exponentials of functions<BR><b>Introduction</b><p>A popular question in mathematics is this: given a function <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle">, what is its "square root" <img src="https://chart.googleapis.com/chart?cht=tx&chl=g" style="vertical-align:middle"> in the sense that <img src="https://chart.googleapis.com/chart?cht=tx&chl=g%28g%28x%29%29%20%3D%20f%28x%29" style="vertical-align:middle">. There are many questions about this on <a href="http://mathoverflow.net/questions/tagged/fractional-iteration">mathoverflow</a> but it's also a popular subject in mathematics forums for non-experts. This question seems to have a certain amount of notoriety because it's easy to ask but hard to answer fully. I want to look at an approach that works nicely for formal power series, following from the Haskell code I wrote <a href="http://blog.sigfpe.com/2007/11/small-combinatorial-library.html">here</a>. There are some methods for directly finding "functional square roots" for formal power series that start as <img src="https://chart.googleapis.com/chart?cht=tx&chl=z%2Ba_2z%5E2%2Ba_3z%5E3%2B%5Cldots" style="vertical-align:middle">, but I want to approach the problem indirectly. When working with real numbers we can find square roots, say, by using <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Csqrt%7Bx%7D%3D%5Cexp%28%5Cfrac%7B1%7D%7B2%7D%5Clog%7Bx%7D%29" style="vertical-align:middle">. I want to use an analogue of this for functions. So my goal is to make sense of the idea of the logarithm and exponential of a formal power series as composable functions. Warning: the arguments are all going to be informal. 
<P><BR><BR><b>Notation</b><p>There's potential for a lot of ambiguous notation here, especially as the usual mathematical notation for <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle">th powers of trig functions is so misleading. I'm going to use <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ccirc" style="vertical-align:middle"> for composition of functions and power series, and I'm going to use the notation <img src="https://chart.googleapis.com/chart?cht=tx&chl=f%5E%7B%5Ccirc%20n%7D" style="vertical-align:middle"> to mean the <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle">th iterate of <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle">. So <img src="https://chart.googleapis.com/chart?cht=tx&chl=f%5E%7Bn%2B1%7D%28x%29%20%3D%20f%28x%29f%5En%28x%29" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=f%5E%7B%5Ccirc%20n%2B1%7D%28x%29%20%3D%20f%28f%5E%7B%5Ccirc%20n%7D%28x%29%29" style="vertical-align:middle">. As I'll be working mostly in the ring of formal power series <img src="https://chart.googleapis.com/chart?cht=tx&chl=R%5B%5C%21%5Bz%5D%5C%21%5D" style="vertical-align:middle"> for some ring <img src="https://chart.googleapis.com/chart?cht=tx&chl=R" style="vertical-align:middle">, I'll reserve the variable <img src="https://chart.googleapis.com/chart?cht=tx&chl=z" style="vertical-align:middle"> to refer only to the corresponding element in this ring. I'll also use formal power series somewhat interchangeably with functions. So <img src="https://chart.googleapis.com/chart?cht=tx&chl=z" style="vertical-align:middle"> can be thought of as representing the identity function. 
To make sure we're on the same page, here are some small theorems in this notation: <ol><li><img src="https://chart.googleapis.com/chart?cht=tx&chl=z%5Emz%5En%20%3D%20z%5E%7Bm%2Bn%7D" style="vertical-align:middle"></li><li><img src="https://chart.googleapis.com/chart?cht=tx&chl=f%5E%7B%5Ccirc%20m%7D%5Ccirc%20f%5E%7B%5Ccirc%20n%7D%20%3D%20f%5E%7B%5Ccirc%20m%2Bn%7D" style="vertical-align:middle"></li><li><img src="https://chart.googleapis.com/chart?cht=tx&chl=%281%2Bz%29%5En%20%3D%20%5Csum_%7Bi%3D0%7D%5En%7Bn%5Cchoose%20i%7Dz%5Ei" style="vertical-align:middle"></li><li><img src="https://chart.googleapis.com/chart?cht=tx&chl=%281%2Bz%29%5E%7B%5Ccirc%20n%7D%3Dn%2Bz" style="vertical-align:middle">.</li></ol>That last one simply says that adding one <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> times is the same as adding <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle">. <P><BR>As I'm going to have ordinary logarithms and exponentials sitting around, as well as functional logarithms and exponentials, I'm going to introduce the notation <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D" style="vertical-align:middle"> for functional logarithm and <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D" style="vertical-align:middle"> for functional exponentiation. <P><BR><BR><b>Preliminaries</b><p>The first goal is to define a non-trivial function <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D" style="vertical-align:middle"> with the fundamental property that <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D%28f%5E%7B%5Ccirc%20n%7D%29%3Dn%5Coperatorname%7BLOG%7D%28f%29" style="vertical-align:middle"><P><BR>First, let's note some basic algebraic facts. 
The formal power series form a commutative ring with operations <img src="https://chart.googleapis.com/chart?cht=tx&chl=%2B" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ccdot" style="vertical-align:middle"> (ordinary multiplication) and with additive identity <img src="https://chart.googleapis.com/chart?cht=tx&chl=0" style="vertical-align:middle"> and multiplicative identity <img src="https://chart.googleapis.com/chart?cht=tx&chl=1" style="vertical-align:middle">. The formal power series form a ring-like algebraic structure with operation <img src="https://chart.googleapis.com/chart?cht=tx&chl=%2B" style="vertical-align:middle"> and partial operation <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ccirc" style="vertical-align:middle"> with additive identity <img src="https://chart.googleapis.com/chart?cht=tx&chl=0" style="vertical-align:middle"> and multiplicative identity <img src="https://chart.googleapis.com/chart?cht=tx&chl=z" style="vertical-align:middle">. But it's not actually a ring or even a <a href="https://en.wikipedia.org/wiki/Near-ring">near-ring</a>. Composition isn't defined for all formal power series and even when it's defined, we don't have distributivity. For example, in general <img src="https://chart.googleapis.com/chart?cht=tx&chl=f%5Ccirc%28g%2Bh%29%5Cne%20f%5Ccirc%20g%2Bf%5Ccirc%20h" style="vertical-align:middle">, after all there's no reason to expect <img src="https://chart.googleapis.com/chart?cht=tx&chl=f%28g%28x%29%2Bh%28x%29%29" style="vertical-align:middle"> to equal <img src="https://chart.googleapis.com/chart?cht=tx&chl=f%28g%28x%29%29%2Bf%28h%28x%29%29" style="vertical-align:middle">. We do have right-distributivity however, i.e. 
<blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%28f%2Bg%29%5Ccirc%20h%20%3D%20f%5Ccirc%20h%2Bg%5Ccirc%20h" style="vertical-align:middle">, </blockquote>because <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%28f%2Bg%29%28h%28x%29%29%3Df%28h%28x%29%29%2Bg%28h%28x%29%29" style="vertical-align:middle">, </blockquote>more or less by definition of <img src="https://chart.googleapis.com/chart?cht=tx&chl=%2B" style="vertical-align:middle">. <P><BR><BR><b>We can't use power series on our power series</b><p>There's an obvious approach, just use power series of power series. So we might tentatively suggest that <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D%28z%2Bf%29%20%3D%20f-%5Cfrac%7B1%7D%7B2%7Df%5E%7B%5Ccirc%202%7D%2B%5Cfrac%7B1%7D%7B3%7Df%5E%7B%5Ccirc%203%7D%2B%5Cldots" style="vertical-align:middle">. </blockquote>Note that I consider <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D%28z%2Bf%29" style="vertical-align:middle"> rather than <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D%281%2Bf%29" style="vertical-align:middle"> because <img src="https://chart.googleapis.com/chart?cht=tx&chl=z" style="vertical-align:middle"> is the multiplicative identity in our ring-like structure. <P><BR>Unfortunately this doesn't work. The reason is this: if we try to use standard reasoning to show that the resulting function has the fundamental property we seek we end up using distributivity. We don't have distributivity. <P><BR><BR><b>Sleight of hand</b><p>There's a beautiful trick I spotted on mathoverflow recently that allows us to bring back distributivity. (I can't find the trick again, but when I do I'll come back and add a link and credit here.) 
Consider the function <img src="https://chart.googleapis.com/chart?cht=tx&chl=R%28g%29" style="vertical-align:middle"> defined by <img src="https://chart.googleapis.com/chart?cht=tx&chl=R%28g%29%28f%29%20%3D%20f%5Ccirc%20g" style="vertical-align:middle">. In other words <img src="https://chart.googleapis.com/chart?cht=tx&chl=R%28g%29" style="vertical-align:middle"> is right-composition by <img src="https://chart.googleapis.com/chart?cht=tx&chl=g" style="vertical-align:middle">. (Ambiguity alert, I'm using <img src="https://chart.googleapis.com/chart?cht=tx&chl=R" style="vertical-align:middle"> here to mean <i>right</i>. It has nothing to do with the ring underlying our formal power series.) Because we have right-distributivity, <img src="https://chart.googleapis.com/chart?cht=tx&chl=R%28g%29" style="vertical-align:middle"> is a <i>bona fide</i> linear operator on the space of formal power series. If you think of formal power series as being infinitely long vectors of coefficients then <img src="https://chart.googleapis.com/chart?cht=tx&chl=R%28g%29" style="vertical-align:middle"> can be thought of as an infinitely sized matrix. This means that as long as we have convergence, we can get away with using power series to compute <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Clog%20R%28g%29" style="vertical-align:middle"> with the property that <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Clog%28R%28g%29%5En%29%20%3D%20n%5Clog%20R%28g%29" style="vertical-align:middle">. Define: <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D%28f%29%20%3D%20%5Clog%28R%28f%29%29z" style="vertical-align:middle">. 
</blockquote>We have: <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D%28f%29%20%3D%20%5Clog%28R%28f%29%29z%20%3D%20%5Clog%281%2B%28R%28f%29-1%29%29z" style="vertical-align:middle"></blockquote>where I'm using <img src="https://chart.googleapis.com/chart?cht=tx&chl=1" style="vertical-align:middle"> to mean the identity linear operator. And now we have: <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D%28f%29%20%3D%20%28R%28f%29-1%29z-%5Cfrac%7B1%7D%7B2%7D%28R%28f%29-1%29%5E2z%2B%5Cfrac%7B1%7D%7B3%7D%28R%28f%29-1%29%5E3z%2B%5Cldots" style="vertical-align:middle">. </blockquote>But does it converge? Suppose <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle"> is of the form <img src="https://chart.googleapis.com/chart?cht=tx&chl=z%2Ba_2z%5E2%2Ba_3z%5E3%2B%5Cldots" style="vertical-align:middle">. Then <img src="https://chart.googleapis.com/chart?cht=tx&chl=%28R%28f%29-1%29g%20%3D%20g%5Ccirc%20f-g" style="vertical-align:middle">. The leading term in <img src="https://chart.googleapis.com/chart?cht=tx&chl=g%5Ccirc%20f" style="vertical-align:middle"> is the same as the leading term in <img src="https://chart.googleapis.com/chart?cht=tx&chl=g" style="vertical-align:middle">. So <img src="https://chart.googleapis.com/chart?cht=tx&chl=R%28f%29-1" style="vertical-align:middle"> kills the first term of whatever it is applied to, which means that when we sum the terms in <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D%28f%29" style="vertical-align:middle">, we only need <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> terms to get a power series correct to <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> coefficients. 
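<P><BR>To make the convergence argument concrete, here is a minimal self-contained sketch, separate from the library code used in the rest of this post. It assumes series truncated to a fixed number of coefficients, and the names <tt>Series</tt>, <tt>order</tt>, <tt>compose</tt>, <tt>rMinus1</tt> and <tt>leadingOrder</tt> are invented purely for this illustration. It applies R(f)-1 repeatedly to z, with f = z+z^2, and records the position of the first non-zero coefficient each time.

```haskell
-- Truncated power series: coefficient list, lowest order first.
type Series = [Rational]

-- Number of coefficients kept in this sketch (an arbitrary choice).
order :: Int
order = 8

-- Pad or cut a coefficient list to exactly `order` entries.
trunc :: Series -> Series
trunc s = take order (s ++ repeat 0)

add, sub, mul :: Series -> Series -> Series
add a b = zipWith (+) (trunc a) (trunc b)
sub a b = zipWith (-) (trunc a) (trunc b)
mul a b = [ sum [ a' !! i * b' !! (k - i) | i <- [0 .. k] ] | k <- [0 .. order - 1] ]
  where a' = trunc a
        b' = trunc b

-- g composed with f, by Horner's rule. Only valid when f has zero
-- constant term, which is the case of interest here.
compose :: Series -> Series -> Series
compose g f = foldr (\c acc -> add [c] (mul f acc)) (trunc []) (trunc g)

-- The operator (R(f) - 1) applied to g, i.e. g . f - g.
rMinus1 :: Series -> Series -> Series
rMinus1 f g = sub (compose g f) g

-- Index of the first non-zero coefficient.
leadingOrder :: Series -> Int
leadingOrder = length . takeWhile (== 0)

z, f :: Series
z = trunc [0, 1]        -- z
f = trunc [0, 1, 1]     -- z + z^2

-- Leading orders of z, (R(f)-1)z, (R(f)-1)^2 z, (R(f)-1)^3 z.
orders :: [Int]
orders = map leadingOrder (take 4 (iterate (rMinus1 f) z))
```

Each application pushes the leading term up by one power, so <tt>orders</tt> comes out as <tt>[1,2,3,4]</tt>. That is why the k-th term of the sum cannot disturb the coefficients that earlier terms have already fixed. <P><BR>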
Reusing my code from <a href="http://blog.sigfpe.com/2007/11/small-combinatorial-library.html">here</a>, I call <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D" style="vertical-align:middle"> by the name <tt>flog</tt>. Here is its implementation: <P><BR><pre><br />> import Data.Ratio<br /><P><BR><br />> flog :: (Eq a, Fractional a) => [a] -> [a]<br />> flog f@(0 : 1 : _) =<br />> flog' 1 (repeat 0) (0 : 1 : repeat 0)<br />> where flog' n total term = take (n+1) total ++ (<br />> drop (n+1) $<br />> let pz = p term<br />> in flog' (n+1) (total-map (((-1)^n / fromIntegral n) *) pz) pz)<br />> p total = (total ○ f) - total<br /><P><BR><br /></pre>The <tt>take</tt> and <tt>drop</tt> are how I tell Haskell when the first <img src="https://chart.googleapis.com/chart?cht=tx&chl=n%2B1" style="vertical-align:middle"> coefficients have been exactly computed and so no more terms are necessary. <P><BR>Does it work? <P><BR>Here's an example using the twice iterated sin function: <P><BR><pre><br />> ex1 = do<br />> let lhs = flog (sin (sin z))<br />> let rhs = 2*flog (sin z)<br />> mapM_ print $ take 20 (lhs-rhs)<br /><P><BR><br /></pre>Works to 20 coefficients. Dare we try an inverse function? <P><BR><pre><br />> ex2 = do<br />> let lhs = flog (sin z)<br />> let rhs = flog (asin z)<br />> mapM_ print $ take 20 (lhs+rhs)<br /><P><BR><br /></pre>Seems to work! <P><BR><BR><b>Exponentials</b><p>It's no good having logarithms if we can't invert them. One way to think about the exponential function is that <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cexp%28x%29%20%3D%20%5Clim_%7Bn%5Crightarrow%20%5Cinfty%7D%281%2B%5Cfrac%7Bx%7D%7Bn%7D%29%5En" style="vertical-align:middle"></blockquote>We get better and better approximations by writing the expression inside the limit as a product of more and more terms. 
We can derive the usual power series for <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cexp" style="vertical-align:middle"> from this, but only if left-distributivity holds. So let's try to use the above expression directly: <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D%28f%29%20%3D%20%5Clim_%7Bn%5Crightarrow%20%5Cinfty%7D%28z%2B%5Cfrac%7Bf%7D%7Bn%7D%29%5E%7B%5Ccirc%20n%7D" style="vertical-align:middle"></blockquote>and get <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D%28f%29%20%3D%20%5Clim_%7Bn%5Crightarrow%20%5Cinfty%7DR%28z%2B%5Cfrac%7Bf%7D%7Bn%7D%29%5Enz" style="vertical-align:middle">. </blockquote>Unfortunately, even though <img src="https://chart.googleapis.com/chart?cht=tx&chl=R%28g%29" style="vertical-align:middle"> is linear, <img src="https://chart.googleapis.com/chart?cht=tx&chl=R" style="vertical-align:middle"> itself isn't. So it's going to take some extra work to raise <img src="https://chart.googleapis.com/chart?cht=tx&chl=R%28z%2Bf%2Fn%29" style="vertical-align:middle"> to the power of <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle">. <P><BR>The good news is that we're dealing with the special case <img src="https://chart.googleapis.com/chart?cht=tx&chl=R%28z%2B%5Cepsilon%29" style="vertical-align:middle"> where <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cepsilon" style="vertical-align:middle"> is something small. We have <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=R%28z%2B%5Cepsilon%29f%3Df%28z%2B%5Cepsilon%29%3Df%28z%29%2B%5Cepsilon%5Cfrac%7Bdf%7D%7Bdz%7D%2BO%28%5Cepsilon%5E2%29" style="vertical-align:middle">. 
</blockquote>So <img src="https://chart.googleapis.com/chart?cht=tx&chl=R%28z%2Bf%2Fn%29" style="vertical-align:middle"> is actually <img src="https://chart.googleapis.com/chart?cht=tx&chl=1%2B%5Cfrac%7B1%7D%7Bn%7Df%5Cfrac%7Bd%7D%7Bdz%7D" style="vertical-align:middle"> modulo higher order terms. This gives us <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D%28f%29%20%3D%20%5Clim_%7Bn%5Crightarrow%20%5Cinfty%7D%281%2B%5Cfrac%7B1%7D%7Bn%7Df%5Cfrac%7Bd%7D%7Bdz%7D%29%5Enz%3D%5Cexp%28f%5Cfrac%7Bd%7D%7Bdz%7D%29z" style="vertical-align:middle">. </blockquote>This is something we can implement using the power series for ordinary <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cexp" style="vertical-align:middle">: <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D%28f%29%20%3D%20z%2Bf%2B%5Cfrac%7B1%7D%7B2%21%7Df%5Cfrac%7Bdf%7D%7Bdz%7D%2B%5Cfrac%7B1%7D%7B3%21%7Df%5Cfrac%7Bd%7D%7Bdz%7D%28f%5Cfrac%7Bdf%7D%7Bdz%7D%29%2B%5Cldots" style="vertical-align:middle">. </blockquote>In code that becomes: <P><BR><pre><br />> fexp f@(0 : 0 : _) = fexp' f 0 z 1<br />> fexp' f total term n = take (n-1) total ++ drop (n-1)<br />> (fexp' f (total+term) (map (/fromIntegral n) (f*d term)) (n+1))<br /><P><BR><br /></pre>Note how when we differentiate a power series we shift the coefficients down by one place. To counter the effect of that so as to ensure convergence we need <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle"> to look like <img src="https://chart.googleapis.com/chart?cht=tx&chl=a_2z%5E2%2Ba_3z%5E3%2B%5Cldots" style="vertical-align:middle">. Luckily this is exactly the kind of series <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D" style="vertical-align:middle"> gives us. 
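<P><BR>As a small sanity check on the series for EXP, here is a self-contained sketch that sums its first few terms for f = z^2. It is independent of <tt>fexp</tt> and uses a fixed truncation length rather than lazy infinite lists; <tt>Series</tt>, <tt>order</tt>, <tt>deriv</tt> and <tt>fexpTrunc</tt> are names assumed for this illustration only. For this f the successive terms of the series work out to z, z^2, z^3, and so on, so the sum should be the geometric series z/(1-z).

```haskell
-- Truncated power series: coefficient list, lowest order first.
type Series = [Rational]

-- Number of coefficients kept in this sketch (an arbitrary choice).
order :: Int
order = 8

-- Pad or cut a coefficient list to exactly `order` entries.
trunc :: Series -> Series
trunc s = take order (s ++ repeat 0)

-- Truncated Cauchy product.
mul :: Series -> Series -> Series
mul a b = [ sum [ a' !! i * b' !! (k - i) | i <- [0 .. k] ] | k <- [0 .. order - 1] ]
  where a' = trunc a
        b' = trunc b

-- Formal derivative, padded back to the truncation length.
deriv :: Series -> Series
deriv s = trunc (zipWith (*) (map fromInteger [1 ..]) (drop 1 s))

-- Partial sum of exp(f d/dz) z, using
--   term_0 = z,   term_k = f * d(term_{k-1})/dz / k.
-- Assumes f starts at z^2, so each term starts one power higher than
-- the last and `order` terms are enough for `order` coefficients.
fexpTrunc :: Series -> Series
fexpTrunc f = foldr1 (zipWith (+)) (take order terms)
  where
    z = trunc [0, 1]
    terms = map fst (iterate step (z, 1))
    step (t, k) = (map (/ k) (mul f (deriv t)), k + 1)

-- EXP(z^2), truncated.
example :: Series
example = fexpTrunc (trunc [0, 0, 1])
```

Here <tt>example</tt> is a zero followed by seven ones, which are the first eight coefficients of z/(1-z).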
<P><BR>But does it successfully invert <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D" style="vertical-align:middle">? Let's try: <P><BR><pre><br />> ex3 = do<br />> let lhs = sin z<br />> let rhs = fexp (flog (sin z))<br />> mapM_ print $ take 20 (lhs-rhs)<br /><P><BR><br /></pre>Now we can start computing fractional iterates. Square root first: <P><BR><pre><br />> ex4 = do<br />> mapM_ print $ take 20 $ fexp (flog (sin z)/2)<br /><P><BR><br /></pre>That matches the results at <a href="http://oeis.org/A048602">A048602</a> and <a href="http://oeis.org/A048603">A048603</a>. <P><BR>Cube root: <P><BR><pre><br />> ex5 = do<br />> mapM_ print $ take 20 $ fexp (flog (sin z)/3)<br /><P><BR><br /></pre>Matches <a href="http://oeis.org/A052132">A052132</a> and <a href="http://oeis.org/A052135">A052135</a>. <P><BR>And this gives an alternative to Lagrange inversion for computing power series for inverse functions: <P><BR><pre><br />> ex6 = do<br />> let lhs = fexp (-flog (sin z))<br />> let rhs = asin z<br />> mapM_ print $ take 20 (lhs-rhs)<br /><P><BR><br /></pre><BR><b>What's really going on with <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D" style="vertical-align:middle">?</b><p>Let's approach <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D" style="vertical-align:middle"> in a slightly different way. In effect, <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D" style="vertical-align:middle"> is the composition of <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> lots of <img src="https://chart.googleapis.com/chart?cht=tx&chl=z%2B%5Cfrac%7Bf%7D%7Bn%7D" style="vertical-align:middle"> with <img src="https://chart.googleapis.com/chart?cht=tx&chl=z" style="vertical-align:middle">. 
So let's try composing these one at a time, with one composition every <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cfrac%7B1%7D%7Bn%7D" style="vertical-align:middle"> seconds. After one second we should have our final result. We can write this as: <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=g%280%29%20%3D%20z" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=g%28t%2B%5Cfrac%7B1%7D%7Bn%7D%29%20%3D%20g%28t%29%2B%5Cfrac%7B1%7D%7Bn%7Df%28g%28t%29%29" style="vertical-align:middle"> to first order. </blockquote>So we're solving the differential equation: <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=g%280%29%20%3D%20z" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cfrac%7Bdg%7D%7Bdt%7D%20%3D%20f%28g%28t%29%29" style="vertical-align:middle"></blockquote>with <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D%28f%29%20%3D%20g%281%29" style="vertical-align:middle">. <P><BR>So <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D" style="vertical-align:middle"> is the function that solves one of the most fundamental differential equations. This also means I can use Mathematica to solve symbolically and check my results. 
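<P><BR>The "one composition every 1/n seconds" picture can also be tried numerically. The following is only a rough sketch working with <tt>Double</tt> values at a single point rather than with formal power series, and the names <tt>eulerFlow</tt>, <tt>exactFlow</tt> and <tt>approxError</tt> are assumptions made for this example. It takes f(g) = g^2, for which the differential equation can be solved by hand: g(t) = x0/(1 - t x0).

```haskell
-- Compose n copies of the map x -> x + f x / n, starting from x0.
-- This is the n-fold composition of z + f/n evaluated at a point.
eulerFlow :: (Double -> Double) -> Int -> Double -> Double
eulerFlow f n x0 = iterate step x0 !! n
  where step x = x + f x / fromIntegral n

-- Closed-form solution at t = 1 of dg/dt = g^2 with g(0) = x0,
-- valid for x0 < 1.
exactFlow :: Double -> Double
exactFlow x0 = x0 / (1 - x0)

-- Distance between the n-step composition and the exact flow at x0 = 0.1.
approxError :: Int -> Double
approxError n = abs (eulerFlow (\x -> x * x) n 0.1 - exactFlow 0.1)
```

Evaluating <tt>approxError</tt> at increasing values of n should show the composition closing in on the exact time-one flow. <P><BR>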
For example, Mathematica says that the solution to <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cfrac%7Bdg%7D%7Bdt%7D%3D%5Csin%28g%28t%29%29%5E2" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=g%280%29%3Dz" style="vertical-align:middle"></blockquote>at <img src="https://chart.googleapis.com/chart?cht=tx&chl=t%3D1" style="vertical-align:middle"> is <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=g%281%29%20%3D%20%5Carctan%5Cfrac%7B%5Ctan%20z%7D%7B1-%5Ctan%20z%7D" style="vertical-align:middle"></blockquote>so let's check: <P><BR><pre><br />> ex7 = do<br />> let lhs = fexp ((sin z)^2)<br />> let rhs = atan (tan z/(1-tan z))<br />> mapM_ print $ take 20 (lhs-rhs)<br /><P><BR><br /></pre>I like this example because it leads to the generalized Catalan numbers <a href="http://oeis.org/A004148">A004148</a>: <P><BR><pre><br />> ex8 = do<br />> mapM_ print $ take 20 $ fexp (z^2/(1-z^2))<br /><P><BR><br /></pre>That suggests this question: what does <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D" style="vertical-align:middle"> mean combinatorially? I don't have a straightforward answer but solving this class of differential equation motivated the original introduction, by Cayley, of the abstract notion of a tree. See <a href="https://arxiv.org/abs/1512.00906">here</a>. <P><BR><BR><b>What is going on geometrically?</b><p>For those who know some differential geometry, the differential equation <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=g%280%29%20%3D%20z" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cfrac%7Bdg%7D%7Bdt%7D%20%3D%20f%28g%28t%29%29" style="vertical-align:middle"></blockquote>describes a flow on the real line (or complex plane). 
You can think of <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle"> as being a one-dimensional vector field describing how points move from time <img src="https://chart.googleapis.com/chart?cht=tx&chl=t" style="vertical-align:middle"> to <img src="https://chart.googleapis.com/chart?cht=tx&chl=t%2Bdt" style="vertical-align:middle">. When we solve the differential equation we get <a href="https://en.wikipedia.org/wiki/Integral_curve">integral curves</a> that these points follow and <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D" style="vertical-align:middle"> tells us where the points end up after one unit of time. So <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D" style="vertical-align:middle"> is the <a href="https://en.wikipedia.org/wiki/Exponential_map_%28Riemannian_geometry%29">exponential map</a>. In fact, <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D%28f%29%3D%5Cexp%28f%5Cfrac%7Bd%7D%7Bdz%7D%29z" style="vertical-align:middle"> is essentially the exponential of the vector field <img src="https://chart.googleapis.com/chart?cht=tx&chl=f%5Cfrac%7Bd%7D%7Bdz%7D" style="vertical-align:middle"> where we're now using the differential geometer's notion of a vector field as a differential operator. <P><BR><BR><b>Final word</b><p>Unfortunately the power series you get from using <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D" style="vertical-align:middle"> don't always have good convergence properties. For example, I'm not sure but I think the series for <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Csin%5E%7B%5Ccirc%201%2F2%7D%20z" style="vertical-align:middle"> has radius of convergence zero. 
If you truncate the series you get a half-decent approximation to a square root in the vicinity of the origin, but the approximation gets worse, not better, if you use more terms. <P><BR><BR><b>And the rest of the code</b><p><P><BR><pre><br />> (*!) _ 0 = 0<br />> (*!) a b = a*b<br />> (!*) 0 _ = 0<br />> (!*) a b = a*b<br />> (^+) a b = zipWith (+) a b<br />> (^-) a b = zipWith (-) a b<br /><P><BR><br />> ~(a:as) ⊗ (b:bs) = (a *! b):<br />> ((map (a !*) bs) ^+ (as ⊗ (b:bs)))<br />> (○) (f:fs) (0:gs) = f:(gs ⊗ (fs ○ (0:gs)))<br />> inverse (0:f:fs) = x where x = map (recip f *) (0:1:g)<br />> _:_:g = map negate ((0:0:fs) ○ x)<br />> invert x = r where r = map (/x0) ((1:repeat 0) ^- (r ⊗ (0:xs)))<br />> x0:xs = x <br /><P><BR><br />> (^/) (0:a) (0:b) = a ^/ b<br />> (^/) a b = a ⊗ (invert b)<br /><P><BR><br />> z :: [Rational]<br />> z = 0:1:repeat 0<br /><P><BR><br />> d (_:x) = zipWith (*) (map fromInteger [1..]) x<br /><P><BR><br />> integrate x = 0 : zipWith (/) x (map fromInteger [1..])<br /><P><BR><br />> instance (Eq r, Num r) => Num [r] where<br />> x+y = zipWith (+) x y<br />> x-y = zipWith (-) x y<br />> ~x*y = x ⊗ y<br />> fromInteger x = fromInteger x:repeat 0<br />> negate x = map negate x<br />> signum (x:_) = signum x : repeat 0<br />> abs (x:xs) = error "Can't form abs of a power series"<br /><P><BR><br />> instance (Eq r, Fractional r) => Fractional [r] where<br />> x/y = x ^/ y<br />> fromRational x = fromRational x:repeat 0<br /><P><BR><br />> sqrt' x = 1 : rs where rs = map (/2) (xs ^- (rs ⊗ (0:rs)))<br />> _ : xs = x<br />> instance (Eq r, Fractional r) => Floating [r] where<br />> sqrt (1 : x) = sqrt' (1 : x)<br />> sqrt _ = error "Can only find sqrt when leading term is 1"<br />> exp x = e where e = 1+integrate (e * d x)<br />> log x = integrate (d x/x)<br />> sin x = integrate ((cos x)*(d x))<br />> cos x = [1] ... 
negate (integrate ((sin x)*(d x)))<br />> asin x = integrate (d x/sqrt(1-x*x))<br />> atan x = integrate (d x/(1+x*x))<br />> acos x = error "Unable to form power series for acos"<br />> sinh x = integrate ((cosh x)*(d x))<br />> cosh x = [1] ... integrate ((sinh x)*(d x))<br />> asinh x = integrate (d x/sqrt(1+x*x))<br />> atanh x = integrate (d x/(1-x*x))<br />> acosh x = error "Unable to form power series for acosh"<br />> pi = error "There is no formal power series for pi"<br /><P><BR><br />> lead [] x = x<br />> lead (a:as) x = a : (lead as (tail x))<br />> a ... x = lead a x<br /><P><BR><br />> (//) :: Fractional a => [a] -> (Integer -> Bool) -> [a]<br />> (//) a c = zipWith (\a-> \b->(if (c a :: Bool) then b else 0)) [(0::Integer)..] a<br /><P><BR><br /></pre>A direct functional square root that doesn't use <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BLOG%7D" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Coperatorname%7BEXP%7D" style="vertical-align:middle">: <P><BR><pre><br />> fsqrt (0 : 1 : fs) =<br />> let gs = (fs-(0 : gs*((0 : delta gs gs)+((2 : gs)*(gs*g)))))/2<br />> g = 0 : 1 : gs<br />> delta (g : gs) h = let g' = delta gs h<br />> in (0 : ((1 : h) * g')) + gs<br />> in g<br /></pre>sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com7tag:blogger.com,1999:blog-11295132.post-81374390082943687642017-01-09T08:33:00.000-08:002017-01-09T08:33:19.385-08:00Building free arrows from components<BR><b>Introduction</b><p>Gabriel Gonzalez has written quite a bit about the <a href="http://www.haskellforall.com/2012/07/purify-code-using-free-monads.html">practical applications of free monads</a>. And "haoformayor" wrote a <a href="http://stackoverflow.com/a/38172390">great stackoverflow post</a> on how arrows are related to strong profunctors. So I thought I'd combine these and apply them to arrows built from profunctors: free arrows. 
What you get is a way to use arrow notation to build programs, but defer the interpretation of those programs until later. <P><BR><BR><b>Heteromorphisms</b><p>Using the notation <a href="https://ncatlab.org/nlab/show/profunctor">here</a> I'm going to call an element of a type <tt>P a b</tt>, where <tt>P</tt> is a profunctor, a <i>heteromorphism</i>. <P><BR><BR><b>A product that isn't much of a product</b><p>As I described a <a href="http://blog.sigfpe.com/2011/07/profunctors-in-haskell.html">while back</a> you can compose profunctors. Take a look at the code I used, and also <a href="http://hackage.haskell.org/package/Chart-gtk-1.1/docs/Data-Profunctor-Composition.html">Data.Profunctor.Composition</a>. <P><BR><pre><br />data Compose f g d c = forall a. Compose (f d a) (g a c)<br /><P><BR><br /></pre>An element of <tt>Compose f g d c</tt> is just a pair of heteromorphisms, one from each of the profunctors, <tt>f</tt> and <tt>g</tt>, with the proviso that the "output" type of one is compatible with the "input" type of the other. As products go it's pretty weak in the sense that no composition happens beyond the two objects being stored with each other. And that's the basis of what I'm going to talk about. The <tt>Compose</tt> type is just a placeholder for pairs of heteromorphisms whose actual "multiplication" is being deferred until later. This is similar to the situation with the free monoid, otherwise known as a list. We can "multiply" two lists together using <tt>mappend</tt> but all that really does is combine the elements into a bigger list. The elements themselves aren't touched in any way. That suggests the idea of using profunctor composition in the same way that <tt>(:)</tt> is used to pair elements and lists. 
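<P><BR>To see what "deferred multiplication" means in the simplest possible case, here is a small sketch in which both profunctors are plain functions. The <tt>Compose</tt> type is the one given above; <tt>runCompose</tt> and <tt>example</tt> are names made up for this illustration and are not part of any library.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- Same shape as the Compose type above: two heteromorphisms that share
-- a hidden intermediate type a.
data Compose f g d c = forall a. Compose (f d a) (g a c)

-- One possible interpreter for the case where both profunctors are (->).
-- Only here does any actual composition take place.
runCompose :: Compose (->) (->) d c -> d -> c
runCompose (Compose f g) = g . f

-- A stored pair: add one, then render as a string. Nothing is composed
-- when this value is built; the two functions just sit side by side.
example :: Compose (->) (->) Int String
example = Compose (+ 1) show
```

With these definitions <tt>runCompose example 41</tt> gives <tt>"42"</tt>. Until <tt>runCompose</tt> is applied, <tt>example</tt> is nothing more than a pair of functions.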
<P><BR><BR><b>Free Arrows</b><p>Here's some code: <P><BR><pre><br />> {-# OPTIONS -W #-}<br />> {-# LANGUAGE ExistentialQuantification #-}<br />> {-# LANGUAGE Arrows #-}<br />> {-# LANGUAGE RankNTypes #-}<br />> {-# LANGUAGE TypeOperators #-}<br />> {-# LANGUAGE FlexibleInstances #-}<br /><P><BR><br />> import Prelude hiding ((.), id)<br />> import Control.Arrow<br />> import Control.Category<br />> import Data.Profunctor<br />> import Data.Monoid<br /><P><BR><br />> infixr :-<br /><P><BR><br />> data FreeA p a b = PureP (a -> b)<br />> | forall x. p a x :- FreeA p x b<br /><P><BR><br /></pre>First look at the second line of the definition of <tt>FreeA</tt>. It says that a <tt>FreeA p a b</tt> might be a pair consisting of a head heteromorphism whose output matches the input of another <tt>FreeA</tt>. There's also the <tt>PureP</tt> case which is acting like the empty list <tt>[]</tt>. The reason we use this is that for our composition, <tt>(->)</tt> acts a lot like the identity. In particular <tt>Composition (->) p a b</tt> is isomorphic to <tt>p a b</tt> (modulo all the usual stuff about non-terminating computations and so on). This is because an element of this type is a pair consisting of a function <tt>a -> x</tt> and a heteromorphism <tt>p x b</tt> for some type <tt>x</tt> we don't get to see. We can't project back out either of these items without information about the type of <tt>x</tt> escaping. So the only thing we can possibly do is use <tt>lmap</tt> to apply the function to the heteromorphism giving us an element of <tt>p a b</tt>. <P><BR>Here is a special case of <tt>PureP</tt> we'll use later: <P><BR><pre><br />> nil :: Profunctor p => FreeA p a a<br />> nil = PureP id<br /><P><BR><br /></pre>So an element of <tt>FreeA</tt> is a sequence of heteromorphisms. If heteromorphisms are thought of as operations of some sort, then an element of <tt>FreeA</tt> is a sequence of operations waiting to be composed together into a program that does something. 
And that's just like the situation with free monads. Once we've built a free monad structure we apply an interpreter to it to evaluate it. This allows us to separate the "pure" structure representing what we want to do from the code that actually does it. <P><BR>The first thing to note is our new type is also a profunctor. We can apply <tt>lmap</tt> and <tt>rmap</tt> to a <tt>PureP</tt> function straightforwardly. We apply <tt>lmap</tt> directly to the head of the list and we use recursion to apply <tt>rmap</tt> to the <tt>PureP</tt> at the end: <P><BR><pre><br />> instance Profunctor b => Profunctor (FreeA b) where<br />> lmap f (PureP g) = PureP (g . f)<br />> lmap f (g :- h) = (lmap f g) :- h<br />> rmap f (PureP g) = PureP (f . g)<br />> rmap f (g :- h) = g :- (rmap f h)<br /><P><BR><br /></pre>We also get a strong profunctor by applying <tt>first'</tt> all the way down the list: <P><BR><pre><br />> instance Strong p => Strong (FreeA p) where<br />> first' (PureP f) = PureP (first' f)<br />> first' (f :- g) = (first' f) :- (first' g)<br /><P><BR><br /></pre>We can now concatenate our lists of heteromorphisms using code that looks a lot like the typical implementation of <tt>(++)</tt>: <P><BR><pre><br />> instance Profunctor p => Category (FreeA p) where<br />> id = PureP id<br />> g . PureP f = lmap f g<br />> k . (g :- h) = g :- (k . h)<br /><P><BR><br /></pre>Note that it's slightly different to what you might have expected compared to <tt>(++)</tt> because we tend to write composition of functions "backwards". Additionally, there is another definition of <tt>FreeA</tt> we could have used that's analogous to using snoc lists instead of cons lists. <P><BR>And now we have an arrow. 
I'll leave the proofs that the arrow laws are obeyed as an exercise :-) <P><BR><pre><br />> instance (Profunctor p, Strong p) => Arrow (FreeA p) where<br />> arr = PureP<br />> first = first'<br /><P><BR><br /></pre>The important thing about free things is that we can apply interpreters to them. For lists we have folds: <P><BR><pre><br />foldr :: (a -> b -> b) -> b -> [a] -> b<br /><P><BR><br /></pre>In <tt>foldr f e</tt> we can think of <tt>f</tt> as saying how <tt>(:)</tt> should be interpreted and <tt>e</tt> as saying how <tt>[]</tt> should be interpreted. <P><BR>Analogously, in <tt>Control.Monad.Free</tt> in the <tt>free</tt> package we have: <P><BR><pre><br />foldFree :: Monad m => (forall x . f x -> m x) -> Free f a -> m a<br />foldFree _ (Pure a) = return a<br />foldFree f (Free as) = f as >>= foldFree f<br /><P><BR><br /></pre>Given a natural transformation from <tt>f</tt> to <tt>m</tt>, <tt>foldFree</tt> extends it to all of <tt>Free f</tt>. <P><BR>Now we need a fold for free arrows: <P><BR><pre><br />> foldFreeA :: (Profunctor p, Arrow a) =><br />> (forall b c.p b c -> a b c) -> FreeA p b c -> a b c<br />> foldFreeA _ (PureP g) = arr g<br />> foldFreeA f (g :- h) = foldFreeA f h . f g<br /><P><BR><br /></pre>It's a lot like an ordinary fold but uses the arrow composition law to combine the interpretation of the head with the interpretation of the tail. <P><BR><BR><b>"Electronic" components</b><p>Let me revisit the example from my previous <a href="http://blog.sigfpe.com/2017/01/addressing-pieces-of-state-with.html">article</a>. I'm going to remove things I won't need so my definition of <tt>Circuit</tt> is less general here. Free arrows are going to allow us to define individual components for a circuit, but defer exactly how those components are interpreted until later. <P><BR>I'll use four components this time: a register we can read from, one we can write to and a register incrementer, as well as a "pure" component. 
But before that, let's revisit Gabriel's article that gives some clues about how components should be built. In particular, look at the definition of <tt>TeletypeF</tt>: <P><BR><pre><br />data TeletypeF x<br /> = PutStrLn String x<br /> | GetLine (String -> x)<br /> | ExitSuccess<br /><P><BR><br /></pre>We use <tt>GetLine</tt> to read a string, and yet the type of <tt>GetLine k</tt> could be <tt>TeletypeF a</tt> for any <tt>a</tt>. The reason is that free monads work with continuations. Instead of <tt>GetLine</tt> returning a string to us, it's a holder for a function that says what we'd like to do with the string once we have it. That means we can leave open the question of where the string comes from. The function <tt>foldFree</tt> can be used to provide the actual string getter. <P><BR>Free arrows are like "two-sided" free monads. We don't just provide a continuation saying what we'd like to do to our output. We also get to say how we prepare our data for input. <P><BR>There's also some burden put on us. Free arrows need strong profunctors. Strong profunctors need to be able to convey extra data alongside the data we care about - that's what <tt>first'</tt> is all about. This means that even though <tt>Load</tt> is functionally similar to <tt>GetLine</tt>, it can't simply ignore its input. So we don't have <tt>Load (Int -> b)</tt>, and instead have <tt>Load ((a, Int) -> b)</tt>. Here is our component type: <P><BR><pre><br />> data Component a b = Load ((a, Int) -> b)<br />> | Store (a -> (b, Int))<br />> | Inc (a -> b)<br /><P><BR><br /></pre>The <tt>Component</tt> only knows about the data passing through, of type <tt>a</tt> and <tt>b</tt>. It doesn't know anything about how the data in the registers is stored. That's the part that will be deferred to later. We intend for <tt>Inc</tt> to increment a register. But as it doesn't know anything about registers nothing in the type of <tt>Inc</tt> refers to that. 
(It took a bit of experimentation for me to figure this out and there may be other ways of doing things. Often with code guided by category theory you can just "follow your nose" as there's one way that works and type checks. Here I found a certain amount of flexibility in how much you store in the <tt>Component</tt> and how much is deferred to the interpreter.) <P><BR>I could implement the strong profunctor instances using various combinators but I think it might be easiest to understand when written explicitly with lambdas: <P><BR><pre><br />> instance Profunctor Component where<br />> lmap f (Load g) = Load $ \(a, s) -> g (f a, s)<br />> lmap f (Store g) = Store (g . f)<br />> lmap f (Inc g) = Inc (g . f)<br /><P><BR><br />> rmap f (Load g) = Load (f . g)<br />> rmap f (Store g) = Store $ \a -> let (b, t) = g a<br />> in (f b, t)<br />> rmap f (Inc g) = Inc (f . g)<br /><P><BR><br />> instance Strong Component where<br />> first' (Load g) = Load $ \((a, x), s) -> (g (a, s), x)<br />> first' (Store g) = Store $ \(a, x) -> let (b, t) = g a<br />> in ((b, x), t)<br />> first' (Inc g) = Inc (first' g)<br /><P><BR><br /></pre>And now we can implement individual components. First a completely "pure" component: <P><BR><pre><br />> add :: Num a => FreeA Component (a, a) a<br />> add = PureP $ uncurry (+)<br /><P><BR><br /></pre>And now the load and store operations. 
<P><BR><pre><br />> load :: FreeA Component () Int<br />> load = Load (\(_, a) -> a) :- nil<br /><P><BR><br />> store :: FreeA Component Int ()<br />> store = Store (\a -> ((), a)) :- nil<br /><P><BR><br />> inc :: FreeA Component a a<br />> inc = Inc id :- nil<br /><P><BR><br /></pre>Finally we can tie it all together in a complete function using arrow notation: <P><BR><pre><br />> test = proc () -> do<br />> () <- inc -< ()<br />> a <- load -< ()<br />> b <- load -< ()<br />> c <- add -< (a, b)<br />> () <- store -< c<br /><P><BR><br />> returnA -< ()<br /><P><BR><br /></pre>At this point, the <tt>test</tt> object is just a list of operations waiting to be executed. Now I'll give three examples of semantics we could provide. The first uses a state arrow type similar to the previous article: <P><BR><pre><br />> newtype Circuit s a b = C { runC :: (a, s) -> (b, s) }<br /><P><BR><br />> instance Category (Circuit s) where<br />> id = C id<br />> C f . C g = C (f . g)<br /><P><BR><br />> instance Arrow (Circuit s) where<br />> arr f = C $ \(a, s) -> (f a, s)<br />> first (C g) = C $ \((a, x), s) -> let (b, t) = g (a, s)<br />> in ((b, x), t)<br /><P><BR><br /></pre>Here is an interpreter that interprets each of our components as an arrow. Note that this is where, among other things, we provide the meaning of the <tt>Inc</tt> operation: <P><BR><pre><br />> exec :: Component a b -> Circuit Int a b<br />> exec (Load g) = C $ \(a, s) -> (g (a, s), s)<br />> exec (Store g) = C $ \(a, _) -> g a<br />> exec (Inc g) = C $ \(a, s) -> (g a, s+1)<br /><P><BR><br /></pre>Here's a completely different interpreter that is going to make <i>you</i> do the work of maintaining the state used by the registers. You'll be told what to do! We'll use the <tt>Kleisli IO</tt> arrow to do the I/O. 
<P><BR><pre><br />> exec' :: Component a b -> Kleisli IO a b<br />> exec' (Load g) = Kleisli $ \a -> do<br />> putStrLn "What is your number now?"<br />> s <- fmap read getLine<br />> return $ g (a, s)<br />> exec' (Store g) = Kleisli $ \a -> do<br />> let (b, t) = g a<br />> putStrLn $ "Your number is now " ++ show t ++ "."<br />> return b<br />> exec' (Inc g) = Kleisli $ \a -> do<br />> putStrLn "Increment your number."<br />> return $ g a<br /><P><BR><br /></pre>The last interpreter is simply going to sum values associated to various components. They could be costs in dollars, time to execute, or even strings representing some kind of simple execution trace. <P><BR><pre><br />> newtype Labelled m a b = Labelled { unLabelled :: m }<br /><P><BR><br />> instance Monoid m => Category (Labelled m) where<br />> id = Labelled mempty<br />> Labelled a . Labelled b = Labelled (a `mappend` b)<br /><P><BR><br />> instance Monoid m => Arrow (Labelled m) where<br />> arr _ = Labelled mempty<br />> first (Labelled m) = Labelled m<br /><P><BR><br />> exec'' (Load _) = Labelled (Sum 1)<br />> exec'' (Store _) = Labelled (Sum 1)<br />> exec'' (Inc _) = Labelled (Sum 2)<br /><P><BR><br /></pre>Note that we can't assign non-trivial values to "pure" operations. <P><BR>And now we execute all three: <P><BR><pre><br />> main = do<br />> print $ runC (foldFreeA exec test) ((), 10)<br />> putStrLn "Your number is 10." >> runKleisli (foldFreeA exec' test) ()<br />> print $ getSum $ unLabelled $ foldFreeA exec'' test<br /><P><BR><br /></pre><BR><b>Various thoughts</b><p>I don't know if free arrows are anywhere near as useful as free monads, but I hope I've successfully illustrated one application. Note that because arrow composition is essentially list concatenation it may be more efficient to use a version of <a href="http://www.cs.tufts.edu/~nr/cs257/archive/john-hughes/lists.pdf">Hughes lists</a>. 
This is what the Cayley representation is about in the <a href="https://arxiv.org/abs/1406.4823">monoid notions paper</a>. But it's easier to see the naive list version first. Something missing from here that is essential for electronics simulation is the possibility of using loops. I haven't yet thought too much about what it means to build instances of <tt>ArrowLoop</tt> freely. <P><BR>Profunctors have been described as decategorised matrices in the sense that <tt>p a b</tt>, with <tt>p</tt> a profunctor, is similar to the matrix <img src="https://chart.googleapis.com/chart?cht=tx&chl=P_%7Bab%7D" style="vertical-align:middle">. Or, if you're working in a context where you distinguish between co- and contravariant vectors, it's similar to <img src="https://chart.googleapis.com/chart?cht=tx&chl=P_a%5Eb" style="vertical-align:middle">. The <tt>Composition</tt> operation is a lot like the definition of matrix product. From this perspective, the <tt>FreeA</tt> operation is a lot like the function on matrices that takes <img src="https://chart.googleapis.com/chart?cht=tx&chl=P" style="vertical-align:middle"> to <img src="https://chart.googleapis.com/chart?cht=tx&chl=%281-P%29%5E%7B-1%7D" style="vertical-align:middle">. To work with <tt>ArrowLoop</tt> we need a trace-like operation. <P><BR>One nice application of free monads is in writing plugin APIs. Users can write plugins that link to a small library based on a free monad. These can then be dynamically loaded and interpreted by an application at runtime, completely insulating the plugin-writer from the details of the application. You can think of it as a Haskell version of the <a href="http://wiki.c2.com/?PimplIdiom">PIMPL idiom</a>. Free arrows might give a nice way to write plugins for dataflow applications. <P><BR>People typically think of functors as containers. So in a free monad, each element is a container of possible futures. 
In a free arrow the relationship between the current heteromorphism and its "future" (and "past") is a bit more symmetrical. For example, for some definitions of <tt>P</tt>, a heteromorphism <tt>P a b</tt> can act on some <tt>a</tt>s to give us some <tt>b</tt>s. But some definitions of <tt>P</tt> can run "backwards" and act on elements of <tt>b -> r</tt> to give us elements of <tt>a -> r</tt>. So when I use the words "input" and "output" above, you might not want to take them too literally. sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com10tag:blogger.com,1999:blog-11295132.post-12746724578568533832017-01-07T13:46:00.000-08:002017-01-07T13:46:44.073-08:00Addressing Pieces of State with Profunctors<BR><b>Attempted segue</b><p>Since I first wrote about <a href="http://blog.sigfpe.com/2011/07/profunctors-in-haskell.html">profunctors</a> there has been quite a bit of activity in the area so I think it's about time I revisited them. I could just carry on from where I left off 5 years ago but there have been so many tutorials on the subject that I think I'll have to assume you've looked at them. My favourite is probably Phil Freeman's <a href="https://www.youtube.com/watch?v=OJtGECfksds">Fun with Profunctors</a>. What I intend to do here is solve a practical problem with profunctors. <P><BR><BR><b>The problem</b><p><a href="https://www.haskell.org/arrows/">Arrows</a> are a nice mechanism for building circuit-like entities in code. In fact, they're quite good for simulating electronic circuits. Many circuits are very much like pieces of functional code. 
For example an AND gate like this <pre><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/--Ar1nA4cFDg/WHERrmCLLrI/AAAAAAAAxqY/e7-VbCVnRkIGQfelausx_IWszy7lCYhQgCLcB/s1600/and.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/--Ar1nA4cFDg/WHERrmCLLrI/AAAAAAAAxqY/e7-VbCVnRkIGQfelausx_IWszy7lCYhQgCLcB/s320/and.png" width="320" height="119" /></a></div><br /></pre>can be nicely modelled using a pure function: <tt>c = a && b</tt>. But some components, like flip-flops, have internal state. What comes out of the outputs isn't a simple function of the inputs right now, but depends on what has happened in the past. (Alternatively you can take the view that the inputs and outputs aren't the current values but the complete history of the values.) <P><BR>We'll use (Hughes) arrows rather than simple functions. For example, one kind of arrow is the <a href="https://wiki.haskell.org/Arrow_tutorial#Kleisli_Arrows">Kleisli arrow</a>. For the case of Kleisli arrows built from the state monad, these are essentially functions of type <tt>a -> s -> (b, s)</tt> where <tt>s</tt> is our state. We can write these more symmetrically as functions of type <tt>(a, s) -> (b, s)</tt>. We can think of these as "functions" from <tt>a</tt> to <tt>b</tt> where the output is allowed to depend on some internal state <tt>s</tt>. I'll just go ahead and define arrows like this right now. <P><BR>First the extensions and imports: <P><BR><pre><br />> {-# OPTIONS -W #-}<br />> {-# LANGUAGE Arrows #-}<br />> {-# LANGUAGE RankNTypes #-}<br />> {-# LANGUAGE FlexibleInstances #-}<br /><P><BR><br />> import Prelude hiding ((.), id)<br />> import Control.Arrow<br />> import Control.Category<br />> import Data.Profunctor<br />> import Data.Tuple<br /><P><BR><br /></pre>And now I'll define our stateful circuits. 
I'm going to make these slightly more general than I described allowing circuits to change the type of their state: <P><BR><pre><br />> newtype Circuit s t a b = C { runC :: (a, s) -> (b, t) }<br /><P><BR><br />> instance Category (Circuit s s) where<br />> id = C id<br />> C f . C g = C (f . g)<br /><P><BR><br />> instance Arrow (Circuit s s) where<br />> arr f = C $ \(a, s) -> (f a, s)<br />> first (C g) = C $ \((a, x), s) -> let (b, t) = g (a, s)<br />> in ((b, x), t)<br /><P><BR><br /></pre>This is just a more symmetrical rewrite of the state monad as an arrow. The <tt>first</tt> method allows us to pass through some extra state, <tt>x</tt>, untouched. <P><BR>Now for some circuit components. First the "pure" operations, a multiplier and a negater: <P><BR><pre><br />> mul :: Circuit s s (Int, Int) Int<br />> mul = C $ \((x, y), s) -> (x*y, s)<br /><P><BR><br />> neg :: Circuit s s Int Int<br />> neg = C $ \(x, s) -> (-x, s)<br /><P><BR><br /></pre>And now some "impure" ones that read and write some registers as well as an accumulator: <P><BR><pre><br />> store :: Circuit Int Int Int ()<br />> store = C $ \(x, _) -> ((), x)<br /><P><BR><br />> load :: Circuit Int Int () Int<br />> load = C $ \((), s) -> (s, s)<br /><P><BR><br />> accumulate :: Circuit Int Int Int Int<br />> accumulate = C $ \(a, s) -> (a, s+a)<br /><P><BR><br /></pre>I'd like to make a circuit that has lots of these components, each with its own state. I'd like to store all of these bits of state in a larger container. But that means that each of these components needs to have a way to address its own particular substate. That's the problem I'd like to solve. <P><BR><BR><b>Practical profunctor optics</b><p>In an alternative universe lenses were <a href="https://www.reddit.com/r/haskell/comments/1nwetz/lenses_that_work_with_arrows/ccmq2ka/?st=ixklzuz3&sh=cb40d2f3">defined using profunctors</a>. To find out more I recommend Phil Freeman's talk that I linked to above. 
Most of the next paragraph is just a reminder of what he says in that talk and I'm going to use the bare minimum to do the job I want. <P><BR>Remember that one of the things lenses allow you to do is this: suppose we have a record <tt>s</tt> containing a field of type <tt>a</tt> and another similar enough kind of record <tt>t</tt> with a field of type <tt>b</tt>. Among other things, a lens gives a way to take a rule for modifying the <tt>a</tt> field to a <tt>b</tt> field and extend it to a way to modify the <tt>s</tt> record into a <tt>t</tt> record. So we can think of lenses as giving us functions of type <tt>(a -> b) -> (s -> t)</tt>. Now if <tt>p</tt> is a profunctor then you can think of <tt>p a b</tt> as being a bit function-like. Like functions, profunctors typically (kinda, sorta) get used to consume (zero or more) objects of type <tt>a</tt> and output (zero or more) objects of type <tt>b</tt>. So it makes sense to ask our lenses to work with these more general objects too, i.e. we'd like to be able to get something of type <tt>p a b -> p s t</tt> out of a lens. A strong profunctor is one that comes pre-packed with a lens that can do this for the special case where the types <tt>s</tt> and <tt>t</tt> are 2-tuples. But you can think of simple records as being syntactic sugar for tuples of fields, so strong profunctors also automatically give us lenses for records. Again, watch Phil's talk for details. <P><BR>So here is our lens type: <P><BR><pre><br />> type Lens s t a b = forall p. Strong p => p a b -> p s t<br /><P><BR><br /></pre>Here are lenses that mimic the well known ones from <tt>Control.Lens</tt>: <P><BR><pre><br />> _1 :: Lens (a, x) (b, x) a b<br />> _1 = first'<br /><P><BR><br />> _2 :: Lens (x, a) (x, b) a b<br />> _2 = dimap swap swap . first'<br /><P><BR><br /></pre>(Remember that <tt>dimap</tt> is a function to pre- and post- compose a function with two others.) 
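<P><BR>As a quick illustration of that reminder, here is a minimal standalone sketch of what <tt>dimap</tt> does when the profunctor is the ordinary function arrow. It is separate from the literate program in this post, so it isn't bird-tracked. The names <tt>lengthThenShow</tt> and <tt>onSecond</tt> are made up for the example, and it assumes the <tt>profunctors</tt> package is installed: <P><BR><pre>
import Data.Profunctor (dimap, first')
import Data.Tuple (swap)

-- For plain functions, dimap pre post f behaves like post . f . pre.
-- Here: take the length of a string, double it, then show the result.
lengthThenShow :: String -> String
lengthThenShow = dimap length show (* 2)

-- The same recipe as _2, with the profunctor fixed to functions:
-- swap the pair, act on the first component, then swap back.
onSecond :: (a -> b) -> (x, a) -> (x, b)
onSecond = dimap swap swap . first'

main :: IO ()
main = do
  putStrLn (lengthThenShow "abc")                -- prints 6
  print (onSecond (+ 1) (10 :: Int, 20 :: Int))  -- prints (10,21)
</pre>Nothing here is specific to functions except the type signatures: the body of <tt>onSecond</tt> is exactly the definition of <tt>_2</tt>, just used at one particular strong profunctor.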
<P><BR><a href="https://hackage.haskell.org/package/profunctors-3.3.0.1/docs/Data-Profunctor.html">Arrows are profunctors</a>. So <tt>Circuit s s</tt>, when wrapped in <tt>WrappedArrow</tt>, is a profunctor. So now we can directly use the <tt>Circuit</tt> type with profunctor lenses. This is cool, but it doesn't directly solve our problem. So we're not going to use this fact. We're interested in addressing the state of type <tt>s</tt>, not the values of type <tt>a</tt> and <tt>b</tt> passed through our circuits. In other words, we're interested in the fact that <tt>Circuit s t a b</tt> is a profunctor in <tt>s</tt> and <tt>t</tt>, not <tt>a</tt> and <tt>b</tt>. To make this explicit we need a suitable way to permute the arguments to <tt>Circuit</tt>: <P><BR><pre><br />> newtype Flipped p s t a b = F { unF :: p a b s t }<br /><P><BR><br /></pre>(It was <a href="https://hackage.haskell.org/package/bifunctors-3.2.0.1/docs/Data-Bifunctor-Clown.html">tempting</a> to call that <tt>ComedyDoubleAct</tt>.) <P><BR>And now we can define: <P><BR><pre><br />> instance Profunctor (Flipped Circuit a b) where<br />> lmap f (F (C g)) = F $ C $ \(a, s) -> g (a, f s)<br />> rmap f (F (C g)) = F $ C $ \(a, s) -> let (b, t) = g (a, s)<br />> in (b, f t)<br /><P><BR><br />> instance Strong (Flipped Circuit a b) where<br />> first' (F (C g)) = F $ C $ \(a, (s, x)) -> let (b, t) = g (a, s)<br />> in (b, (t, x))<br /><P><BR><br /></pre>Any time we want to use this instance of <tt>Profunctor</tt> with a <tt>Circuit</tt> we have to wrap everything with <tt>F</tt> and <tt>unF</tt>. The function <tt>dimap</tt> gives us a convenient way to implement such wrappings. <P><BR>Let's implement an imaginary circuit with four bits of state in it. 
<pre><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-v8mhMGIcJmQ/WHESx2RmexI/AAAAAAAAxqg/qGLZOzp1JO4eDLVamnucTJA2Q6jSSHWTQCLcB/s1600/circuit.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-v8mhMGIcJmQ/WHESx2RmexI/AAAAAAAAxqg/qGLZOzp1JO4eDLVamnucTJA2Q6jSSHWTQCLcB/s320/circuit.png" width="480" height="166" /></a></div><br /></pre>Here is the state: <P><BR><pre><br />> data CPU = CPU { _x :: Int, _y :: Int, _z :: Int, _t :: Int } deriving Show<br /><P><BR><br /></pre>As I don't have a complete profunctor version of a library like <tt>Control.Lens</tt> with its Template Haskell magic I'll set things up by hand. Here's a strong-profunctor-friendly version of the <tt>CPU</tt> and a useful isomorphism to go with it: <P><BR><pre><br />> type ExplodedCPU = (Int, (Int, (Int, Int)))<br /><P><BR><br />> explode :: CPU -> ExplodedCPU<br />> explode (CPU u v w t) = (u, (v, (w, t)))<br /><P><BR><br />> implode :: ExplodedCPU -> CPU<br />> implode (u, (v, (w, t))) = CPU u v w t<br /><P><BR><br /></pre>And now we need adapters that take lenses for an <tt>ExplodedCPU</tt> and (1) apply them to a CPU the way <tt>Control.Lens</tt> would... <P><BR><pre><br />> upgrade :: Profunctor p =><br />> (p a a -> p ExplodedCPU ExplodedCPU) -><br />> (p a a -> p CPU CPU)<br />> upgrade f = dimap explode implode . f<br /><P><BR><br />> x, y, z, t :: Flipped Circuit a b Int Int -> Flipped Circuit a b CPU CPU<br />> x = upgrade _1<br />> y = upgrade $ _2 . _1<br />> z = upgrade $ _2 . _2 . _1<br />> t = upgrade $ _2 . _2 . _2<br /><P><BR><br /></pre>...and (2) wrap them so they can be used on the flipped profunctor instance of <tt>Circuit</tt>: <P><BR><pre><br />> (!) :: p s t a b -> (Flipped p a b s t -> Flipped p a b s' t') -><br />> p s' t' a b<br />> x ! f = dimap F unF f x<br /><P><BR><br /></pre>After all that we can now write a short piece of code that represents our circuit. Notice how we can apply the lenses <tt>x, ..., t</tt> directly to our components to get them to use the right pieces of state: <P><BR><pre><br />> test :: Circuit CPU CPU () ()<br />> test = proc () -> do<br />> a <- load ! x -< ()<br />> b <- load ! y -< ()<br />> c <- mul -< (a, b)<br />> d <- neg -< c<br />> e <- accumulate ! t -< d<br />> () <- store ! z -< e<br /><P><BR><br />> returnA -< ()<br /><P><BR><br />> main :: IO ()<br />> main = do<br />> print $ runC test ((), CPU 2 30 400 5000)<br /><P><BR><br /></pre>Of course with a suitable profunctor lens library you can do a lot more, like work with traversable containers of components. <P><BR>Note that we could also write a version of all this code using monads instead of arrows. But it's easier to see the symmetry in <tt>Flipped Circuit</tt> when using arrows, and it also sets the scene for the next thing I want to write about... sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com0tag:blogger.com,1999:blog-11295132.post-7517128154540577622016-10-16T16:04:00.000-07:002016-10-18T06:59:25.884-07:00Expectation-Maximization with Less Arbitrariness<b>Introduction</b><p>There are many introductions to the Expectation-Maximisation algorithm. Unfortunately every one I could find uses arbitrary-seeming tricks that appear to be plucked out of a hat by magic. They can all be justified in retrospect, but I find it more useful to learn from reusable techniques that you can apply to further problems. Examples of tricks I've seen used are: <ol><li>Using Jensen's inequality. It's easy to find inequalities that apply in any situation. But there are often many ways to apply them. 
Why apply it to <i>this</i> way of writing this expression and not that one which is equal?</li><li>Substituting <img src="https://chart.googleapis.com/chart?cht=tx&chl=1%3DA%2FA" style="vertical-align:middle"> in the middle of an expression. Again, you can use <img src="https://chart.googleapis.com/chart?cht=tx&chl=1%3DA%2FA" style="vertical-align:middle"> just about anywhere. Why choose this <img src="https://chart.googleapis.com/chart?cht=tx&chl=A" style="vertical-align:middle"> at this time? Similarly I found derivations that insert a <img src="https://chart.googleapis.com/chart?cht=tx&chl=B-B" style="vertical-align:middle"> into an expression.</li><li>Majorisation-Minimisation. This is a great technique, but involves choosing a function that majorises another. There are so many ways to do this, it's hard to imagine any general purpose method that tells you how to narrow down the choice.</li></ol>My goal is to fill in the details of one key step in the derivation of the EM algorithm in a way that makes it inevitable rather than arbitrary. There's nothing original here, I'm merely expanding on a <a href="http://stats.stackexchange.com/questions/44513/the-relationship-between-expectation-maximization-and-majorization-minimization/59470#59470">stackexchange answer</a>. <P><BR><b>Generalities about EM</b><p>The EM algorithm seeks to construct a maximum likelihood estimator (MLE) with a twist: there are some variables in the system that we can't observe. <P><BR>First assume no hidden variables. We assume there is a vector of parameters <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctheta%3D%28%5Ctheta_i%29" style="vertical-align:middle"> that defines some model. We make some observations <img src="https://chart.googleapis.com/chart?cht=tx&chl=x%3D%28x_j%29" style="vertical-align:middle">. 
We have a probability density <img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28x%7C%5Ctheta%29" style="vertical-align:middle"> that depends on <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctheta" style="vertical-align:middle">. The likelihood of <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctheta" style="vertical-align:middle"> given the observations <img src="https://chart.googleapis.com/chart?cht=tx&chl=x" style="vertical-align:middle"> is <img src="https://chart.googleapis.com/chart?cht=tx&chl=l%28%5Ctheta%7Cx%29%3DP%28x%7C%5Ctheta%29" style="vertical-align:middle">. The maximum likelihood estimator for <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctheta" style="vertical-align:middle"> is the choice of <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctheta" style="vertical-align:middle"> that maximises <img src="https://chart.googleapis.com/chart?cht=tx&chl=l%28%5Ctheta%7Cx%29" style="vertical-align:middle"> for the <img src="https://chart.googleapis.com/chart?cht=tx&chl=x" style="vertical-align:middle"> we have observed. <P><BR>Now suppose there are also some variables <img src="https://chart.googleapis.com/chart?cht=tx&chl=z%3D%28z_k%29" style="vertical-align:middle"> that we didn't get to observe. We assume a density <img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28x%2Cz%7C%5Ctheta%29" style="vertical-align:middle">. We now have <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28x%7C%5Ctheta%29%3D%5Csum_z%20P%28x%2Cz%7C%5Ctheta%29" style="vertical-align:middle"></blockquote>where we sum over all possible values of <img src="https://chart.googleapis.com/chart?cht=tx&chl=z" style="vertical-align:middle">. The MLE approach says we now need to maximise <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=l%28%5Ctheta%7Cx%29%3D%5Csum_z%20P%28x%2Cz%7C%5Ctheta%29."
style="vertical-align:middle"></blockquote>One of the things that is a challenge here is that the components of <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctheta" style="vertical-align:middle"> might be mixed up among the terms in the sum. If, instead, each term only referred to its own unique block of <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctheta_i" style="vertical-align:middle">, then the maximisation would be easier as we could maximise each term independently of the others. Here's how we might move in that direction. Consider instead the log-likelihood <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Clog%20l%28%5Ctheta%7Cx%29%3D%5Clog%5Csum_z%20P%28x%2Cz%7C%5Ctheta%29." style="vertical-align:middle"></blockquote>Now imagine that by magic we could commute the logarithm with the sum. We'd need to maximise <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Csum_z%20%5Clog%20P%28x%2Cz%7C%5Ctheta%29." style="vertical-align:middle"></blockquote>One reason this would be to our advantage is that <img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28x%2Cz%7C%5Ctheta%29" style="vertical-align:middle"> often takes the form <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cexp%28f%28x%2Cz%2C%5Ctheta%29%29" style="vertical-align:middle"> where <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle"> is a simple function to optimise. In addition, <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle"> may break up as a sum of terms, each with its own block of <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctheta_i" style="vertical-align:middle">'s. Moving the logarithm inside the sum would give us something we could easily maximise term by term. 
What's more, the <img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28x%2Cz%7C%5Ctheta%29" style="vertical-align:middle"> for each <img src="https://chart.googleapis.com/chart?cht=tx&chl=z" style="vertical-align:middle"> is often a standard probability distribution whose likelihood we already know how to maximise. But, of course, we can't just move that logarithm in. <P><BR><b>Maximisation by proxy</b><p>Sometimes a function is too hard to optimise directly. But if we have a guess for an optimum, we can replace our function with a proxy function that approximates it in the neighbourhood of our guess and optimise that instead. That will give us a new guess and we can continue from there. This is the basis of gradient descent. Suppose <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle"> is a differentiable function in a neighbourhood of <img src="https://chart.googleapis.com/chart?cht=tx&chl=x_0" style="vertical-align:middle">. Then around <img src="https://chart.googleapis.com/chart?cht=tx&chl=x_0" style="vertical-align:middle"> we have <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=f%28x%29%20%5Capprox%20f%28x_0%29%2Bf%27%28x_0%29%5Ccdot%20%28x-x_0%29." style="vertical-align:middle"></blockquote>We can try optimising <img src="https://chart.googleapis.com/chart?cht=tx&chl=f%28x_0%29%2Bf%27%28x_0%29%5Ccdot%20%28x-x_0%29" style="vertical-align:middle"> with respect to <img src="https://chart.googleapis.com/chart?cht=tx&chl=x" style="vertical-align:middle"> within a neighbourhood of <img src="https://chart.googleapis.com/chart?cht=tx&chl=x_0" style="vertical-align:middle">. If we pick a small circular neighbourhood then the optimal value will be in the direction of steepest descent. (Note that picking a circular neighbourhood is itself a somewhat arbitrary step, but that's another story.) 
For gradient descent we're choosing <img src="https://chart.googleapis.com/chart?cht=tx&chl=f%28x_0%29%2Bf%27%28x_0%29%5Ccdot%20%28x-x_0%29" style="vertical-align:middle"> because it matches both the value and derivatives of <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle"> at <img src="https://chart.googleapis.com/chart?cht=tx&chl=x_0" style="vertical-align:middle">. We could go further and optimise a proxy that shares second derivatives too, and that leads to methods based on Newton-Raphson iteration. <P><BR>We want our logarithm of a sum to be a sum of logarithms. But instead we'll settle for a proxy function that is a sum of logarithms. We'll make the derivatives of the proxy match those of the original function precisely so we're not making an arbitrary choice. <P><BR>Write <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Clog%20l%28%5Ctheta%7Cx%29%0A%3D%20%5Clog%5Csum_z%20P%28x%2Cz%7C%5Ctheta%29%0A%5Capprox%20%5Csum_z%5Cbeta_z%5Clog%20P%28x%2Cz%7C%5Ctheta%29%2B%5Cmbox%7Bconstant%7D." style="vertical-align:middle"></blockquote>The <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cbeta_z" style="vertical-align:middle"> are constants we'll determine. 
We want to match the derivatives on either side of the <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Capprox" style="vertical-align:middle"> at <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctheta%3D%5Ctheta_0" style="vertical-align:middle">: <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cfrac%7B%5Cpartial%20%5Clog%20l%28%5Ctheta_0%7Cx%29%7D%7B%5Cpartial%5Ctheta_0%7D" style="vertical-align:middle"> <img src="https://chart.googleapis.com/chart?cht=tx&chl=%3D%5Cfrac%7B1%7D%7Bl%28%5Ctheta_0%7Cx%29%7D%20%5Cfrac%7B%5Cpartial%20l%28%5Ctheta_0%7Cx%29%7D%7B%5Cpartial%5Ctheta_0%7D%20%3D%5Csum_z%5Cfrac%7B1%7D%7Bl%28%5Ctheta_0%7Cx%29%7D%20%5Cfrac%7B%5Cpartial%20P%28x%2Cz%7C%5Ctheta_0%29%7D%7B%5Cpartial%5Ctheta_0%7D." style="vertical-align:middle"></blockquote>On the other hand we have <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%5Ctheta_0%7D%5Csum_z%5Cbeta_z%5Clog%20P%28x%2Cz%7C%5Ctheta_0%29%0A%3D%5Csum_z%5Cbeta_z%5Cfrac%7B1%7D%7BP%28x%2Cz%7C%5Ctheta_0%29%7D%5Cfrac%7B%5Cpartial%20P%28x%2Cz%7C%5Ctheta_0%29%7D%7B%5Cpartial%5Ctheta_0%7D%0A" style="vertical-align:middle"></blockquote><P><BR>To achieve equality we want to make these expressions match. 
We choose <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%0A%5Cbeta_z%20%3D%20%5Cfrac%7BP%28x%2Cz%7C%5Ctheta_0%29%7D%7Bl%28%5Ctheta_0%7Cx%29%7D%0A%3D%20%5Cfrac%7BP%28x%2Cz%7C%5Ctheta_0%29%7D%7BP%28x%7C%5Ctheta_0%29%7D%0A%3D%20P%28z%7Cx%2C%5Ctheta_0%29.%0A" style="vertical-align:middle"></blockquote>Our desired proxy function is: <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%0A%5Csum_z%20P%28z%7Cx%2C%5Ctheta_0%29%5Clog%20P%28x%2Cz%7C%5Ctheta%29%2B%5Cmbox%7Bconst.%7D%0A%3D%20E_%7BZ%7Cx%2C%5Ctheta_0%7D%28%5Clog%20P%28x%2CZ%7C%5Ctheta%29%29%2B%5Cmbox%7Bconst.%7D%0A" style="vertical-align:middle"></blockquote><P><BR>So the procedure is to take an estimated <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctheta_0" style="vertical-align:middle"> and obtain a new estimate by optimising this proxy function with respect to <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ctheta" style="vertical-align:middle">. This is the standard EM algorithm. <P><BR>It turns out that this proxy has some other useful properties. For example, because of the concavity of the logarithm, with a suitable choice of the constant the proxy is never larger than the original log-likelihood. This means that when we optimise it we never optimise "too far" and that progress optimising the proxy is always progress optimising the original likelihood. But I don't need to say anything about this as it's all part of the standard literature. <P><BR><b>Afterword</b><p>As a side effect we have a general purpose optimisation algorithm that has nothing to do with statistics. 
If your goal is to compute <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%0A%5Cmbox%7Bargmax%7D_x%5Csum_i%5Cexp%28f_i%28x%29%29%0A" style="vertical-align:middle"></blockquote>you can iterate, at each step computing <blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%0A%5Cmbox%7Bargmax%7D_x%5Csum_i%5Cexp%28f_i%28x_0%29%29f_i%28x%29%0A" style="vertical-align:middle"></blockquote>where <img src="https://chart.googleapis.com/chart?cht=tx&chl=x_0" style="vertical-align:middle"> is the previous iteration. If the <img src="https://chart.googleapis.com/chart?cht=tx&chl=f_i" style="vertical-align:middle"> take a convenient form then this may turn out to be much easier. <P><BR><b>Note</b><p>This was originally written as a PDF using LaTeX. It'll be available <a href="https://t.co/hyZrCiAh6h">here</a> for a while. Some fidelity was lost when converting it to HTML. sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com3tag:blogger.com,1999:blog-11295132.post-15935523875580312352016-08-06T19:23:00.000-07:002016-08-06T19:23:49.135-07:00Dimensionful Matrices<b>Introduction</b><p>Programming languages and libraries for numerical work tend not to place a lot of emphasis on the types of their data. For example Matlab, R, Octave, Fortran, and Numpy (but not the now defunct <a href="https://projectfortress.java.net">Fortress</a>) all tend to treat their data as plain numbers meaning that any time you have a temperature and a mass, say, there is nothing to prevent you adding them. <P><BR>I've been wondering how much dimensions (in the sense of <a href="https://en.wikipedia.org/wiki/Dimensional_analysis">dimensional analysis</a>) and units could help with numerical programming. As I pointed out on <a href="https://plus.google.com/+DanPiponi/posts/VNkdwwJL4LM">G+</a> recently (which is where I post shorter stuff these days), you don't have to limit dimensions to the standard ones of length, mass, time, dollars and so on. 
Any scale invariance in the equations you're working with can be exploited as a dimension giving you a property that can be statically checked by a compiler. <P><BR>There are quite a few libraries to statically check dimensions and units now. For example <a href="http://www.boost.org/doc/libs/1_61_0/doc/html/boost_units.html">Boost.Units</a> for C++, <a href="https://hackage.haskell.org/package/units">units</a> for Haskell and even <a href="https://github.com/timjb/quantities">quantities</a> for Idris. <P><BR><b>A matrix that breaks things</b><p>Even if a language supports dimensions, it's typical to define objects like vectors and matrices as homogeneous containers of quantities. But have a look at the Wikipedia page on the <a href="https://en.wikipedia.org/wiki/Metric_tensor_%28general_relativity%29">metric tensor</a>. There is a matrix <P><BR><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ceta%3D%5Cbegin%7Bpmatrix%7D-c%5E2%260%260%260%5C%5C0%261%260%260%5C%5C0%260%261%260%5C%5C0%260%260%261%5Cend%7Bpmatrix%7D" style="vertical-align:middle"><P><BR>which has the curious property that 3 entries on the diagonal seem to be dimensionless while the first entry is a squared velocity with dimension <img src="https://chart.googleapis.com/chart?cht=tx&chl=L%5E2T%5E%7B-2%7D" style="vertical-align:middle">. This will break many libraries that support units. An obvious workaround is to switch to use <a href="https://en.wikipedia.org/wiki/Natural_units">natural units</a>, which is much the same as abandoning the usefulness of dimensions. But there's another way, even if it may be tricky to set up with existing languages. 
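<P><BR>To make the difficulty concrete, here is a minimal standalone sketch of the kind of homogeneous container I have in mind. It isn't taken from any of the libraries mentioned above: the tags <tt>Dimensionless</tt> and <tt>VelocitySquared</tt> and the types <tt>Quantity</tt> and <tt>Matrix</tt> are all hypothetical, and dimensions are tracked with nothing more than phantom types: <P><BR><pre>
-- Hypothetical dimension tags, used only as phantom types in this sketch.
data Dimensionless
data VelocitySquared

-- A number tagged with its dimension.
newtype Quantity d = Quantity Double deriving Show

-- A homogeneous matrix: every entry must carry the same dimension d.
newtype Matrix d = Matrix [[Quantity d]] deriving Show

-- Speed of light in metres per second.
c :: Double
c = 299792458

-- The first diagonal entry of the metric, a squared velocity.
minusCSquared :: Quantity VelocitySquared
minusCSquared = Quantity (negate (c * c))

zero, one :: Quantity Dimensionless
zero = Quantity 0
one = Quantity 1

-- The spatial block is dimensionless throughout, so it fits.
spatialBlock :: Matrix Dimensionless
spatialBlock = Matrix [[one, zero, zero], [zero, one, zero], [zero, zero, one]]

-- Uncommenting the next two lines gives a type error: minusCSquared is a
-- Quantity VelocitySquared, but the other entries force d to be Dimensionless.
-- eta :: Matrix Dimensionless
-- eta = Matrix [[minusCSquared, zero, zero, zero], [zero, one, zero, zero], [zero, zero, one, zero], [zero, zero, zero, one]]

main :: IO ()
main = do
  print spatialBlock
  print minusCSquared
</pre>With a single type parameter for the whole matrix there is simply nowhere to record that one entry differs from the rest, which is the limitation the next section tries to remove.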
<P><BR><b>Heterogeneous vectors and matrices</b><p>According to a common convention in physics, a 4-vector <img src="https://chart.googleapis.com/chart?cht=tx&chl=x%3D%28t%3Dx%5E0%2Cx%5E1%2Cx%5E2%2Cx%5E3%29" style="vertical-align:middle"> has dimensions <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Bx%5D%3D%28T%2C%20L%2C%20L%2C%20L%29" style="vertical-align:middle"> where I'm using the convention that we can represent the units of a vector or matrix simply as a vector or matrix of dimensions, and here <img src="https://chart.googleapis.com/chart?cht=tx&chl=T" style="vertical-align:middle"> is time and <img src="https://chart.googleapis.com/chart?cht=tx&chl=L" style="vertical-align:middle"> is length. The metric tensor is used like this: <img src="https://chart.googleapis.com/chart?cht=tx&chl=ds%5E2%3Dx%5Ei%5Ceta_%7Bij%7Dx%5Ej" style="vertical-align:middle"> (where I'm using the Einstein summation convention so the <img src="https://chart.googleapis.com/chart?cht=tx&chl=i" style="vertical-align:middle">'s and <img src="https://chart.googleapis.com/chart?cht=tx&chl=j" style="vertical-align:middle">'s are summed over). 
If we think of <img src="https://chart.googleapis.com/chart?cht=tx&chl=ds%5E2" style="vertical-align:middle"> having units of length squared (it is a pseudo-Riemannian <i>metric</i> after all) then it makes sense to think of <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Ceta" style="vertical-align:middle"> having dimensions given by <P><BR><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5B%5Ceta%5D%3D%5Cbegin%7Bpmatrix%7DL%5E2T%5E%7B-2%7D%26LT%5E%7B-1%7D%26LT%5E%7B-1%7D%26LT%5E%7B-1%7D%5C%5CLT%5E%7B-1%7D%261%261%261%5C%5CLT%5E%7B-1%7D%261%261%261%5C%5CLT%5E%7B-1%7D%261%261%261%5Cend%7Bpmatrix%7D" style="vertical-align:middle"><P><BR>We can write this more succinctly as <P><BR><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5B%5Ceta%5D%3D%28LT%5E%7B-1%7D%2C1%2C1%2C1%29%5Cotimes%28LT%5E%7B-1%7D%2C1%2C1%2C1%29" style="vertical-align:middle"><P><BR>where <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cotimes" style="vertical-align:middle"> is the usual outer product. <P><BR>I'll use the notation <img src="https://chart.googleapis.com/chart?cht=tx&chl=a%3AA" style="vertical-align:middle"> to mean <img src="https://chart.googleapis.com/chart?cht=tx&chl=a" style="vertical-align:middle"> is of type <img src="https://chart.googleapis.com/chart?cht=tx&chl=A" style="vertical-align:middle">. So, for example, <img src="https://chart.googleapis.com/chart?cht=tx&chl=%28t%2Cx%5E1%2Cx%5E2%2Cx%5E3%29%3A%28T%2CL%2CL%2CL%29" style="vertical-align:middle">. I'll also use pointwise notation for types such as <img src="https://chart.googleapis.com/chart?cht=tx&chl=%28A%2CB%2CC%2CD%29%2A%28E%2CF%2CG%2CH%29%3D%28AE%2CBF%2CCG%2CDH%29" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=%28A%2CB%2CC%29%5E%7B-1%7D%3D%28A%5E%7B-1%7D%2CB%5E%7B-1%7D%2CC%5E%7B-1%7D%29" style="vertical-align:middle">. <P><BR>Now I can give some general rules. 
If <img src="https://chart.googleapis.com/chart?cht=tx&chl=m%3AM" style="vertical-align:middle"> is a matrix, <img src="https://chart.googleapis.com/chart?cht=tx&chl=x%3AX" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=y%3AY" style="vertical-align:middle"> are vectors, and <img src="https://chart.googleapis.com/chart?cht=tx&chl=s%3AS" style="vertical-align:middle"> is a scalar, then <img src="https://chart.googleapis.com/chart?cht=tx&chl=y%3Dmx" style="vertical-align:middle"> only makes sense if <img src="https://chart.googleapis.com/chart?cht=tx&chl=M%3DY%5Cotimes%20X%5E%7B-1%7D" style="vertical-align:middle">. Similarly, the "inner product" <img src="https://chart.googleapis.com/chart?cht=tx&chl=x%5ETmy%3Ds" style="vertical-align:middle"> only makes sense if <img src="https://chart.googleapis.com/chart?cht=tx&chl=M%3DX%5E%7B-1%7D%5Cotimes%20Y%5E%7B-1%7DS" style="vertical-align:middle">. <P><BR><b>Generic vectors and matrices</b><p>Although these kinds of types might be useful if you're dealing with the kind of heterogeneous matrices that appear in relativity, there's another reason they might be useful. If you write code (in the imaginary language that supports these structures and understands dimensions and units) to be as generic as possible in the types of the vector and matrix entries, failures to type check will point out parts of the code where there are hidden assumptions, or even errors, about scaling. For example, consider a routine to find the inverse of a 3 by 3 matrix. Writing this as generically as possible means we should write it to operate on a matrix of type <img src="https://chart.googleapis.com/chart?cht=tx&chl=%28A%2CB%2CC%29%5Cotimes%28D%2CE%2CF%29" style="vertical-align:middle">, say. The result should have type <img src="https://chart.googleapis.com/chart?cht=tx&chl=%28D%2CE%2CF%29%5E%7B-1%7D%5Cotimes%28A%2CB%2CC%29%5E%7B-1%7D" style="vertical-align:middle">. 
If this type checks when used with a suitably powerful type checker then it means that if we replace the units for type A, say, with units twice as large, it should have no effect on the result, taking into account those units. In this case, it means that if we multiply the numbers of the first row of the input by 0.5 then the numbers of the first column of the output should get multiplied by 2. In fact this is a basic property of matrix inverses. In other words, this mathematical property of matrix inverses is guaranteed by a type system that can handle units and heterogeneous matrices. It would be impossible to write a matrix inverter that type checks and fails to have this property. Unfortunately it's still possible to write a matrix inverter that type checks and is incorrect some other way. Nonetheless this kind of type system would put a very big constraint on the code and is likely to eliminate many sources of error. <P><BR><b>An example, briefly sketched</b><p>I thought I'd look at an actual example of a matrix inverter to see what would happen if I used a type checker like the one I've described. I looked at the <a href="https://en.wikipedia.org/wiki/Conjugate_gradient_method">conjugate gradient method</a>. 
At the Wikipedia page, note the line <P><BR><img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Calpha_k%3A%3D%5Cfrac%7B%5Cmathbf%7Br%7D_k%5ET%5Cmathbf%7Br%7D_k%7D%7B%5Cmathbf%7Bp%7D_k%5ET%5Cmathbf%7BAp%7D_k%7D" style="vertical-align:middle"><P><BR>This would immediately fail to type check because if <img src="https://chart.googleapis.com/chart?cht=tx&chl=r" style="vertical-align:middle"> is of generic vector type <img src="https://chart.googleapis.com/chart?cht=tx&chl=R%3D%28R_1%2C%5Cldots%2CR_n%29" style="vertical-align:middle"> then <img src="https://chart.googleapis.com/chart?cht=tx&chl=R_1%5E2" style="vertical-align:middle"> isn't the same type as <img src="https://chart.googleapis.com/chart?cht=tx&chl=R_2%5E2" style="vertical-align:middle"> so they can't be added. I won't go into any of the details but the easiest way to patch up this code to make it type check is to introduce a new matrix <img src="https://chart.googleapis.com/chart?cht=tx&chl=P" style="vertical-align:middle"> of type <img src="https://chart.googleapis.com/chart?cht=tx&chl=R%5E%7B-1%7D%5Cotimes%20R%5E%7B-1%7D" style="vertical-align:middle"> and besides using it to make this inner product work (replacing the numerator by <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cmathbf%7Br%7D_k%5ET%20P%5Cmathbf%7Br%7D_k" style="vertical-align:middle">) we also use <img src="https://chart.googleapis.com/chart?cht=tx&chl=P" style="vertical-align:middle"> anywhere in the code we need to convert a vector of type <img src="https://chart.googleapis.com/chart?cht=tx&chl=R" style="vertical-align:middle"> to a vector of type <img src="https://chart.googleapis.com/chart?cht=tx&chl=R%5E%7B-1%7D" style="vertical-align:middle">. If you try to do this as sparingly as possible you'll end up with a modified algorithm. But at first this seems weird. Why should this matrix inverse routine rely on someone passing in a second matrix to make it type check? 
And what is this new algorithm anyway? Well scroll down the Wikipedia page and you get to the <i>preconditioned</i> conjugate gradient algorithm. The extra matrix we need to pass in is the preconditioner. This second algorithm would type check. Preconditioned conjugate gradient, with a suitable preconditioner, generally performs better than pure conjugate gradient. So in this case we're getting slightly more than a check on our code's correctness. The type checker for our imaginary language would give a hint on how to make the code perform better. There's a reason for this. The original conjugate gradient algorithm is implicitly making a choice of units that sets scales along the axes. These determine the course taken by the algorithm. It's not at all clear that picking these scalings randomly (which is in effect what you're doing if you throw a random problem at the algorithm) is any good. It's better to pick a preconditioner adapted to the scale of the problem and the type checker is hinting (or would be if it existed) that you need to do this. Compare with the gradient descent algorithm whose <a href="http://www.machinedlearnings.com/2011/06/dimensional-analysis-and-gradient.html">scaling problems are better known</a>. <P><BR><b>But which language?</b><p>I guess both Agda and Idris could be made to implement what I've described. However, I've a hunch it might not be easy to use in practice. sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com7tag:blogger.com,1999:blog-11295132.post-17971572217192575942014-05-23T22:21:00.000-07:002014-05-23T22:21:28.593-07:00Cofree meets Free<pre><br />> {-# LANGUAGE RankNTypes, MultiParamTypeClasses, TypeOperators #-}<br /><P><BR><br /></pre><b>Introduction</b><p>After I spoke at <a href="http://www.haskell.org/haskellwiki/BayHac2014">BayHac 2014</a> about free monads I was asked about cofree comonads. So this is intended as a sequel to that talk. 
Not only am I going to try to explain what cofree comonads are, I'm also going to point out a very close relationship between cofree comonads and free monads. <P><BR>At the beginning of the talk the Google Hangout software seems to have switched to the laptop camera so you can't see the slides in the <a href="https://www.youtube.com/watch?v=OGUuGL0AgYs">video</a>. However the slides are <a href="https://drive.google.com/file/d/0B51SFgxqMDS-NDBOX0ZDdW52dEE/edit">here</a>. <P><BR><b>Cothings as machines</b><p>I often think of coalgebraic things as machines. They have some internal state and you can press buttons to change that internal state. For example here is a type class for a machine with two buttons that's related to a magma: <P><BR><pre><br />> class TwoButton a where<br />>   press :: a -> (a, a)<br /><P><BR><br /></pre>The idea is that the state of the machine is given by some type <tt>a</tt> and you could press either the left button or the right button. The result of pressing one or other button is given by these two functions: <P><BR><pre><br />> pressLeft, pressRight :: TwoButton a => a -> a<br />> pressLeft = fst . press<br />> pressRight = snd . press<br /><P><BR><br /></pre>(As with many metaphors used to explain Haskell type classes your mileage may vary. Sometimes you'll have to stretch your imagination to see what the set of buttons is for a particular cothing.) <P><BR><b>Comonads</b><p>Just as monads are a kind of generalised algebraic structure (for example see my talk), comonads are a generalised kind of machine. The idea is that for any state of the machine there is a bunch of buttons we could press. But we don't have two buttons, or any fixed number of buttons. We instead have a functorful of buttons (if you think of functors by analogy with containers). We also don't get to directly see the internal state of the machine but instead we get to make observations. 
<P><BR>Here's the type class: <P><BR><pre><br />> class Comonad w where<br />>   extract :: w a -> a<br />>   duplicate :: w a -> w (w a)<br /><P><BR><br /></pre>The state of the machine is given by <tt>w a</tt>. We observe the state using the <tt>extract</tt> function. And when we come to press a button, we have a functorful of new states that it could end up in. The <tt>duplicate</tt> function gives the container of those new states. <P><BR>For example, various kinds of zipper give rise to comonads. Zippers allow you to "focus" on a part of a data structure. The <tt>extract</tt> operation allows you to observe the point that currently has focus. There is one button for every position in the structure where the focus could be. Pressing the corresponding button moves the focus to that point. Similarly the <tt>Store</tt> comonad has one button for each value you can store in the field it represents. Press the button and the value gets stored in the field. <P><BR><b>Cofreeness as a way to memoise</b><p>Cofree coalgebras can be thought of as memoised forms of elements of coalgebras. For example, the <tt>TwoButton</tt> machine above has a function, <tt>press</tt>, as part of its definition. Memoising an element of such a thing means tabulating everything that could possibly happen if you pressed the buttons so we no longer need the <tt>press</tt> function. One approach is to try something like this: <P><BR><pre><br />data CofreeTwoButton = Memo CofreeTwoButton CofreeTwoButton<br /><P><BR><br /></pre>The structure contains two <tt>CofreeTwoButton</tt>s, each giving the result of pressing one of the two buttons. Any element of any <tt>TwoButton</tt> may now be memoised like so: <P><BR><pre><br />memoiseTwoButton :: TwoButton m => m -> CofreeTwoButton<br />memoiseTwoButton m = Memo (memoiseTwoButton (pressLeft m)) (memoiseTwoButton (pressRight m))<br /><P><BR><br /></pre>It definitely tabulates the result of pressing buttons. But it has a major flaw. 
We have no way of seeing what's stored in the table! To make this useful we want to also store some data in the table that we can peek at. So here is a better definition: <P><BR><pre><br />> data CofreeTwoButton a = Memo a (CofreeTwoButton a) (CofreeTwoButton a)<br />> memoiseTwoButton :: TwoButton m => (m -> a) -> m -> CofreeTwoButton a<br />> memoiseTwoButton f m = Memo (f m) (memoiseTwoButton f (pressLeft m)) (memoiseTwoButton f (pressRight m))<br /><P><BR><br /></pre>The first argument to <tt>memoiseTwoButton</tt> says what we want to store in the table and then <tt>memoiseTwoButton</tt> goes ahead and stores it. We can use the identity function if we want to store the original elements. <P><BR>Note how this is like <tt>foldMap</tt>: <P><BR><pre><br />foldMap :: Monoid m => (a -> m) -> t a -> m<br /><P><BR><br /></pre>if we replace <tt>t</tt> by the list functor and remember that lists are free monoids. The main difference is that arrows have been reversed. Where <tt>foldMap</tt> takes an element of a free monoid and interprets it as an element of another monoid, <tt>memoiseTwoButton</tt> packs an element of a <tt>TwoButton</tt> into a cofree structure. The "interpretation" and "packing" here are both homomorphisms for their respective structures. Homomorphisms respect equations so if an equation holds between elements of a free monoid we expect it to also hold when interpreted in another monoid. But any element of a free monoid can be interpreted in any other monoid meaning that any equation that holds between elements of a free monoid must hold in any monoid. That's why free monoids are designed so that the only equations that hold between elements are those that follow from the monoid laws. <P><BR>With the <tt>TwoButton</tt> we have a dualised version of the above. Every element of every <tt>TwoButton</tt> can be packed into the <tt>CofreeTwoButton</tt>. So every equation in the original structure will still hold after the packing. 
So every equation that holds in some <tt>TwoButton</tt> must have some solution in <tt>CofreeTwoButton</tt>. That gives an idea of what a <tt>CofreeTwoButton</tt> is by analogy with the free monoid. <P><BR><b>Cofree comonads</b><p>A cofree comonad is basically a memoised comonad. So the data structure is: <P><BR><pre><br />> data Cofree f a = Cofree a (f (Cofree f a))<br /><P><BR><br /></pre>At each point in the "table" we store some observable value of type <tt>a</tt>. And we have a functorful of buttons, so we expect to have a functorful of new states we could transition to. The <tt>Functor</tt> instance looks like: <P><BR><pre><br />> instance Functor f => Functor (Cofree f) where<br />>   fmap f (Cofree a fs) = Cofree (f a) (fmap (fmap f) fs)<br /><P><BR><br /></pre>We apply <tt>f</tt> to the observable value and then push the <tt>fmap f</tt> down to the child nodes. <P><BR>The <tt>duplicate</tt> function takes a memoised state and replaces the observable stored at each position with the memoised state that gives rise to the observable. <P><BR><pre><br />> instance Functor f => Comonad (Cofree f) where<br />>   extract (Cofree a _) = a<br />>   duplicate c@(Cofree _ fs) = Cofree c (fmap duplicate fs)<br /><P><BR><br /></pre>Now by analogy with <tt>memoiseTwoButton</tt> we can memoise comonads. <P><BR><pre><br />> memoiseComonad :: (Comonad w, Functor f) =><br />>   (forall x.w x -> f x) -> (forall x.w x -> Cofree f x)<br />> memoiseComonad f w = Cofree (extract w) (fmap (memoiseComonad f) (f (duplicate w)))<br /><P><BR><br /></pre>So that's what a cofree comonad is: it's a type that can be used to memoise all of the states that are accessible from a state in a comonad by pressing its buttons. <P><BR><b>Cofree comonad meets free monad</b><p>But that's not all. There is a close relationship between cofree comonads and free monads. 
So to get going, here's a free monad type: <P><BR><pre><br />> data Free f a = Id a | Free (f (Free f a))<br /><P><BR><br />> join' :: Functor f => Free f (Free f a) -> Free f a<br />> join' (Id x) = x<br />> join' (Free fa) = Free (fmap join' fa)<br /><P><BR><br />> instance Functor f => Functor (Free f) where<br />>   fmap f (Id x) = Id (f x)<br />>   fmap f (Free fa) = Free (fmap (fmap f) fa)<br /><P><BR><br />> -- Applicative is a superclass of Monad from GHC 7.10 onwards, so newer compilers need this instance.<br />> instance Functor f => Applicative (Free f) where<br />>   pure = Id<br />>   mf &lt;*&gt; mx = join' (fmap (\f -> fmap f mx) mf)<br /><P><BR><br />> instance Functor f => Monad (Free f) where<br />>   return = Id<br />>   m >>= f = join' (fmap f m)<br /><P><BR><br /></pre>Now I'll define a kind of pairing between functors. Given a way to combine two kinds of element, the pairing gives a way to combine a pair of containers of those elements. <P><BR><pre><br />> class (Functor f, Functor g) => Pairing f g where<br />>   pair :: (a -> b -> r) -> f a -> g b -> r<br /><P><BR><br />> data Identity a = Identity a<br />> instance Functor Identity where<br />>   fmap f (Identity x) = Identity (f x)<br /><P><BR><br />> instance Pairing Identity Identity where<br />>   pair f (Identity a) (Identity b) = f a b<br /><P><BR><br />> data (f :+: g) x = LeftF (f x) | RightF (g x)<br />> instance (Functor f, Functor g) => Functor (f :+: g) where<br />>   fmap f (LeftF x) = LeftF (fmap f x)<br />>   fmap f (RightF x) = RightF (fmap f x)<br /><P><BR><br />> data (f :*: g) x = f x :*: g x<br />> instance (Functor f, Functor g) => Functor (f :*: g) where<br />>   fmap f (x :*: y) = fmap f x :*: fmap f y<br /><P><BR><br />> instance (Pairing f f', Pairing g g') => Pairing (f :+: g) (f' :*: g') where<br />>   pair p (LeftF x) (a :*: _) = pair p x a<br />>   pair p (RightF x) (_ :*: b) = pair p x b<br /><P><BR><br />> instance (Pairing f f', Pairing g g') => Pairing (f :*: g) (f' :+: g') where<br />>   pair p (a :*: _) (LeftF x) = pair p a x<br />>   pair p (_ :*: b) (RightF x) = pair p b x<br /><P><BR><br />> instance Pairing ((->) a) ((,) a) where<br />>   pair p f = uncurry (p . 
f)<br /><P><BR><br /></pre>Given a pairing between <tt>f</tt> and <tt>g</tt> we get one between <tt>Cofree f</tt> and <tt>Free g</tt>. <P><BR><pre><br />> instance Pairing f g => Pairing (Cofree f) (Free g) where<br />>   pair p (Cofree a _) (Id x) = p a x<br />>   pair p (Cofree _ fs) (Free gs) = pair (pair p) fs gs<br /><P><BR><br /></pre>An element of <tt>Free g</tt> can be thought of as an expression written in a DSL. So this pairing gives a way to apply a monadic expression to a memoised comonad. In other words, if you think of comonads as machines, monads give a language that can be used to compute something based on the output of the machine. <P><BR>Here's an almost trivial example just so you can see everything working together. A reasonable definition of a comagma structure on the type <tt>a</tt> is <tt>a -> UpDown a</tt> with <tt>UpDown</tt> defined as: <P><BR><pre><br />> data UpDown a = Up a | Down a<br /><P><BR><br />> instance Functor UpDown where<br />>   fmap f (Up a) = Up (f a)<br />>   fmap f (Down a) = Down (f a)<br /><P><BR><br />> type CofreeComagma a = Cofree UpDown a<br /><P><BR><br /></pre>A well known comagma structure on the positive integers is given by the famous <a href="https://en.wikipedia.org/wiki/Collatz_conjecture">Collatz conjecture</a>: <P><BR><pre><br />> collatz :: Integer -> UpDown Integer<br />> collatz n = if even n then Down (n `div` 2) else Up (3*n+1)<br /><P><BR><br /></pre>We can memoise this as a cofree comonad: <P><BR><pre><br />> memoisedCollatz :: Integer -> CofreeComagma Integer<br />> memoisedCollatz n = Cofree n (fmap memoisedCollatz (collatz n))<br /><P><BR><br /></pre>Here's a picture of <tt>memoisedCollatz 12</tt>: <div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-48VwzL8iXk0/U39aecynoWI/AAAAAAAAIOI/KvJ4ZArwmjw/s1600/cofree.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" 
src="http://3.bp.blogspot.com/-48VwzL8iXk0/U39aecynoWI/AAAAAAAAIOI/KvJ4ZArwmjw/s1600/cofree.png" /></a></div> <P><BR>Now let's make the dual functor in readiness for building the dual monad: <P><BR><pre><br />> data Two a = Two a a<br />> instance Functor Two where<br />>   fmap f (Two a b) = Two (f a) (f b)<br /><P><BR><br /></pre>And here we set up a pairing: <P><BR><pre><br />> instance Pairing UpDown Two where<br />>   pair f (Up a) (Two b _) = f a b<br />>   pair f (Down a) (Two _ c) = f a c<br /><P><BR><br />> execute :: Cofree UpDown x -> Free Two (x -> r) -> r<br />> execute w m = pair (flip ($)) w m<br /><P><BR><br /></pre>This gives rise to a free monad isomorphic to the one in my talk: <P><BR><pre><br />> data Direction = WentUp | WentDown deriving Show<br /><P><BR><br />> choose :: Free Two Direction<br />> choose = Free (Two (return WentUp) (return WentDown))<br /><P><BR><br /></pre>And here's an example of some code written in the corresponding DSL: <P><BR><pre><br />> ex1 :: Free Two (Integer -> String)<br />> ex1 = do<br />>   x <- choose<br />>   y <- choose<br />>   case (x, y) of<br />>     (WentDown, WentDown) -> return (\z -> "Decreased twice " ++ show z)<br />>     _ -> return show<br /><P><BR><br /></pre>It can be represented as: <P><BR> <div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-IlPy3xTUjvk/U39aw8ENnGI/AAAAAAAAIOQ/uHkKOBo_n8o/s1600/free.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-IlPy3xTUjvk/U39aw8ENnGI/AAAAAAAAIOQ/uHkKOBo_n8o/s1600/free.png" /></a></div> <P><BR>And here's what happens when they meet: <P><BR><pre><br />> go1 :: String<br />> go1 = execute (memoisedCollatz 12) ex1<br /><P><BR><br /></pre>This can be understood through the combined picture: <P><BR> <div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-r1-8N9YcvdU/U39bLwjPrTI/AAAAAAAAIOY/LVb5euAVD7Y/s1600/cofreemeetsfree.png" 
imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-r1-8N9YcvdU/U39bLwjPrTI/AAAAAAAAIOY/LVb5euAVD7Y/s1600/cofreemeetsfree.png" /></a></div> <P><BR><b>References</b><p>On getting monads from comonads more generally see <a href="http://comonad.com/reader/2011/monads-from-comonads/">Monads from Comonads</a>. For more on memoising and how it's really all about the Yoneda lemma see <a href="http://blog.sigfpe.com/2009/11/memoizing-polymorphic-functions-with.html">Memoizing Polymorphic Functions</a>. I'm waiting for Tom Leinster to publish some related work. The pairing above gives a way for elements of free monads to pick out elements of cofree comonads and is a special case of what I'm talking about <a href="http://mathoverflow.net/a/12460">here</a>. But I think Tom has some unpublished work that goes further. <P><BR>If you think of a comonad as a compressed object that is decompressed by a monadic decision tree, then you'd expect some form of information theoretical description to apply. That makes me think of <a href="http://ncatlab.org/johnbaez/show/Convex+spaces+and+an+operadic+approach+to+entropy">Convex spaces and an operadic approach to entropy</a>. 
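<P><BR>As a check on the pairing between <tt>Cofree UpDown</tt> and <tt>Free Two</tt>, here is a hand evaluation of <tt>go1</tt>. It is a sketch of equational reasoning that uses only the definitions given in this post, and it is not meant to be compiled as part of the literate program. The names <tt>c6</tt>, <tt>c3</tt>, <tt>dec</tt>, <tt>up1</tt> and <tt>down1</tt> are abbreviations introduced just for this trace, and <tt>ex1</tt> is written out in the form it takes once the <tt>do</tt> block has been expanded with <tt>>>=</tt> and <tt>join'</tt>. <P><BR><pre>
-- Abbreviations used only in this trace:
--   c6    = memoisedCollatz 6 = Cofree 6 (Down c3)
--   c3    = memoisedCollatz 3 = Cofree 3 (Up (memoisedCollatz 10))
--   dec   = \z -> "Decreased twice " ++ show z
--   up1   = Free (Two (Id show) (Id show))   -- first choice was WentUp
--   down1 = Free (Two (Id show) (Id dec))    -- first choice was WentDown
--   ex1   = Free (Two up1 down1)

execute (memoisedCollatz 12) ex1
  = pair (flip ($)) (Cofree 12 (Down c6)) (Free (Two up1 down1))
  = pair (pair (flip ($))) (Down c6) (Two up1 down1)   -- Cofree/Free instance
  = pair (flip ($)) c6 down1                           -- Down selects the second slot of Two
  = pair (pair (flip ($))) (Down c3) (Two (Id show) (Id dec))
  = pair (flip ($)) c3 (Id dec)                        -- Down again selects the second slot
  = flip ($) 3 dec                                     -- the observable stored in c3 meets the leaf
  = dec 3
  = "Decreased twice 3"
</pre>So the machine decides which branch of the decision tree is followed (12 is even, then 6 is even), and the leaf that is reached is applied to the observable stored at the corresponding point of the memoised structure, here 3. 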
sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com8tag:blogger.com,1999:blog-11295132.post-49990623468641023252014-05-17T08:22:00.000-07:002014-05-17T09:07:37.059-07:00Types, and two approaches to problem solving<h2 dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 10pt;"><span style="background-color: transparent; color: black; font-family: 'Trebuchet MS'; font-size: 17px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Introduction</span></h2><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">There are two broad approaches to problem solving that I see frequently in mathematics and computing. One is attacking a problem via subproblems, and another is attacking a problem via quotient problems. The former is well known though I’ll give some examples to make things clear. 
The latter can be harder to recognise but there is one example that just about everyone has known since infancy.</span></div><b id="docs-internal-guid-840c1bc0-0aa5-4d74-4ec7-038b3dfaa261" style="font-weight: normal;"><br /></b><br /><h2 dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 10pt;"><span style="background-color: transparent; color: black; font-family: 'Trebuchet MS'; font-size: 17px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Subproblems</span></h2><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Consider sorting algorithms. A large class of sorting algorithms, including <a href="http://en.wikipedia.org/wiki/Quicksort">quicksort</a>, break a sequence of values into two pieces. The two pieces are smaller so they are easier to sort. We sort those pieces and then combine them, using some kind of merge operation, to give an ordered version of the original sequence. Breaking things down into subproblems is ubiquitous and is useful far outside of mathematics and computing: in cooking, in finding our path from A to B, in learning the contents of a book. 
So I don’t need to say much more here.</span></div><b style="font-weight: normal;"><br /></b><br /><h2 dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 10pt;"><span style="background-color: transparent; color: black; font-family: 'Trebuchet MS'; font-size: 17px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Quotient problems</span></h2><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The term </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">quotient</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> is a technical term from mathematics. But I want to use the term loosely to mean something like this: a </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">quotient problem</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> is what a problem looks like if you wear a certain kind of filter over your eyes. The filter hides some aspect of the problem that simplifies it. 
You solve the simplified problem and then take off the filter. You now ‘lift’ the solution of the simplified problem to a solution to the full problem. The catch is that your filter needs to match your problem so I’ll start by giving an example where the filter doesn’t work.</span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Suppose we want to add a list of integers, say: 123, 423, 934, 114. We can try simplifying this problem by wearing a filter that makes numbers fuzzy so we can’t distinguish numbers that differ by less than 10. When we wear this filter 123 looks like 120, 423 looks like 420, 934 looks like 930 and 114 looks like 110. So we can try adding 120+420+930+110. This is a simplified problem and in fact this is a common technique to get approximate answers via mental arithmetic. We get 1580. We might hope that when wearing our filters, 1580 looks like the correct answer. But it doesn’t. The correct answer is 1594. 
This filter doesn’t respect addition in the sense that if </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">a</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> looks like </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">a’</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> and </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">b</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> looks like </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">b’</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> 
it doesn’t follow that </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">a</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">+</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">b</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> looks like </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">a’</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">+</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">b</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">’.</span></div><b style="font-weight: normal;"><br 
/></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">To solve a problem via quotient problems we usually need to find a filter that does respect the original problem. So let’s wear a different filter that allows us just to see the last digit of a number. Our original problem now looks like summing the list 3, 3, 4, 4. We get 4. This </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">is</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> the correct last digit. If we now try a filter that allows us to see just the last two digits we see that summing 23, 23, 34, 14 does in fact give the correct last two digits. This is why the standard elementary school algorithms for addition and multiplication work through the digits from right to left: at each stage we’re solving a quotient problem but the filter only respects the original problem if it allows us to see the digits to the right of some point, not digits to the left. 
This filter </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">does</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> respect addition in the sense that if </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">a</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> looks like </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">a’</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> and </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">b</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> looks like 
</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">b’</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> then </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">a</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">+</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">b</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> looks like </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">a’</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">+</span><span style="background-color: transparent; color: black; 
font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">b</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">’.</span></div><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-M6YpX5gqPU8/U3d7-m612uI/AAAAAAAAIMI/foFvFkr6ui8/s1600/chessboard.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-M6YpX5gqPU8/U3d7-m612uI/AAAAAAAAIMI/foFvFkr6ui8/s1600/chessboard.png" height="320" width="320" /></a></div><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Another example of the quotient approach is to look at the <a href="http://en.wikipedia.org/wiki/Knight's_tour">knight’s tour</a> problem in the case where two opposite corners have been removed from the chessboard. A knight’s tour is a sequence of knight’s moves that visits each square on a board exactly once. If we remove opposite corners of the chessboard, there is no knight’s tour of the remaining 62 squares. How can we prove this? If you don’t see the trick you can get caught up in all kinds of complicated reasoning. 
So now put on a filter that removes your ability to see the spatial relationships between the squares so you can only see the colours of the squares. This respects the original problem in the sense that a knight’s move goes from a black square to a white square, or from a white square to a black square. The filter doesn’t stop us seeing this. But now it’s easier to see that there are two more squares of one colour than the other and so no knight’s tour is possible. We didn’t need to be able to see the spatial relationships at all.</span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">(Note that this is the same trick as we use for arithmetic, though it’s not immediately obvious. If we think of the spatial position of a square as being given by a pair of integers (x, y), then the colour is given by x+y modulo 2. In other words, by the last digit of x+y written in binary. 
So it’s just the see-only-digits-on-the-right filter at work again.)</span></div><b style="font-weight: normal;"><br /></b><br /><h2 dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 10pt;"><span style="background-color: transparent; color: black; font-family: 'Trebuchet MS'; font-size: 17px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Wearing filters while programming</span></h2><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">So now think about developing some code in a dynamic language like Python. Suppose we execute the line:</span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: Courier New, Courier, monospace;"><b>a = 1</b></span></span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The Python interpreter doesn’t just store the integer 1 somewhere in memory. It also stores a tag indicating that the data is to be interpreted as an integer. 
When you come to execute the line:</span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><b><span style="font-family: Courier New, Courier, monospace;">b = a+1</span></b></span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">it will first examine the tag in </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">a</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> indicating its type, in this case </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">int</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">, and use that to determine what the type for </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: 
normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">b</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> should be.</span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Now suppose we wear a filter that allows us to see the tag indicating the type of some data, but not the data itself. Can we still reason about what our program does?</span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In many cases we can. 
For example we can, in principle, deduce the type of</span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><b><span style="font-family: Courier New, Courier, monospace;">a+b*(c+1)/(2+d)</span></b></span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">if we know the types of </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><b>a</b></span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><b>b</b></span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; 
white-space: pre-wrap;"><b>c</b></span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><b>d</b></span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. (As I’ve said once before, it’s hard to make any reliable statement about a bit of Python code so let's suppose that </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><b>a</b></span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><b>b</b></span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">, </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: 
pre-wrap;"><b>c</b></span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> and </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><b>d</b></span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> are all either of type </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">int</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> or type </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">float</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">.) We can read and understand quite a bit of Python code wearing this filter. But it’s easy to go wrong. 
For example consider</span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><b><span style="font-family: Courier New, Courier, monospace;">if a>1:</span></b></span></div><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><b><span style="font-family: Courier New, Courier, monospace;"><span style="background-color: transparent; color: black; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">return 1.0</span></span></b></div><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><b><span style="font-family: Courier New, Courier, monospace;">else:</span></b></span></div><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><b><span style="font-family: Courier New, Courier, monospace;"><span style="background-color: transparent; color: black; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-size: 15px; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: 
baseline; white-space: pre-wrap;">return 1</span></span></b></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The type of the result depends on the value of the variable a. So if we’re wearing the filter that hides the data, then we can’t predict what this snippet of code does. When we run it, it might return an </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">int</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> sometimes and a </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">float</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> other times, and we won’t be able to see what made the difference.</span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; 
vertical-align: baseline; white-space: pre-wrap;">In a statically typed language you </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">can</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> predict the type of an expression knowing the types of its parts. This means you can reason reliably about code while wearing the hide-the-value filter. It also means that almost any programming problem can be split into two parts: a quotient problem where you forget about the values, and then the problem of lifting a solution to the quotient problem to a solution to the full problem. Or to put that in more conventional language: designing your data and function types, and then implementing the code that fits those types.</span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">I chose to make the contrast between dynamic and static languages just to make the ideas clear but actually you can happily use similar reasoning for both types of language. 
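</span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">As a rough sketch of what that reasoning might look like in Python, the snippet below writes down a signature first (the quotient problem) and then an implementation that fits it. The function names clipped and unclear are hypothetical, and the annotations are not enforced by the interpreter; a separate checker such as mypy would be needed to verify them.</span></div><b style="font-weight: normal;"><br /></b><br />
<pre>
```python
def clipped(a: int) -> float:
    """Every branch returns a float, so the result type can be
    predicted without looking at the value of a."""
    if a > 1:
        return 1.0
    else:
        return float(1)


def unclear(a: int):
    """The earlier snippet wrapped in a function: the result type
    depends on the value of a, so no single return type fits."""
    if a > 1:
        return 1.0
    else:
        return 1


# Wearing the hide-the-value filter we only look at the types of the results.
print(type(clipped(0)).__name__, type(clipped(5)).__name__)  # float float
print(type(unclear(0)).__name__, type(unclear(5)).__name__)  # int float
```
</pre>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">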
Compilers for statically typed languages give you a lot of assistance if you choose to solve your programming problems this way.</span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">A good example of this at work is given in Haskell. If you're writing a compiler, say, you might want to represent a piece of code as an abstract syntax tree, and implement algorithms that recurse through the tree. In Haskell the type system is strong enough that once you’ve defined the tree type the form of the recursion algorithms is often more or less given. In fact, it can be tricky to implement tree recursion </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">in</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">correctly and have the code compile without errors. 
Solving the quotient problem of getting the types right gets you much of the way towards solving the full problem.</span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">And that’s my main point: types aren’t simply a restriction mechanism to help you avoid making mistakes. Instead they are a way to reduce some complex programming problems to simpler ones. But the simpler problem isn’t a </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">sub</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">problem, it’s a </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">quotient</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> problem.</span></div><h2 dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 10pt;"><span style="background-color: transparent; color: black; font-family: 'Trebuchet MS'; font-size: 17px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; 
white-space: pre-wrap;">Dependent types</span></h2><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Dependently typed languages give you even more flexibility with what filters you wear. They allow you to mix up values and types. For example both C++ and Agda (to pick an unlikely pair) allow you to wear filters that hide the values of elements in your arrays while allowing you to see the length of your arrays. This makes it easier to concentrate on some aspects of your problem while completely ignoring others.</span></div><b style="font-weight: normal;"><br /></b><br /><h2 dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 10pt;"><span style="background-color: transparent; color: black; font-family: 'Trebuchet MS'; font-size: 17px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Notes</span></h2><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">I wrote the first draft of this a couple of years ago but never published it. 
I was motivated to post by a discussion kicked off by Voevodsky on the TYPES mailing list </span><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;"><a href="http://lists.seas.upenn.edu/pipermail/types-list/2014/001745.html" style="text-decoration: none;">http://lists.seas.upenn.edu/pipermail/types-list/2014/001745.html</a></span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">This article isn’t a piece of rigorous mathematics and I’m using mathematical terms as analogies.</span></div><b style="font-weight: normal;"><br /></b><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The notion of a subproblem isn’t completely distinct from a quotient problem. 
Some problems are both, and in fact some problems can be solved by transforming them so they become both.</span></div><br /><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">More generally, looking at computer programs through different filters is one approach to abstract interpretation </span><a href="http://en.wikipedia.org/wiki/Abstract_interpretation" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">http://en.wikipedia.org/wiki/Abstract_interpretation</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. 
The intuition section there (</span><a href="http://en.wikipedia.org/wiki/Abstract_interpretation#Intuition" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">http://en.wikipedia.org/wiki/Abstract_interpretation#Intuition</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">) has much in common with what I’m saying.</span></div>sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com4tag:blogger.com,1999:blog-11295132.post-31935670455334094142014-04-25T21:42:00.000-07:002014-04-25T21:51:11.820-07:00The Monad called Free<b>Introduction</b><p>As Dan Doel points out <a href="https://www.fpcomplete.com/user/dolio/many-roads-to-free-monads">here</a>, the gadget <tt>Free</tt> that turns a functor into a monad is itself a kind of monad, though not the usual kind of monad we find in Haskell. I'll call it a higher order monad and you can find a type class corresponding to this in various places including an old version of Ed Kmett's <a href="http://comonad.com/haskell/category-extras/dist/doc/html/category-extras/Control-Monad-HigherOrder.html">category-extras</a>. I'll borrow some code from there. I hunted around and couldn't find an implementation of <tt>Free</tt> as an instance of this class so I thought I'd plug the gap. 
<P><BR><pre><br />> {-# LANGUAGE RankNTypes, FlexibleContexts, InstanceSigs, ScopedTypeVariables #-}<br /><P><BR><br />> import Control.Monad<br />> import Data.Monoid<br /><P><BR><br /></pre>To make things unambiguous I'll implement free monads in the usual way here: <P><BR><pre><br />> data Free f a = Pure a | Free (f (Free f a))<br /><P><BR><br />> instance Functor f => Functor (Free f) where<br />> fmap f (Pure a) = Pure (f a)<br />> fmap f (Free a) = Free (fmap (fmap f) a)<br /><P><BR><br />> instance Functor f => Monad (Free f) where<br />> return = Pure<br />> Pure a >>= f = f a<br />> Free a >>= f = Free (fmap (>>= f) a)<br /><P><BR><br /></pre>The usual Haskell typeclass <tt>Monad</tt> corresponds to monads in the category of types and functions, <tt>Hask</tt>. We're going to want monads in the category of endomorphisms of <tt>Hask</tt> which I'll call <tt>Endo</tt>. <P><BR>The objects in <tt>Endo</tt> correspond to Haskell's <tt>Functor</tt>. The arrows in <tt>Endo</tt> are the natural transformations between these functors: <P><BR><pre><br />> type Natural f g = (Functor f, Functor g) => forall a. f a -> g a<br /><P><BR><br /></pre>So now we are led to consider functors in <tt>Endo</tt>. <P><BR><pre><br />> class HFunctor f where<br /><P><BR><br /></pre>A functor in <tt>Endo</tt> must map functors in <tt>Hask</tt> to functors in <tt>Hask</tt>. So if <tt>f</tt> is a functor in <tt>Endo</tt> and <tt>g</tt> is a functor in <tt>Hask</tt>, then <tt>f g</tt> must be another functor in <tt>Hask</tt>. So there must be an <tt>fmap</tt> associated with this new functor. There's an associated <tt>fmap</tt> for every <tt>g</tt> and we collect them all into one big happy natural family: <P><BR><pre><br />> ffmap :: Functor g => (a -> b) -> f g a -> f g b<br /><P><BR><br /></pre>But note also that by virtue of being a functor itself, <tt>f</tt> must have its own <tt>fmap</tt> type function associated with it. 
The arrows in <tt>Endo</tt> are natural transformations in <tt>Hask</tt> so the <tt>fmap</tt> for <tt>HFunctor</tt> must take arrows in <tt>Endo</tt> to arrows in <tt>Endo</tt> like so: <P><BR><pre><br />> hfmap :: (Functor g, Functor h) => Natural g h -> Natural (f g) (f h)<br /><P><BR><br /></pre>Many constructions in the category <tt>Hask</tt> carry over to <tt>Endo</tt>. In <tt>Hask</tt> we can form a product of two types <tt>a</tt> and <tt>b</tt> as <tt>(a, b)</tt>. In <tt>Endo</tt> we form the product of two functors <tt>f</tt> and <tt>g</tt> as <P><BR><pre><br />> data Product f g a = Product (f (g a))<br /><P><BR><br /></pre>Note that this product isn't commutative. We don't necessarily have an isomorphism from <tt>Product f g</tt> to <tt>Product g f</tt>. (This breaks many attempts to transfer constructions from <tt>Hask</tt> to <tt>Endo</tt>.) We also won't explicitly use <tt>Product</tt> because we can simply use the usual Haskell composition of functors inline. <P><BR>We can implement some functions that act on product types in both senses of the word "product": <P><BR><pre><br />> left :: (a -> c) -> (a, b) -> (c, b)<br />> left f (a, b) = (f a, b)<br /><P><BR><br />> right :: (b -> c) -> (a, b) -> (a, c)<br />> right f (a, b) = (a, f b)<br /><P><BR><br />> hleft :: (Functor a, Functor b, Functor c) => Natural a c -> a (b x) -> c (b x)<br />> hleft f = f<br /><P><BR><br />> hright :: (Functor a, Functor b, Functor c) => Natural b c -> a (b x) -> a (c x)<br />> hright f = fmap f<br /><P><BR><br /></pre>(Compare with what I wrote <a href="http://blog.sigfpe.com/2008/11/from-monoids-to-monads.html">here</a>.) <P><BR>We have something in <tt>Endo</tt> a bit like the type with one element in <tt>Hask</tt>, namely the identity functor. The product of a type <tt>a</tt> with the one element type in <tt>Hask</tt> gives you something isomorphic to <tt>a</tt>. In <tt>Endo</tt> the product is composition for which the identity functor is the identity. 
(Two different meanings of the word "identity" there.) <P><BR>We also have sums. For example, if we define a functor like so <P><BR><pre><br />> data F a = A a | B a a<br /><P><BR><br /></pre>we can think of <tt>F</tt> as a sum of two functors: one with a single constructor <tt>A</tt> and another with constructor <tt>B</tt>. <P><BR>We can now think about reproducing an <tt>Endo</tt> flavoured version of lists. The usual definition is isomorphic to: <P><BR><pre><br />> data List a = Nil | Cons a (List a)<br /><P><BR><br /></pre>And it has a <tt>Monoid</tt> instance: <P><BR><pre><br />> instance Monoid (List a) where<br />> mempty = Nil<br />> mappend Nil as = as<br />> mappend (Cons a as) bs = Cons a (mappend as bs)<br /><P><BR><br /></pre>We can try to translate that into <tt>Endo</tt>. The <tt>Nil</tt> part can be thought of as being an element of a type with one element so it should become the identity functor. The <tt>Cons a (List a)</tt> part is a product of <tt>a</tt> and <tt>List a</tt> so that should get replaced by a composition. So we expect to see something vaguely like: <P><BR><pre><br />List' a = Nil' | Cons' (a (List' a))<br /><P><BR><br /></pre>That's not quite right because <tt>List' a</tt> is a functor, not a type, and so acts on types. So a better definition would be: <P><BR><pre><br />List' a b = Nil' b | Cons' (a (List' a b))<br /><P><BR><br /></pre>That's just the definition of <tt>Free</tt>. So free monads are lists in <tt>Endo</tt>. As everyone knows :-) monads are just monoids in the category of endofunctors. Free monads are also just free monoids in the category of endofunctors. <P><BR>So now we can expect many constructions associated with monoids and lists to carry over to monads and free monads. 
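<P><BR>To make the "lists in <tt>Endo</tt>" claim a little more concrete, here is a small standalone sketch (not part of the literate program above, so it repeats the definition of <tt>Free</tt>). It assumes we pick the functor <tt>(,) w</tt>, in which case <tt>Free ((,) w) b</tt> ought to be nothing more than an ordinary list of <tt>w</tt>s with a <tt>b</tt> sitting where <tt>Nil</tt> would be. The names <tt>unroll</tt> and <tt>roll</tt> are made up for this illustration. <P><BR><pre>
data Free f a = Pure a | Free (f (Free f a))

-- Hypothetical helper: flatten a Free ((,) w) b into its list of w's and its final b.
unroll :: Free ((,) w) b -> ([w], b)
unroll (Pure b) = ([], b)
unroll (Free (w, rest)) = let (ws, b) = unroll rest in (w : ws, b)

-- Hypothetical helper: the other direction, building the Free value from a list.
roll :: [w] -> b -> Free ((,) w) b
roll ws b = foldr (\w rest -> Free (w, rest)) (Pure b) ws

-- Round trip a three element list through Free.
main :: IO ()
main = print (unroll (roll "abc" ()))
</pre>Running this as its own file should print <tt>("abc",())</tt>: the round trip recovers the original list, which is what we'd expect if <tt>Free</tt> really is playing the role of <tt>List</tt>. 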
<P><BR>An obvious one is the generalization of the singleton map <tt>a -> List a</tt>: <P><BR><pre><br />> singleton :: a -> List a<br />> singleton a = Cons a Nil<br /><P><BR><br />> hsingleton :: Natural f (Free f)<br />> hsingleton f = Free (fmap Pure f)<br /><P><BR><br /></pre>Another is the generalization of <tt>foldMap</tt>. This can be found under a variety of names in the various <a href="http://www.haskellforall.com/2012/06/you-could-have-invented-free-monads.html">free monad</a> libraries out there but this implementation is designed to highlight the similarity between monoids and monads: <P><BR><pre><br />> foldMap :: Monoid m => (a -> m) -> List a -> m<br />> foldMap _ Nil = mempty<br />> foldMap f (Cons a as) = uncurry mappend $ left f $ right (foldMap f) (a, as)<br /><P><BR><br />> fold :: Monoid m => List m -> m<br />> fold = foldMap id<br /><P><BR><br />> hFoldMap :: (Functor f, Functor m, Monad m) => Natural f m -> Natural (Free f) m<br />> hFoldMap _ (Pure x) = return x<br />> hFoldMap f (Free x) = join $ hleft f $ hright (hFoldMap f) x<br /><P><BR><br />> hFold :: Monad f => Natural (Free f) f<br />> hFold = hFoldMap id<br /><P><BR><br /></pre>The similarity here isn't simply formal. If you think of a list as a sequence of instructions then <tt>foldMap</tt> interprets the sequence of instructions like a computer program. Similarly <tt>hFoldMap</tt> can be used to <a href="http://debasishg.blogspot.com/2013/01/a-language-and-its-interpretation.html">interpret</a> programs for which the free monad provides an abstract syntax tree. <P><BR>You'll find some of these functions <a href="https://hackage.haskell.org/package/control-monad-free-0.5.3/docs/Control-Monad-Free.html">here</a> by different names. <P><BR>Now we can consider <tt>Free</tt>. It's easy to show this is a <tt>HFunctor</tt> by copying a suitable definition for <tt>List</tt>: <P><BR><pre><br />> instance Functor List where<br />> fmap f = foldMap (singleton . 
f)<br /><P><BR><br />> instance HFunctor Free where<br />> ffmap = fmap<br />> hfmap f = hFoldMap (hsingleton . f)<br /><P><BR><br /></pre>We can define <tt>HMonad</tt> as follows: <P><BR><pre><br />> class HMonad m where<br />> hreturn :: Functor f => f a -> m f a<br />> hbind :: (Functor f, Functor g) => m f a -> Natural f (m g) -> m g a<br /><P><BR><br /></pre>Before making <tt>Free</tt> an instance, let's look at how we'd make <tt>List</tt> an instance of <tt>Monad</tt><P><BR><pre><br />> instance Monad List where<br />> return = singleton<br />> m >>= f = fold (fmap f m)<br /><P><BR><br /></pre>And now the instance I promised at the beginning. <P><BR><pre><br />> instance HMonad Free where<br />> hreturn = hsingleton<br />> hbind m f = hFold (hfmap f m)<br /><P><BR><br /></pre>I've skipped the proofs that the monad laws hold and that <tt>hreturn</tt> and <tt>hbind</tt> are actually natural transformations in <tt>Endo</tt>. Maybe I'll leave those as exercises for the reader. <P><BR><b>Update</b><p>After writing this I tried googling for "instance HMonad Free" and I found <a href="http://lpaste.net/93806">this</a> by haasn. There's some other good stuff in there too. sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com6tag:blogger.com,1999:blog-11295132.post-57154742591009961052014-02-01T17:53:00.000-08:002014-02-01T21:24:33.184-08:00Reinversion Revisited<b>Introduction</b><p>A <a href="http://blog.sigfpe.com/2011/10/quick-and-dirty-reinversion-of-control.html">while back</a> I talked about the idea of reinversion of control using the continuation monad to wrest control back from an interface that only wants to call you, but doesn't want you to call them back. I want to return to that problem with a slightly different solution. The idea is that we build an interpreter for an imperative language that's an embedded Haskell DSL. 
You arrange that the DSL does the work of waiting to be called by the interface, but from the point of view of the user of the DSL it looks like you're calling the shots. To do this I'm going to pull together a bunch of techniques I've talked about before. This approach is largely an application of what apfelmus described <a href="http://apfelmus.nfshost.com/articles/operational-monad.html">here</a>. <P><BR><b>The code</b><p>We'll start with some administrative stuff before getting down to the real code: <P><BR><pre><br />> {-# LANGUAGE TemplateHaskell #-}<br /><P><BR><br />> import Control.Lens<br />> import Control.Monad<br />> import Control.Monad.Loops<br /><P><BR><br /></pre>We'll make our DSL an imperative wrapper around Gloss: <P><BR><pre><br />> import Graphics.Gloss.Interface.Pure.Game<br /><P><BR><br /></pre>We'll define a structure that can be used to represent the abstract syntax tree (AST) of our DSL. Our DSL will support the reading of inputs, adding pictures to the current picture, and clearing the screen. <P><BR>First we'll need a wrapper that allows us to represent ordinary Haskell values in our DSL: <P><BR><pre><br />> data Basic a = Return a<br /><P><BR><br /></pre>Now we want an expression that represents events given to us by Gloss. Internally we'll represent this by a function that says what our program does if it's given an event. It says what our program does by returning another AST saying what happens when the input is received. (I've previously talked about these kinds of expression trees <a href="http://blog.sigfpe.com/2009/12/where-do-monads-come-from.html">here</a>). <P><BR><pre><br />> | Input (Event -> Basic a)<br /><P><BR><br /></pre>We have a command to render some graphics. It appends a new <tt>Picture</tt> to the current picture. 
Again, part of the AST must be another AST saying what happens after the picture is rendered: <P><BR><pre><br />> | Render Picture (Basic a)<br /><P><BR><br /></pre>And lastly here's the AST for a clear screen command: <P><BR><pre><br />> | Cls (Basic a)<br /><P><BR><br /></pre>Our AST will form a monad. This will allow us to build ASTs using ordinary Haskell do-notation. This technique is what I described previously <a href="http://blog.sigfpe.com/2010/01/monads-are-trees-with-grafting.html">here</a>. <P><BR><pre><br />> instance Monad Basic where<br />> return = Return<br />> Return a >>= f = f a<br />> Input handler >>= f = Input (\e -> handler e >>= f)<br />> Render p a >>= f = Render p (a >>= f)<br />> Cls a >>= f = Cls (a >>= f)<br /><P><BR><br /></pre>You can think of the expression <tt>x >>= f</tt> as <tt>x</tt> with the tree <tt>f a</tt> grafted in to replace any occurrence of <tt>Return a</tt> in it. This is exactly what <tt>Return a >>= f</tt> does. But applying <tt>>>= f</tt> to the other ASTs simply digs down "inside" the ASTs to find other occurrences of <tt>Return a</tt>. <P><BR>It's convenient to use lenses to view Gloss's game world: <P><BR><pre><br />> data World = World { _program :: Basic (), _picture :: Picture }<br />> $(makeLenses ''World)<br /><P><BR><br /></pre>And now we have some wrappers around the interpreter's commands. The <tt>return ()</tt> provides the convenient place where we can graft subtrees into our AST. <P><BR><pre><br />> input = Input return<br />> render p = Render p (return ())<br />> cls = Cls (return ())<br /><P><BR><br /></pre>Now we can start coding. Here's a test to see if a Gloss event is a key down event: <P><BR><pre><br />> keydown (EventKey (Char key) Down _ _) = True<br />> keydown (EventKey (SpecialKey KeySpace) Down _ _) = True<br />> keydown _ = False<br /><P><BR><br /></pre>And now here's a complete program using our DSL. It's deliberately very imperative. 
It simply iterates over a nested pair of loops, collecting keystrokes and displaying them. It reads a lot like an ordinary program written in a language like Python or Basic: <P><BR><pre><br />> mainProgram = do<br />> render (Color white $ Scale 0.2 0.2 $ Text "Type some text")<br /><P><BR><br />> forM_ [780, 760..] $ \ypos -> do<br />> forM_ [0, 20..980] $ \xpos -> do<br /><P><BR><br />> event <- iterateUntil keydown $ input<br /><P><BR><br />> let key = case event of<br />> EventKey (Char key) Down _ _ -> key<br />> EventKey (SpecialKey KeySpace) Down _ _ -> ' '<br /><P><BR><br />> when (ypos == 780 && xpos == 0) $ cls<br />> render $ Color white $ Translate (xpos-500) (ypos-400) $ Scale 0.2 0.2 $ Text $ [key]<br /><P><BR><br /></pre>Here is where we launch everything, placing our program and starting <tt>Blank</tt> picture into the <tt>World</tt>. <P><BR><pre><br />> main = play (InWindow "Basic" (1000, 800) (10, 10))<br />> black <br />> 60<br />> (World mainProgram Blank)<br />> (^. picture)<br />> handleEvent<br />> (const id)<br /><P><BR><br /></pre>So now we need just one more ingredient, an actual interpreter for our AST. It's the event handler: <P><BR><pre><br />> handleEvent :: Event -> World -> World<br /><P><BR><br /></pre>The <tt>Return</tt> command is purely a place to graft in subtrees. It should never be interpreted. <P><BR><pre><br />> handleEvent _ (World (Return a) _) = error "error!"<br /><P><BR><br /></pre>After receiving some input, I want the interpreter to keep interpreting commands such as <tt>Cls</tt> that don't need any more input. I'm going to do this by using a null event <tt>EventMotion (0,0)</tt>. But when an input really is desired, I want this null event to be ignored. <P><BR><pre><br />> handleEvent (EventMotion (0, 0)) state@(World (Input handler) _) = state<br /><P><BR><br /></pre>We render something by <tt>mappend</tt>ing it to the current picture stored in the <tt>World</tt>. 
But the rendering is carried out by the event handler. We update the state so that at the next event, the subtree of the AST is executed. This means that after updating the picture, the event still needs to be handed back to the event handler: <P><BR><pre><br />> handleEvent event state@(World (Render p cont) _) = state & (picture <>~ p) & (program .~ cont) & handleEvent event<br /><P><BR><br /></pre>Clearing the screen is similar: <P><BR><pre><br />> handleEvent event state@(World (Cls cont) _) = state & (picture .~ Blank) & (program .~ cont) & handleEvent event<br /><P><BR><br /></pre>And now we need to handle inputs. We do this by applying the "what happens when the input is received" function to the event. The result is put back in the state indicating that this is what we want to happen at the next event. So that the interpreter doesn't stop here, waiting for the next event, it sends itself a null event. <P><BR><pre><br />> handleEvent event state@(World (Input handler) _) = state & (program .~ handler event) & handleEvent (EventMotion (0, 0))<br /><P><BR><br /></pre>And that's it! <P><BR>There are many changes that can be made. We can easily add more commands and make the state more complex. But you might also notice that we create the AST only to tear it apart again in the interpreter. We can actually elide the AST creation, but that will eventually bring us back to something like what I originally <a href="http://blog.sigfpe.com/2011/10/quick-and-dirty-reinversion-of-control.html">posted</a>. This shouldn't be a big surprise: I've already shown how any monad can be replaced with the continuation monad <a href="http://blog.sigfpe.com/2008/12/mother-of-all-monads.html">here</a>. By the way, it's pretty easy to add a <tt>Fork</tt> command. You can replace the <tt>_program :: Basic ()</tt> field with <tt>_program :: [Basic ()]</tt> and interpret this as a list of threads using a scheduler of your choice. 
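<P><BR>Here's a rough sketch of what that <tt>Fork</tt> idea might look like. It is a standalone toy rather than an extension of the code above: to keep it free of Gloss I've assumed a cut-down <tt>Basic</tt> with no <tt>Input</tt>, and a made-up <tt>Say</tt> command standing in for <tt>Render</tt>. The names <tt>say</tt>, <tt>fork</tt> and <tt>schedule</tt> are hypothetical, and round-robin is just one possible choice of scheduler. <P><BR><pre>
-- Toy stand-in for the post's Basic type (assumption: Say replaces Render, no Input).
data Basic a
  = Return a
  | Say String (Basic a)        -- emit one line of output, then continue
  | Fork (Basic ()) (Basic a)   -- start a new thread, then continue

instance Functor Basic where
  fmap f (Return a) = Return (f a)
  fmap f (Say s k)  = Say s (fmap f k)
  fmap f (Fork t k) = Fork t (fmap f k)

instance Applicative Basic where
  pure = Return
  mf &lt;*&gt; mx = mf >>= \f -> fmap f mx

-- Same grafting idea as before: dig down to each Return and replace it.
instance Monad Basic where
  Return a >>= f = f a
  Say s k  >>= f = Say s (k >>= f)
  Fork t k >>= f = Fork t (k >>= f)

say :: String -> Basic ()
say s = Say s (Return ())

fork :: Basic () -> Basic ()
fork t = Fork t (Return ())

-- Round-robin scheduler: run one command of the thread at the front of the
-- list, then move what is left of that thread to the back.
schedule :: [Basic ()] -> [String]
schedule []               = []
schedule (Return () : ts) = schedule ts
schedule (Say s k : ts)   = s : schedule (ts ++ [k])
schedule (Fork t k : ts)  = schedule (ts ++ [k, t])

main :: IO ()
main = mapM_ putStrLn $ schedule
  [ do fork (mapM_ say ["b1", "b2"])
       mapM_ say ["a1", "a2"] ]
</pre>With this scheduler the two threads take turns, so the output should interleave as <tt>a1</tt>, <tt>b1</tt>, <tt>a2</tt>, <tt>b2</tt>. In the Gloss version the same list of threads would live in the <tt>World</tt> and <tt>handleEvent</tt> would play the role of <tt>schedule</tt>. 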
<P><BR><b>Acknowledgements</b><p>I was prompted to write this (a little late, I know) after reading <a href="http://www.thev.net/PaulLiu/invert-inversion.html">this article</a> and Tekmo's <a href="http://www.reddit.com/r/haskell/comments/z4inb/invert_the_inversion_of_control/c61fzrd">post on reddit</a>. I think ultimately continuations may perform better than using ASTs. But sometimes it's nice to build an AST because they give you an object that can easily be reasoned about and manipulated by code. Much as I love trickery with continuations, I find ASTs are much easier to think about. <P><BR><b>Postscript</b><p>My real motivation was that I was thinking about games. The rules of games are often given in imperative style: first player 1 does this. Then they do this. If this happens they do that. And then it's player two's turn. I wanted my Haskell code to reflect that style. <P><BR><b>Update</b><p>Added 'null' event to keep interpreter going when it makes sense to do so, but there's no event pending. sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com4tag:blogger.com,1999:blog-11295132.post-28604911791596145242013-10-25T21:00:00.000-07:002013-10-29T18:50:14.446-07:00Distributed computing with alien technology<b>Introduction</b><p>Suppose we are given a function <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle"> of <img src="https://chart.googleapis.com/chart?cht=tx&chl=2n" style="vertical-align:middle"> boolean arguments that returns a boolean result. 
Alice has <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> bits, <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5C%7Bx_1%2C%20x_2%2C%20%5Cdots%2C%20x_n%5C%7D" style="vertical-align:middle"> and Bob has another <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> bits <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5C%7By_1%2C%20y_2%2C%20%5Cldots%2C%20y_n%5C%7D" style="vertical-align:middle">. Alice and Bob are widely separated and don't know each other's bits. What is the total number of bits that Alice has to send to Bob and that Bob has to send to Alice so that between them they can compute <img src="https://chart.googleapis.com/chart?cht=tx&chl=f%28x_1%2C%20%5Cldots%2C%20x_n%2C%20y_1%2C%20%5Cldots%2C%20y_n%29" style="vertical-align:middle">? Think about how complex <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle"> might get. The <img src="https://chart.googleapis.com/chart?cht=tx&chl=x_i" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=y_i" style="vertical-align:middle"> might each describe half of a "voxelised" region of space and <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle"> might answer a question about a computational fluid dynamics (CFD) simulation running in that space. CFD simulations can be chaotic and so we might expect that in the worst case many bits have to be transferred back and forth between Alice and Bob. In the worst case we might expect that Alice has to send Bob all of her bits, or vice versa. <P><BR>But in fact Alice needs to send Bob just one bit. <P><BR><b>A loophole</b><p>To get the communication requirements down to one bit we need to use a loophole. But I hope to (1) justify the cheat to some extent and (2) justify that it's even worthwhile to think about cheats. 
<P><BR>Alice and Bob have access to some <a href="http://en.wikipedia.org/wiki/Ancient_technology_in_Stargate">Ancient technology</a>. They each have one of a pair of boxes. At prearranged times, Alice puts a bit into her box, and Bob puts a bit into his box. A bit pops back out of Alice's box and a bit pops back out of Bob's box. Whatever the input, Alice and Bob each have a 0.5 chance of seeing a one, and a 0.5 chance of seeing a zero, pop out of their respective boxes. But when the two outputs are <a href="http://en.wikipedia.org/wiki/Exclusive_or">XORed</a> together the result is the logical AND of the two inputs. With such boxes, Alice can compute <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle"> after Bob sends a single bit down a conventional communication channel. <P><BR> <center><a href="http://4.bp.blogspot.com/-W8wrhhYGSYo/Ums6t6E8B2I/AAAAAAAAEUY/38oA2Cdk2fs/s1600/nonlocal.png" imageanchor="1"><img border="0" src="http://4.bp.blogspot.com/-W8wrhhYGSYo/Ums6t6E8B2I/AAAAAAAAEUY/38oA2Cdk2fs/s400/nonlocal.png" /></a></center> <P><BR>"But this is a total cheat!" you complain before I even start to explain their technique. It seems Alice receives a bit that depends on what Bob input, and so Bob is communicating with Alice. But look closely and you'll see that the boxes don't allow any communication. No matter what Bob inputs, Alice has a 0.5 chance of getting zero and a 0.5 chance of getting one. There is no way Bob can use this to communicate anything. It's like intercepting a message encrypted with a <a href="http://en.wikipedia.org/wiki/One-time_pad">one-time pad</a>. Without the pad, the message is basically a sequence of random bits. Nonetheless, it is true that the outputs that Alice and Bob see are correlated. <P><BR>I hope I've convinced you that Alice and Bob can't send any bits with these boxes. Despite this, it is pretty clear that the behaviour of the boxes is non-local. 
We'll call any kind of boxes that allow instantaneous long range correlations that can't be explained by purely local behaviour <i>non-local boxes</i>. Boxes that can't be used for message sending are called <i>non-signalling local boxes</i>. And the particular non-local box I describe above is called a <i>PR box</i> (eg. see <a href="http://en.wikipedia.org/wiki/Quantum_nonlocality#Superquantum_nonlocality">here</a>). <P><BR>(As an aside, note that because the box produces widely separated outputs that are correlated, but doesn't allow communication, it's an example of how non-locality doesn't imply communication. Usually when people want to give examples of such a thing they talk about quantum mechanics. But there's no need to mention quantum mechanics to explain the behaviour of these particular non-local boxes.) <P><BR><b>The method</b><p>Any single bit boolean function of a finite sequence of bits can be written as a polynomial modulo 2. Each monomial in the polynomial can be written as a product of terms involving just the <img src="https://chart.googleapis.com/chart?cht=tx&chl=x_i" style="vertical-align:middle"> and terms involving just the <img src="https://chart.googleapis.com/chart?cht=tx&chl=y_i" style="vertical-align:middle">, ie. 
<blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=f%28x_1%2C%20%5Cldots%2C%20x_n%2C%20y_1%2C%20%5Cldots%2C%20y_n%29%20%3D%20%5Csum_%7Bj%7DA_jB_j" style="vertical-align:middle"></blockquote>where <img src="https://chart.googleapis.com/chart?cht=tx&chl=A_j" style="vertical-align:middle"> depends only on the <img src="https://chart.googleapis.com/chart?cht=tx&chl=x_i" style="vertical-align:middle">, <img src="https://chart.googleapis.com/chart?cht=tx&chl=B_j" style="vertical-align:middle"> depends only on the <img src="https://chart.googleapis.com/chart?cht=tx&chl=y_i" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=j" style="vertical-align:middle"> is drawn from some finite set. Alice can compute the <img src="https://chart.googleapis.com/chart?cht=tx&chl=A_j" style="vertical-align:middle"> and Bob can compute the <img src="https://chart.googleapis.com/chart?cht=tx&chl=B_j" style="vertical-align:middle">. Now Alice and Bob, in parallel, feed <img src="https://chart.googleapis.com/chart?cht=tx&chl=A_j" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=B_j" style="vertical-align:middle"> respectively into their PR boxes. We know that we could evaluate each term in the sum we want by adding Alice's output to Bob's output. But that would require sending one one-bit message for each <img src="https://chart.googleapis.com/chart?cht=tx&chl=j" style="vertical-align:middle">. But we don't need each term one by one; we just want the sum. So Alice and Bob can individually sum their separate outputs knowing that adding Alice's output and Bob's output modulo 2 will be the correct sum. So Bob sends his sum to Alice. Alice adds that number to her own (modulo 2) and that's the value we want. Only one one-bit message was sent. <P><BR><b>But what about reality?</b><p>Non-local boxes don't exist, do they? So why are we talking about them? 
<P><BR>Actually, non-local boxes exist both theoretically and in the lab. <a href="http://en.wikipedia.org/wiki/Quantum_nonlocality">Non-local correlations in quantum mechanics</a> allow them to be constructed. But for this article I wanted to abstract from quantum mechanics and talk about the behaviour of a non-local box without getting my hands dirty with the details of quantum mechanics. Having said that, although non-local boxes do exist, the special case of the PR box <a href="http://en.wikipedia.org/wiki/Tsirelson's_bound">can't</a> in fact be constructed with quantum mechanics. In some sense it allows correlations that are "too strong". An <a href="http://blog.sigfpe.com/2010/11/beating-odds-with-entangled-qubits.html">article</a> I wrote a while back describes the closest you can get to building a PR box with quantum correlations. Curiously, if you restrict yourself to the kind of non-local box quantum mechanics allows you to build you find that some functions can still be computed with less communication than you'd need if non-local correlations are disallowed. Nonetheless, the worst case scenario with QM still requires the sending of <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle"> bits. <P><BR>Going further there's an interesting conjecture. It says that any non-local box that is even marginally better (in some sense) than what quantum mechanics allows is powerful enough to allow the computation of any <img src="https://chart.googleapis.com/chart?cht=tx&chl=f" style="vertical-align:middle"> with only a single bit of communication. It suggests that quantum mechanics is right at the edge of the space of possible physics that make life difficult for us. If quantum mechanics were to be tweaked the tiniest amount to make correlations any stronger, large numbers of difficult distributed computing problems would suddenly collapse to become trivial. 
If the conjecture is true it means that nature looks a bit like a conspiracy to keep computer scientists in work. (It's possible the conjecture has been decided one way or the other by now.) <P><BR><b>Final words</b><p>There are a couple of papers about universes where PR boxes can be built; so called <a href="http://arxiv.org/abs/1210.0618">boxworlds</a>. There is a <a href="http://arxiv.org/abs/quant-ph/0211089">lot</a> of interesting theoretical work in characterising quantum mechanics. In particular there are a number of theorems and conjectures that describe QM in the form "the most X theory that doesn't allow Y" where X is an interesting property and Y is something you'd like to do. <P><BR><b>References</b><p>I learnt all of this from the paper <a href="http://arxiv.org/abs/quant-ph/0501159">Implausible Consequences of Superstrong Nonlocality</a> by Wim van Dam. sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com3tag:blogger.com,1999:blog-11295132.post-23255842911365007142013-10-12T09:12:00.000-07:002013-10-12T09:47:35.839-07:00What stops us defining Truth?<b>Introduction</b><br />Recall the standard cartoon sketch of the proof of Gödel's first <a href="http://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_theorems">incompleteness theorem</a>. We start by defining a predicate, <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov" style="vertical-align: middle;" />, that is true if and only if its argument is provable. (Or more accurately, <img src="https://chart.googleapis.com/chart?cht=tx&chl=Prov%28g%29" style="vertical-align: middle;" /> is true if <img src="https://chart.googleapis.com/chart?cht=tx&chl=g" style="vertical-align: middle;" /> is the Gödel number of a provable proposition.) 
With some <a href="http://en.wikipedia.org/wiki/Quine_%28computing%29">quining</a> we can use this to construct the proposition <img src="https://chart.googleapis.com/chart?cht=tx&chl=G" style="vertical-align: middle;" /> which says <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cneg%20Prov%28G%29" style="vertical-align: middle;" />. The proposition <img src="https://chart.googleapis.com/chart?cht=tx&chl=G" style="vertical-align: middle;" /> asserts its own unprovability. <br />Suppose instead we define a predicate <img src="https://chart.googleapis.com/chart?cht=tx&chl=True" style="vertical-align: middle;" /> which holds if its argument is true. We can use this to construct the proposition <img src="https://chart.googleapis.com/chart?cht=tx&chl=H" style="vertical-align: middle;" /> which says <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cneg%20True%28H%29" style="vertical-align: middle;" />. Then if <img src="https://chart.googleapis.com/chart?cht=tx&chl=H" style="vertical-align: middle;" /> is true it must also be false and if it's false then it must be true. We seem to have a paradox. The loophole is that we assumed the existence of the predicate <img src="https://chart.googleapis.com/chart?cht=tx&chl=True" style="vertical-align: middle;" />. So this argument demonstrates that there is actually no such predicate. This is <a href="http://en.wikipedia.org/wiki/Tarski's_undefinability_theorem">Tarski's undefinability theorem</a>. <br />But what exactly stops us defining <img src="https://chart.googleapis.com/chart?cht=tx&chl=True" style="vertical-align: middle;" />? What goes wrong if we attempt to define a predicate that analyses the parts of a proposition to tell us whether or not it is true? <br /><br /><b>Note</b><br />This article is written in English. 
But as is standard in much of mathematics, unless I state otherwise, I'm using English largely as shorthand for an argument that could, in principle, be written in the formal language of Set Theory. So I will allow myself to use all of the usual reasoning methods that are available in ZF, even when talking about other formal systems such as Peano Arithmetic. <br /><br /><b>Defining Truth for Propositional Calculus</b><br />Suppose we're given a proposition from propositional calculus like <img src="https://chart.googleapis.com/chart?cht=tx&chl=A%5Cwedge%20B" style="vertical-align: middle;" />. We can use a syntactic approach to determining whether or not it is true. We determine whether or not <img src="https://chart.googleapis.com/chart?cht=tx&chl=A" style="vertical-align: middle;" /> is true, then whether or not <img src="https://chart.googleapis.com/chart?cht=tx&chl=B" style="vertical-align: middle;" /> is true, and then the whole proposition is true if both <img src="https://chart.googleapis.com/chart?cht=tx&chl=A" style="vertical-align: middle;" /> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=B" style="vertical-align: middle;" /> are true. Similarly <img src="https://chart.googleapis.com/chart?cht=tx&chl=A%5Cvee%20B" style="vertical-align: middle;" /> is true if either <img src="https://chart.googleapis.com/chart?cht=tx&chl=A" style="vertical-align: middle;" /> or <img src="https://chart.googleapis.com/chart?cht=tx&chl=B" style="vertical-align: middle;" /> is true. 
Of course <img src="https://chart.googleapis.com/chart?cht=tx&chl=A" style="vertical-align: middle;" /> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=B" style="vertical-align: middle;" /> might themselves be compound propositions using <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cwedge" style="vertical-align: middle;" />, <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cvee" style="vertical-align: middle;" /> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cneg" style="vertical-align: middle;" />. But that's fine, that simply means that to define truth for such propositions we need to employ recursion. In fact, we can straightforwardly turn such a definition into a recursive computer program. <br />(Ultimately with propositional calculus we hit the leaves which are atomic propositions like <img src="https://chart.googleapis.com/chart?cht=tx&chl=P" style="vertical-align: middle;" />. Typically when we ask about the truth of a proposition in propositional calculus we've already made an assignment of truth values to the atomic propositions. So the base case for the recursion is straightforward.) <br />We can illustrate the process with a diagram: <br /><a href="http://1.bp.blogspot.com/-MtdQbPOOZWg/Ula_Cpkz76I/AAAAAAAAEQ0/VTie_xMlRNo/s1600/tree1.png" imageanchor="1"><img border="0" src="http://1.bp.blogspot.com/-MtdQbPOOZWg/Ula_Cpkz76I/AAAAAAAAEQ0/VTie_xMlRNo/s1600/tree1.png" /></a> <br />The truth value of a node in the tree is determined by the truth of the propositions hanging underneath it. We have a parent-child relation between a proposition and its subexpressions. Recursion allows us to make a definition by defining what happens on the leaves of such a tree, and by saying how the definition at a node is built from that of its children. <br /><br /><b>Defining truth for Peano Arithmetic</b><br />We can go further and attempt this approach with Peano Arithmetic (PA). 
The catch is that we need to consider quantifiers. For example, consider this proposition from Peano arithmetic: <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cforall%20x.%5Cneg%28Sx%3D0%29" style="vertical-align: middle;" />. This proposition is true if and only if <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cneg%28Sx%3D0%29" style="vertical-align: middle;" /> is true whatever number we substitute for <img src="https://chart.googleapis.com/chart?cht=tx&chl=x" style="vertical-align: middle;" /> in the expression. <br /><a href="http://3.bp.blogspot.com/-B_7nYmPprKo/UlbNG_Qj4EI/AAAAAAAAERE/LKs7sGwb45Q/s1600/tree2.png" imageanchor="1"><img border="0" src="http://3.bp.blogspot.com/-B_7nYmPprKo/UlbNG_Qj4EI/AAAAAAAAERE/LKs7sGwb45Q/s1600/tree2.png" /></a> <br />The proposition at the top of the tree above is true if all of the immediate children are true and their truth is in turn determined by the truth of the propositions immediately below them. With some work this eventually leads to a perfectly good definition of truth for propositions in PA. Because we have nodes with infinitely many children we don't get an algorithm guaranteed to terminate, but that's not a problem for a definition in ZF. Note that we don't literally prove the infinitely many child propositions one at a time. Instead what happens is that to define the truth of <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cforall%20x.Q" style="vertical-align: middle;" /> we define it in terms of the truth of some infinite family of propositions all based on <img src="https://chart.googleapis.com/chart?cht=tx&chl=Q" style="vertical-align: middle;" />. ZF is perfectly good at dealing with such definitions without us having to list every element of our family explicitly. <br />Note how in this case the tree isn't the parse tree of the proposition. It's much bigger with nodes that have infinite branching. 
But that's fine, there's nothing about infinite branching that prevents us making a recursive definition. So we can ultimately extend the idea for defining truth in propositional calculus to include quantifiers and then all of Peano arithmetic. <br /><br /><b>Defining truth for ZF</b><br />But the approach used for PA looks like it might work perfectly well for ZF as well. For example, our definition of truth would say that <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5Cforall%20x.P%28x%29" style="vertical-align: middle;" /> is true if <img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28x%29" style="vertical-align: middle;" /> is true whatever set we substitute for <img src="https://chart.googleapis.com/chart?cht=tx&chl=x" style="vertical-align: middle;" />. In ZF there is no difficulty in defining a predicate that uses quantification over all sets. So it seems we can define <img src="https://chart.googleapis.com/chart?cht=tx&chl=True" style="vertical-align: middle;" /> for ZF in ZF, contradicting Tarski's theorem. <br /><br /><b>What went wrong?</b><br />Recursive definitions typically rely on the parent-child relation I mentioned above. To recursively define something we (1) define it for all leaves and then (2) specify how the definition at a parent is given in terms of the value for all of its children. We then invoke a recursion theorem of some sort to show how this uniquely defines our object for everything in our universe. For example, one form of recursion in Peano arithmetic has <img src="https://chart.googleapis.com/chart?cht=tx&chl=0" style="vertical-align: middle;" /> as its leaf and the only child of <img src="https://chart.googleapis.com/chart?cht=tx&chl=S%28n%29" style="vertical-align: middle;" /> is <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align: middle;" />. The induction axiom for PA can be used to show that definitions using this parent-child relation are valid. 
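<br /><br />The two-step recipe above, specialised to the natural numbers, can be sketched in Python. The helper name <code>recurse</code> and the two example functions are hypothetical, chosen only to show a value being fixed at the leaf 0 and then built at S(n) from the value at its single child n.

```python
def recurse(leaf_value, step):
    """Return the function f with f(0) = leaf_value and f(n + 1) = step(n, f(n))."""
    def f(n):
        if n == 0:
            # (1) The definition on the leaf.
            return leaf_value
        # (2) The definition at the parent n = S(n - 1) from its only child n - 1.
        return step(n - 1, f(n - 1))
    return f


# Two functions defined purely by their value at 0 and a step rule.
factorial = recurse(1, lambda n, prev: (n + 1) * prev)
triangular = recurse(0, lambda n, prev: prev + n + 1)

print(factorial(5))
print(triangular(4))
```

In code the recursion is justified by the fact that n decreases to 0 in finitely many steps; the recursion theorem plays the analogous role for the mathematical definition.<br />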
<br />Similarly in ZF we have the empty set as leaf and the children of a set are simply its elements. But now we need to look closely at the recursion principle we need. For ZF we need to invoke the <a href="http://en.wikipedia.org/wiki/Transfinite_induction#Transfinite_recursion">Transfinite Recursion Theorem</a>. Transfinite recursion is very powerful. It's not just limited to induction over sets. It can also be used for induction over classes. For example if you need to recursively define a function on the class of all sets it can allow this. (Strictly speaking it'll be a <i>function class</i> rather than a function.) But now comes the catch. If you take a look at the Wikipedia article it mentions that the parent-child relation, <img src="https://chart.googleapis.com/chart?cht=tx&chl=R" style="vertical-align: middle;" />, needs to be <i>set-like</i> (though as the article is currently written it's almost an afterthought). For this theorem to apply we need the collection of children of a proposition to form a set. But to prove the truth of a proposition with a quantifier at the front we need to prove something is true for all children where there is one child for each set. This means the children don't form a set. So we can't use transfinite recursion. And this means the informal definition of truth I gave above can't be turned into a rigorous definition. <br /><br /><b>Conclusion</b><br />I think this issue is quite subtle. It's really easy to say in English "this thing is true if that thing is true for all sets". Such a sentence in isolation can often be turned into a rigorous proposition in ZF. But if that sentence is part of a collection of sentences that refer to each other forming an attempt at a mutually recursive definition, you need to check precisely what parent-child relation you're using. 
<br /><br /><b>Acknowledgement</b><br />Thanks to <a href="http://mathoverflow.net/users/3902/sridhar-ramesh">Sridhar Ramesh</a> for making clear to me why the attempted definition of truth in ZF doesn't work. But I've probably made some mistakes above and they have nothing to do with Sridhar.<br /><br /><h2 style="text-align: left;">Why Heisenberg can't stop atomic collapse</h2><div>Posted by sigfpe, 2013-04-14</div><h4>TL;DR</h4><div>A heuristic argument to show that hydrogen atoms are stable and have a minimum energy level is wrong. I will assume undergraduate level quantum mechanics in the discussion.</div><h4>Introduction</h4>There's a popular argument used to explain why atoms are stable. It shows there is a lower bound on the energy level of an electron in the atom that makes it impossible for electrons to keep "falling" forever all the way down to the nucleus. You'll <a href="http://www.chem1.com/acad/webtut/atomic/WhyTheElectron.html">find</a> it not only in popular <a href="http://books.google.com/books?id=KsnbNL_rh04C&lpg=PP1&pg=SA2-PA6#v=onepage&q=compromise&f=false">science</a> <a href="http://books.google.com/books?id=-vhCqN2twGQC&lpg=PA634&dq=hydrogen%20energy%20level%20uncertainty%20principle&pg=PA102#v=onepage&q=uncertainty%20principle&f=false">books</a> but in <a href="http://web.njit.edu/~gary/321/Lecture5.html">courses</a> and textbooks on <a href="http://www.pha.jhu.edu/~rt19/hydro/node1.html">quantum</a> mechanics.<br /><br />A rough version of the argument goes like this: the closer an electron falls towards the nucleus the lower its potential energy gets. But the more closely bound to the nucleus it is, the more accurately we know its position and hence, by Heisenberg's uncertainty principle (HUP), the less accurately we know its momentum. Increased variance in the momentum corresponds to an increase in kinetic energy. 
Eventually the decrease in potential energy as the electron falls is balanced by an increase in kinetic energy and the electron has reached a stable state.<br /><br />The problem is, this argument is wrong. It's wrong for reasons related to the kind of heuristic reasoning about wavefunctions that I've talked about <a href="http://blog.sigfpe.com/2013/01/aliasing-and-heisenberg-uncertainty.html">before</a>.<br /><br />Before showing it's wrong, let's make the argument a bit more rigorous.<br /><h4>Bounding wavefunctions</h4><div class="separator" style="clear: both; text-align: left;">The idea is to show that for any possible normalised wavefunction ψ of an electron in a Coulomb potential, the expected energy is bounded below by some constant. So we need to show that</div><div class="separator" style="clear: both; text-align: center;"><a href="http://chart.apis.google.com/chart?cht=tx&chl=\langle\psi|E|\psi\rangle" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://chart.apis.google.com/chart?cht=tx&chl=\langle\psi|E|\psi\rangle" /></a></div><div class="" style="clear: both; text-align: left;">is bounded below where</div><div class="separator" style="clear: both; text-align: center;"><a href="http://chart.apis.google.com/chart?cht=tx&chl=E=\frac{p^2}{2m}%2D\frac{e^2}{4\pi\epsilon_0r}" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://chart.apis.google.com/chart?cht=tx&chl=E=\frac{p^2}{2m}%2D\frac{e^2}{4\pi\epsilon_0r}" /></a></div><div class="" style="clear: both; text-align: left;">and <i>p</i> is momentum.</div><div class="" style="clear: both; text-align: left;">Consider a wavefunction that is confined mainly around the nucleus so</div><div class="" style="clear: both; text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="http://chart.apis.google.com/chart?cht=tx&chl=\langle\psi|x^2|\psi\rangle=a^2" imageanchor="1" style="margin-left: 1em; margin-right: 
1em;"><img border="0" src="http://chart.apis.google.com/chart?cht=tx&chl=\langle\psi|x^2|\psi\rangle=a^2" style="cursor: move;" /></a></div><br /></div><div class="" style="clear: both; text-align: left;">The first fact we need is that the <a href="http://en.wikipedia.org/wiki/Uncertainty_principle">Heisenberg uncertainty principle</a> tells us that </div><div class="separator" style="clear: both; text-align: center;"><a href="http://chart.apis.google.com/chart?cht=tx&chl=\langle\psi|x^2|\psi\rangle\langle\psi|p^2|\psi\rangle\ge\hbar^2/4" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://chart.apis.google.com/chart?cht=tx&chl=\langle\psi|x^2|\psi\rangle\langle\psi|p^2|\psi\rangle\ge\hbar^2/4" /></a></div>(assuming we're in a frame of reference where the expected values of <i>p</i> and <i>x</i> are zero).<br /><br />If the wavefunction is spread out with a standard deviation of <i>a</i> then the electron is mostly around a distance <i>a</i> from the nucleus. So the second fact is that we can roughly approximate the expected value of 1/<i>r</i> as 1/<i>a</i>.<br /><br />Combine these two facts and we get, roughly, that<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://chart.apis.google.com/chart?cht=tx&chl=\langle\psi|E|\psi\rangle\ge\frac{\hbar^2}{8ma^2}-\frac{e^2}{4\pi\epsilon_0a}" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://chart.apis.google.com/chart?cht=tx&chl=\langle\psi|E|\psi\rangle\ge\frac{\hbar^2}{8ma^2}-\frac{e^2}{4\pi\epsilon_0a}" style="cursor: move;" /></a></div>I hope you can see that the right hand side, as a function of <i>a</i>, is bounded below. 
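<br /><br />To put numbers on that right hand side, here is a small Python sketch. It assumes SI units and CODATA values for the constants, and the names <code>energy_bound</code>, <code>a_min</code> and <code>e_min_ev</code> are made up for this example. Setting the derivative with respect to <i>a</i> to zero gives the minimising value of <i>a</i> in closed form, and a few nearby values are evaluated as a sanity check.

```python
import math

# Physical constants in SI units (CODATA values, assumed for this sketch).
HBAR = 1.054571817e-34      # reduced Planck constant, J s
M_E = 9.1093837015e-31      # electron mass, kg
E_CHARGE = 1.602176634e-19  # elementary charge, C
EPS0 = 8.8541878128e-12     # vacuum permittivity, F/m

# Coulomb coupling e^2 / (4 pi eps0), so the potential term is -K / a.
K = E_CHARGE**2 / (4 * math.pi * EPS0)


def energy_bound(a):
    """The heuristic bound hbar^2 / (8 m a^2) - e^2 / (4 pi eps0 a), in joules."""
    return HBAR**2 / (8 * M_E * a**2) - K / a


# d/da of the bound vanishes where hbar^2 / (4 m a^3) = K / a^2.
a_min = HBAR**2 / (4 * M_E * K)
e_min_ev = energy_bound(a_min) / E_CHARGE

print(f"a_min = {a_min:.3e} m")
print(f"minimum of the bound = {e_min_ev:.1f} eV")

# The bound is higher on either side of a_min.
for scale in (0.5, 0.9, 1.1, 2.0):
    assert energy_bound(scale * a_min) > energy_bound(a_min)
```

With these constants the minimum sits at a quarter of the Bohr radius and comes out at a few tens of electron volts below zero, the same order of magnitude as the measured ground state energy of about -13.6 eV.<br />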
The graph of the right hand side as a function of <i>a</i> looks like:<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-zoXWqteM-gM/UUJswRdpssI/AAAAAAAACFU/VAZ4W8awVN0/s1600/potential.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="206" src="http://4.bp.blogspot.com/-zoXWqteM-gM/UUJswRdpssI/AAAAAAAACFU/VAZ4W8awVN0/s320/potential.png" width="320" /></a></div>It's now an exercise in calculus to find a lower bound on the expected energy. You can find the details in countless places on the web. Here's a link to an example from <a href="http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-007-electromagnetic-energy-from-motors-to-lasers-spring-2011/lecture-notes/MIT6_007S11_lec38.pdf">MIT</a>, which may have come directly from Feynman's Lectures on Physics.<br /><br /><h4>The problem</h4><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-HV8KT3aarrg/UUJvvFtF8MI/AAAAAAAACFc/3BIZYD2AbG0/s1600/fuzz.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="225" src="http://1.bp.blogspot.com/-HV8KT3aarrg/UUJvvFtF8MI/AAAAAAAACFc/3BIZYD2AbG0/s320/fuzz.png" width="320" /></a></div>The above discussion assumes that the wavefunction is basically a single packet confined around a distance <i>a</i> from the nucleus, something like that graphed above. But if a lower energy state can be found with a different wavefunction the electron will eventually find it, or an even lower energy state. 
In fact, by using a wavefunction with multiple peaks we will find that the Heisenberg uncertainty principle doesn't give a lower bound at all.<br /><br />We'll use a wavefunction like this:<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-98fhKE2Xkc0/UUJw1O7PikI/AAAAAAAACFk/_YSzdFb7e7c/s1600/double.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="225" src="http://4.bp.blogspot.com/-98fhKE2Xkc0/UUJw1O7PikI/AAAAAAAACFk/_YSzdFb7e7c/s320/double.png" width="320" /></a></div>It has a packet around the origin just like before but it also has a sharp peak around <i>r</i>=<i>l</i>. As I'm showing ψ as a function of <i>r</i> this means we have a shell of radius <i>l</i>.<br /><br />Let's say<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://images-onepick-opensocial.googleusercontent.com/gadgets/proxy?container=onepick&gadget=a&rewriteMime=image%2F*&url=http%3A%2F%2Fchart.apis.google.com%2Fchart%3Fcht%3Dtx%26chl%3D%5Cpsi%3D%5Cfrac%7B1%7D%7B%5Csqrt%7B2%7D%7D(%5Cpsi_1%252B%5Cpsi_2)" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="http://images-onepick-opensocial.googleusercontent.com/gadgets/proxy?container=onepick&gadget=a&rewriteMime=image%2F*&url=http%3A%2F%2Fchart.apis.google.com%2Fchart%3Fcht%3Dtx%26chl%3D%5Cpsi%3D%5Cfrac%7B1%7D%7B%5Csqrt%7B2%7D%7D(%5Cpsi_1%252B%5Cpsi_2)" /></a></div><br />where ψ<sub>1</sub> is normalized and peaked near the origin and ψ<sub>2</sub> is our shell of radius <i>l</i>. 
Assume no overlap between ψ<sub>1</sub> and ψ<sub>2</sub>.<br /><br />In this case you can see that we can make<br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://images-onepick-opensocial.googleusercontent.com/gadgets/proxy?container=onepick&gadget=a&rewriteMime=image%2F*&url=http%3A%2F%2Fchart.apis.google.com%2Fchart%3Fcht%3Dtx%26chl%3D%5Clangle%5Cpsi%7Cx%5E2%7C%5Cpsi%5Crangle%3D%5Cfrac%7B1%7D%7B2%7D(%5Clangle%5Cpsi_1%7Cx%5E2%7C%5Cpsi_1%5Crangle%252B%5Clangle%5Cpsi_2%7Cx%5E2%7C%5Cpsi_2%5Crangle)" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://images-onepick-opensocial.googleusercontent.com/gadgets/proxy?container=onepick&gadget=a&rewriteMime=image%2F*&url=http%3A%2F%2Fchart.apis.google.com%2Fchart%3Fcht%3Dtx%26chl%3D%5Clangle%5Cpsi%7Cx%5E2%7C%5Cpsi%5Crangle%3D%5Cfrac%7B1%7D%7B2%7D(%5Clangle%5Cpsi_1%7Cx%5E2%7C%5Cpsi_1%5Crangle%252B%5Clangle%5Cpsi_2%7Cx%5E2%7C%5Cpsi_2%5Crangle)" /></a></div><div class="separator" style="clear: both; text-align: left;">as large as we like by making <i>l</i> as large as we like while still leaving us free to make the central peak whatever shape we want. 
This means that the estimate of </div><div class="separator" style="clear: both; text-align: center;"><a href="http://images-onepick-opensocial.googleusercontent.com/gadgets/proxy?container=onepick&gadget=a&rewriteMime=image%2F*&url=http%3A%2F%2Fchart.apis.google.com%2Fchart%3Fcht%3Dtx%26chl%3D%5Clangle%5Cpsi%7Cp%5E2%7C%5Cpsi%5Crangle" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://images-onepick-opensocial.googleusercontent.com/gadgets/proxy?container=onepick&gadget=a&rewriteMime=image%2F*&url=http%3A%2F%2Fchart.apis.google.com%2Fchart%3Fcht%3Dtx%26chl%3D%5Clangle%5Cpsi%7Cp%5E2%7C%5Cpsi%5Crangle" /></a></div><div class="separator" style="clear: both; text-align: left;">coming from HUP can be made as small as we like while making the central peak as close to a Dirac delta as we want. Informally, HUP controls the overall spread of the wavefunction but not the spread of individual peaks within it.</div><div class="separator" style="clear: both; text-align: left;"><br /></div>For a large enough shell, ψ<sub>2</sub> contributes little to the total expected potential energy, but ψ<sub>1</sub> can contribute an arbitrarily low amount because we can concentrate it in areas where 1/<i>r</i> is as large as we want. So we can make the total expected potential energy as low as we like. And yet we can also keep the estimate of the kinetic energy given by HUP as close to zero as we like. So contrary to the original argument, the Heisenberg uncertainty principle doesn't give us a lower bound on the energy at all. The argument is wrong.<br /><h4>But wait, we know there is a lowest energy state...</h4><div>Yes, the energy of a wavefunction in a Coulomb potential is in fact bounded below. After all, atoms are stable. But the Heisenberg uncertainty principle doesn't show it. The inequality in HUP becomes an equality when the wavefunction is a Gaussian function. It provides a good bound for functions that are roughly Gaussian, ie. 
that form a single "lump". But it provides only weak bounds for wavefunctions with multiple peaks and in this case it's not the appropriate tool to use.</div><div><br /></div><div>The Heisenberg uncertainty principle is an inequality about ordinary functions interpreted in the context of quantum mechanics (QM). The field of functional analysis provides many such inequalities. A great paper by Lieb, <a href="http://www.pas.rochester.edu/~rajeev/phy246/lieb.pdf">The Stability of Matter</a>, gives an inequality due to Sobolev that can also be interpreted in the context of QM. Sobolev's inequality is more appropriate when considering the hydrogen atom and it gives a good lower bound, demonstrating that the hydrogen atom is stable after all.</div><h4>But wait, the Heisenberg uncertainty principle argument gives the right energy...</h4><div>Getting a correct answer doesn't always justify the methods. I can give at least two reasons why the original method appears to work.</div><div><br /></div><div>1. The HUP gives a good bound for wavefunctions that are roughly Gaussian. The lowest energy level for the hydrogen atom is given (very roughly) by such a function. So an estimate based on HUP should be roughly correct. However, HUP alone can't tell us that the lowest energy state is Gaussian. The argument is only useful if we can get this information from somewhere else.</div><div><br /></div><div>2. You can get an estimate for the lowest energy level of the hydrogen atom (assuming it exists) by dimensional analysis. Invalid physical arguments that are dimensionally correct will often give the correct result because there is only one dimensionally correct expression possible.</div><h4>But wait, it's just a heuristic argument...</h4><div>Heuristic arguments are crucial to physics. But when similar heuristic arguments give opposite results they become problematic. 
In particular, it's no good saying an argument is inexact or qualitative when it gives a bound on the energy that isn't just off by an order of magnitude, but completely fails to give a bound at all. Part of the issue here is that the Coulomb potential goes to minus infinity as <i>r</i> goes to zero and so more care is required. The HUP argument above can be adapted to give good results when the potential is bounded below, for example it gives a reasonable estimate for square wells.<br /><br />But there may be a clever way of using HUP to bound the energy that I haven't seen. If you can see it, please tell me.<br /><h4>The source</h4>Most of what I said above I learnt from the excellent paper on the <a href="http://www.pas.rochester.edu/~rajeev/phy246/lieb.pdf">Stability of Matter</a> by Lieb that I mentioned above.</div><h2 style="text-align: left;">Aliasing and the Heisenberg uncertainty principle.</h2><div>Posted by sigfpe, 2013-01-13</div><div><b>TL;DR</b><br /><b><br /></b>The <a href="http://en.wikipedia.org/wiki/Dirac_comb">Dirac comb</a> is an example of a wavefunction whose position and momentum aren't fuzzy.<br /><b><br /></b><b>Introduction</b></div><div><br /></div>The Heisenberg uncertainty principle says that if you have a particle in some state and observe either its momentum or its position then the product of the standard deviations of the distributions of the outcomes satisfies this inequality:<br /><div><div class="separator" style="clear: both; text-align: center;"><a href="http://chart.apis.google.com/chart?cht=tx&chl=\Delta_x\Delta_p\ge{\hbar\over2}" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://chart.apis.google.com/chart?cht=tx&chl=\Delta_x\Delta_p\ge{\hbar\over2}" /></a></div></div><div><br /></div><div>I think many people have a mental picture a bit like this:</div><div class="separator" 
style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-0QAKEv2RDj4/UMIToJcuihI/AAAAAAAABvs/8ZebaYdJteg/s1600/clouds.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="144" src="http://3.bp.blogspot.com/-0QAKEv2RDj4/UMIToJcuihI/AAAAAAAABvs/8ZebaYdJteg/s320/clouds.png" width="320" /></a></div><div><div class="separator" style="clear: both; text-align: center;"><br /></div>You can know the position and momentum with some degree of fuzziness and you can trade the fuzziness between the two measurements as long as the product of their sizes is larger than ℏ/2.</div><div><br /></div><div>Here's another way of thinking about that kind of picture (assuming some units I haven't specified):</div><div><br /><center><div><table><tbody><tr><td>position</td><td>=</td><td>123.4???</td></tr><tr><td>momentum</td><td>=</td><td>65?.???</td></tr></tbody></table></div></center></div><div>The idea is that the question mark represents digits we don't know well. As you move towards the right in the decimal representation our certainty in the accuracy of the digit quickly goes downhill to the point where we can't reasonably write digits.</div><div><br /></div><div>But this picture is highly misleading. For example, the following state of affairs is also compatible with the uncertainty principle, in suitably chosen units:<br /><br /><center><table><tbody><tr><td>position</td><td>=</td><td>...???.123...</td></tr><tr><td>momentum</td><td>=</td><td>...???.654...</td></tr></tbody></table></center><br />In other words, it's compatible with the uncertainty principle that we could know the digits beyond the decimal point to as much accuracy as we like as long as we don't know the digits before the point. 
It trivially satisfies Heisenberg's inequality because the variance of the position and the momentum aren't even finite quantities.<br /><br />But being compatible with Heisenberg uncertainty isn't enough for something to be realisable as a physical state. Is there a wavefunction that allows us to know the digits to the right of the decimal point as far as we want for both position and momentum measurements?<br /><br /><b>Sampling audio and graphics</b><br /><br />Maybe surprisingly, the worlds of audio and graphics can help us answer this question. Here's what a fraction of a second of music might look like when the pressure of the sound wave is plotted against time:<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-sRMmpxbJ0QY/UPM-92ABHQI/AAAAAAAAB74/7hnsWP6muPQ/s1600/waveform.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="182" src="http://2.bp.blogspot.com/-sRMmpxbJ0QY/UPM-92ABHQI/AAAAAAAAB74/7hnsWP6muPQ/s320/waveform.png" width="320" /></a></div><br />But if we sample this signal at regular intervals, eg. at 44.1KHz for a CD, then we can graph the resulting signal as something like this:<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-s0DrZ4EzU98/UPM-9xCZzCI/AAAAAAAAB70/c7sE9IFyH7I/s1600/sampled.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="182" src="http://3.bp.blogspot.com/-s0DrZ4EzU98/UPM-9xCZzCI/AAAAAAAAB70/c7sE9IFyH7I/s320/sampled.png" width="320" /></a></div><br />The red curve here is just to show what the original waveform looked like. The black vertical lines correspond to regular samples and we can represent them mathematically with Dirac delta functions multiplied by the amplitude measured at the sample.<br /><br />There is a well known problem with sampling like this. 
If you sample a signal that is a sine wave sin(ωt) at rate f then the signal sin((ω+2πnf)t) will generate exactly the same samples for any integer n. The following illustration shows what might happen:<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-hFSUIh_soLI/UPNRsZrICUI/AAAAAAAAB8M/cd5wUiz2Dzw/s1600/alias.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="94" src="http://3.bp.blogspot.com/-hFSUIh_soLI/UPNRsZrICUI/AAAAAAAAB8M/cd5wUiz2Dzw/s320/alias.png" width="320" /></a></div><br />The two waveforms are sampled at the same regular intervals (shown by vertical lines) and give exactly the same amplitudes at those samples.<br /><br />This forms the basis for the famous <a href="http://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem">Nyquist-Shannon sampling theorem</a>. You can reconstruct the original signal from regularly spaced samples only if it doesn't contain frequency components higher than half your sampling rate. Otherwise you get ambiguities in the form of high frequency parts of the signal masquerading as low frequency parts. This effect is known as <a href="http://en.wikipedia.org/wiki/Aliasing">aliasing</a>. As a result, the Fourier transform of a sampled function is periodic with the "repeats" corresponding to the aliasing.<br /><div class="separator" style="clear: both; text-align: center;"><br /></div>In the audio world you need to filter your sound to remove the high frequencies before you sample. This is frequently carried out with an analogue filter. In the 3D rendering world you need to do something similar. Ray tracers will send out many rays for each pixel, in effect forming a much higher resolution image than the resolution of the final result, and that high resolution image is filtered before being sampled down to the final resulting image. The "jaggies" you get from rendering polygons are an example of this phenomenon. 
It seems like jaggies have nothing to do with the world of Fourier transforms. But if you compute the Fourier transform of a polygonal image, remove suitable high frequency components, and then take the inverse Fourier transform before sampling you'll produce an image that's much more pleasing to the eye. In practice there are shortcuts to achieving much the same effect.<br /><br /><b>The connection to physics</b><br /><br />Now consider a particle whose wavefunction takes the form of the <a href="http://en.wikipedia.org/wiki/Dirac_comb">Dirac comb</a>:<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-VhZftFEIMuI/UPNS_C_zdqI/AAAAAAAAB8k/Z7hR12vVwUE/s1600/comb.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="85" src="http://2.bp.blogspot.com/-VhZftFEIMuI/UPNS_C_zdqI/AAAAAAAAB8k/Z7hR12vVwUE/s320/comb.png" width="320" /></a></div><br />This is a wavefunction that is concentrated at multiples of some quantity a, ie. ∑δ(x-an) summing over n = ...,-1,0,1,2,... If the wavefunction is ψ(x) then the probability density function for the particle position is |ψ(x)|². So the particle has a zero probability of being found at points other than those where x=na. In other words, modulo a, the particle position is given precisely.<br /><br />But what about the particle momentum? Well, the wavefunction has, in some sense, been sampled onto the points na, so we expect that whatever the momentum distribution is it'll be ambiguous modulo b where ab=2πℏ. In fact, if we take the Fourier transform of the Dirac comb we get <a href="http://en.wikipedia.org/wiki/Dirac_comb#Fourier_transform">another Dirac comb</a>. So in the frequency domain we get the same kind of phenomenon: the momentum is concentrated at integer multiples of b. So now we know we have a wavefunction whose uncertainty precisely fits the description I gave above. 
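<br /><br />The claim that the transform of a comb is another comb can be checked numerically on a finite, periodic stand-in. The sketch below is only a discrete analogue, not the Dirac comb itself: the signal length <code>N</code>, the spacing <code>A</code> and the helper <code>dft</code> are assumptions made for this example.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a list of numbers."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


N = 24  # number of sample points (one period of the discrete signal)
A = 4   # spacing of the spikes in the "position" domain

# A discrete comb: 1 at every multiple of A, 0 elsewhere.
comb = [1.0 if t % A == 0 else 0.0 for t in range(N)]

spectrum = dft(comb)

# Indices where the transform is not numerically zero.
peaks = [k for k, c in enumerate(spectrum) if abs(c) > 1e-9]
print(peaks)
```

The nonzero entries of the transform are themselves equally spaced, with spacing N/A, so the product of the two spacings is fixed at N. Making the spikes closer together in one domain spreads them further apart in the other.<br />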
We know the position precisely modulo a and the momentum precisely modulo b. In some sense this isn't contrived: we know the momentum modulo b precisely <i>because</i> of the aliasing that results from knowing the position modulo a.<br /><br /><b>What this means</b><br /><br />The message from this is that position-momentum uncertainty isn't fuzziness. At least it's not fuzziness in the ordinary sense of the word.<br /><br /><b>And in reality</b><br /><br />I'm not very experienced in attaching numbers to results from theoretical physics so I'd find it hard to say how accurately we can create a Dirac comb state in reality. When we measure a position using interferometry techniques we automatically compute the position modulo a wavelength so this isn't an unusual thing to do. Also an electron in a periodic potential may take on a form that consists of a train of equally spaced lumps. Even if not described exactly by a Dirac comb, we can still know the position modulo a and the momentum modulo b much more accurately than you might expect from a naive interpretation of the Heisenberg uncertainty principle as fuzziness.<br /><br /><b>Exercises</b><br />1. Investigate approximations to the Dirac comb: eg. what happens if we sum only a finite number of Dirac deltas, or replace each delta with a finite width Gaussian, or both.<br />2. 
Investigate the "twisted" Dirac comb: ∑δ(x-an)exp(inθ) where θ is some constant.</div><div></div>sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com9tag:blogger.com,1999:blog-11295132.post-36812113399542983282012-12-30T08:49:00.000-08:002012-12-30T12:21:47.001-08:00Shuffles, Bayes' theorem and continuations.<b>Introduction</b><br />Back in the 80s when I was a kid I came across a <a href="http://www.atomclub.nl/atom_plaatjes/boeken/pracprog/pp_card_trick.htm">program</a> for the BBC Micro that could tell what card you had picked from a deck of cards even though you'd buried your card within the deck wherever you wanted and had cut and shuffled the deck. I thought I'd try to implement a slightly more sophisticated version of the same trick that could handle multiple shuffles, and multiple types of shuffle. <br />The idea is that we prepare a deck of cards in some known sequence and have someone pick out a card and place it at the top of the deck. They perform some kind of randomisation procedure on the cards, eg. cut and shuffle it a couple of times, and then you get to look at the final sequence of cards. Can we tell which card was picked out? <br /><b>Some probability theory</b><br />Let's formalise this a bit. Our decks will have <img src="https://chart.googleapis.com/chart?cht=tx&chl=N" style="vertical-align: middle;" /> cards. There is a small number of initial states our deck can be in, corresponding to the known sequence with a single card moved to the top. Let's label these initial states <img src="https://chart.googleapis.com/chart?cht=tx&chl=I_i" style="vertical-align: middle;" />. There is a (usually large) number of permutations that could be applied through shuffling. We'll label these <img src="https://chart.googleapis.com/chart?cht=tx&chl=T_j" style="vertical-align: middle;" />. We'll try to arrange that this isn't simply the set of all permutations (though it's not necessarily a disaster if it is). 
<br />We want to figure out the initial state given some final state <img src="https://chart.googleapis.com/chart?cht=tx&chl=F" style="vertical-align: middle;" />. In other words we want <br /><blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28I_i%7CF%29" style="vertical-align: middle;" /></blockquote>We can use Bayes' theorem to get: <br /><blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28I_i%7CF%29%20%3D%20%7BP%28F%7CI_i%29P%28I_i%29%20%5Cover%20%5Csum_k%20P%28F%7CI_k%29P%28I_k%29%7D" style="vertical-align: middle;" /></blockquote>Now <img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28F%7CI_i%29" style="vertical-align: middle;" /> is the sum over all ways of starting with <img src="https://chart.googleapis.com/chart?cht=tx&chl=I_i" style="vertical-align: middle;" /> and ending up with <img src="https://chart.googleapis.com/chart?cht=tx&chl=F" style="vertical-align: middle;" />. So, up to the normalising denominator, which is the same for every initial state, <br /><blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28I_i%7CF%29%20%5Cpropto%20%5Csum_%7Bj%7DP%28T_j%29P%28I_i%29" style="vertical-align: middle;" /></blockquote>where the sum is over all <img src="https://chart.googleapis.com/chart?cht=tx&chl=j" style="vertical-align: middle;" /> such that <img src="https://chart.googleapis.com/chart?cht=tx&chl=T_j%28I_i%29%3DF" style="vertical-align: middle;" />. I'm assuming that the shuffles are independent of the initial sequence of cards. This gives us an algorithm. We do a brute force simulation of every possible shuffle that we're considering applied to each possible initial state. After each shuffle we sum the corresponding probability for those shuffles that give our known final state <img src="https://chart.googleapis.com/chart?cht=tx&chl=F" style="vertical-align: middle;" />. 
<br />Each shuffle <img src="https://chart.googleapis.com/chart?cht=tx&chl=T_j" style="vertical-align: middle;" /> is going to be built up as the product of a sequence of building blocks with each block randomly selected based on what happened before. Let's give the blocks names like <img src="https://chart.googleapis.com/chart?cht=tx&chl=B_k" style="vertical-align: middle;" />. So if <img src="https://chart.googleapis.com/chart?cht=tx&chl=T_j%3DB_3B_2B_1" style="vertical-align: middle;" /> then <img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28T_j%29%3DP%28B_1%29P%28B_2%7CB_1%29P%28B_3%7CB_1%2CB_2%29" style="vertical-align: middle;" />. As we work through the shuffle we will accumulate the probability. After the first block we have a probability of <img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28B_1%29" style="vertical-align: middle;" />. The probability after the second is <img src="https://chart.googleapis.com/chart?cht=tx&chl=P%28B_1%29P%28B_2%7CB_1%29" style="vertical-align: middle;" /> and so on. At any point we'll call the probability accumulated so far the <i>importance</i>. I've borrowed that name from the <a href="http://www.seanet.com/~myandper/importance.htm">world of rendering</a> because this algorithm has a remarkable similarity to recursive ray-tracing. <br /><b>Some computer science</b><br />I'd like to be able to chain a sequence of shuffles. But wait! There's a catch! Today's the day I finally want to get around to checking out the lambda expression support in C++. I've been putting this off for years. (I'm using gcc 4.7.) So I'm not going to have a Haskell non-determinism monad to make life easy. <br />Suppose I have two types of shuffle, type <img src="https://chart.googleapis.com/chart?cht=tx&chl=A" style="vertical-align: middle;" /> and type <img src="https://chart.googleapis.com/chart?cht=tx&chl=B" style="vertical-align: middle;" />. 
I could easily write a loop to iterate over all shuffles of type <img src="https://chart.googleapis.com/chart?cht=tx&chl=A" style="vertical-align: middle;" />, and in the innermost part of the loop I could call another loop over all shuffles of type <img src="https://chart.googleapis.com/chart?cht=tx&chl=B" style="vertical-align: middle;" />. But then if I want to replace <img src="https://chart.googleapis.com/chart?cht=tx&chl=B" style="vertical-align: middle;" /> with <img src="https://chart.googleapis.com/chart?cht=tx&chl=C" style="vertical-align: middle;" /> I have to change the code to replace the inner part with code for <img src="https://chart.googleapis.com/chart?cht=tx&chl=C" style="vertical-align: middle;" />. That's no good. I'd like to be able to replace the innermost part of the outer loop with any code I want without actually editing that part of the code. It's easy with lambda expressions. I write the type <img src="https://chart.googleapis.com/chart?cht=tx&chl=A" style="vertical-align: middle;" /> loop code so that it takes as argument a lambda function representing what I want done inside the loop. <br />There's another way of looking at this. You can skip this paragraph if you don't care about the connection to Haskell. But in Haskell you might do something like this by using a non-determinism monad, or even a probability monad. But as I pointed out a while back, you can fake every monad using the <a href="http://blog.sigfpe.com/2008/12/mother-of-all-monads.html">continuation monad</a>. One way to implement continuations in C++ is to use continuation passing style. And that's what I'll do. The continuations are just the lambdas that I mentioned in the previous paragraph. 
<br /><b>Some C++ code</b><br /><br /><br /><pre>> #include <iostream><br />> #include <cstdlib><br />> using namespace std;<br /></pre>You can bump up the deck size if you have the CPU power: <br /><pre>> const int deck_size = 13;<br /></pre>A deck of cards is represented by a simple array of integers with each card being assigned a unique integer. <br /><pre>> struct Deck {<br />> int card[deck_size];<br />> bool operator==(const Deck &other) const {<br />> for (int i = 0; i < deck_size; ++i) {<br />> if (card[i] != other.card[i]) {<br />> return false;<br />> }<br />> }<br />> return true;<br />> }<br />> };<br /></pre>The riffle shuffle works by splitting a deck into two piles and interleaving the parts onto a new destination deck. Here's a schematic diagram with the two piles coloured orange and blue: <a href="http://1.bp.blogspot.com/-0l2lmJO0xwc/UN_dyQZTuzI/AAAAAAAAB2k/wHjAx43U3Q4/s1600/riffle.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-0l2lmJO0xwc/UN_dyQZTuzI/AAAAAAAAB2k/wHjAx43U3Q4/s1600/riffle.png" /></a> <br />The function <tt>riffle_helper</tt> helps loop through all possible riffles. I could assume that each card arriving at the destination is equally likely to come from the left pile or the right pile. But I observe that whenever I do a real riffle shuffle the cards seem to come in 'runs'. So if a card falls from the left pile then the next one is more likely to as well. That's just an empirical observation based on a small number of trials; you can tweak the probabilities yourself to fit reality better. (Oh, and I got this code upside-down compared to what people really do. I need to fix it when I have a moment...) 
<br /><br /><pre>> enum Side {<br />> LEFT,<br />> RIGHT,<br />> NO_SIDE<br />> };<br /></pre>This function shuffles together cards from the locations given by <tt>left_ptr</tt> and <tt>right_ptr</tt> in <tt>src_deck</tt> into <tt>dest_deck</tt>, eventually calling <tt>cont</tt> on each result. I use a template because I don't know the type of the lambda expression I'm passing in. (If I want to know its type I think I have to mess with <tt>decltype</tt>. It's all a bit weird.) <br /><pre>> template<class Cont><br />> void riffle_helper(double importance, int split,<br />> int left_ptr, int right_ptr, int dest_ptr, Side oldside,<br />> const Deck &src_deck, Deck dest_deck, Cont cont) {<br />> if (dest_ptr == deck_size) {<br />> cont(importance, dest_deck);<br />> return;<br />> }<br /></pre>First I deal with the cases where one or other of the piles is empty so there's no choice about where the next card is coming from: <br /><pre>> if (left_ptr >= split) {<br />> dest_deck.card[dest_ptr] = src_deck.card[right_ptr];<br />> riffle_helper(importance, split, left_ptr, right_ptr+1, dest_ptr+1, RIGHT, src_deck, dest_deck, cont);<br />> return;<br />> }<br />> if (right_ptr >= deck_size) {<br />> dest_deck.card[dest_ptr] = src_deck.card[left_ptr];<br />> riffle_helper(importance, split, left_ptr+1, right_ptr, dest_ptr+1, LEFT, src_deck, dest_deck, cont);<br />> return;<br />> }<br />> double p;<br />> if (oldside == NO_SIDE) {<br />> p = 0.5;<br />> } else {<br />> p = LEFT == oldside ? 0.75 : 0.25;<br />> }<br />> double new_importance = importance*p;<br />> dest_deck.card[dest_ptr] = src_deck.card[left_ptr];<br />> riffle_helper(new_importance, split, left_ptr+1, right_ptr, dest_ptr+1, LEFT, src_deck, dest_deck, cont);<br /><br /><br />> if (oldside == NO_SIDE) {<br />> p = 0.5;<br />> } else {<br />> p = RIGHT == oldside ? 
0.75 : 0.25;<br />> }<br />> new_importance = importance*p;<br />> dest_deck.card[dest_ptr] = src_deck.card[right_ptr];<br />> riffle_helper(new_importance, split, left_ptr, right_ptr+1, dest_ptr+1, RIGHT, src_deck, dest_deck, cont);<br />> }<br /><br /><br /></pre>The function <tt>riffle</tt> iterates over all possible riffle shuffles of <tt>src_deck</tt> calling <tt>cont</tt> on each one. Note that I assume that when the deck is split into two before shuffling together, each pile has at least 3 cards. You may want to change that assumption. <br /><pre>> template<class Cont><br />> void riffle(double importance, const Deck &src_deck, Cont cont) {<br />> double new_importance = importance/(deck_size-5);<br />> for (int split = 3; split < deck_size-2; ++split) {<br />> riffle_helper(new_importance, split, 0, split, 0, NO_SIDE, src_deck, Deck(), cont);<br />> }<br />> }<br /></pre>Iterate over all possible cuts of <tt>src_deck</tt> calling <tt>cont</tt> on each result. I assume the cut leaves at least 3 cards in each pile. <br /><pre>> template<class Cont><br />> void cut(double importance, const Deck &src_deck, Cont cont) {<br />> double new_importance = importance/(deck_size-5);<br />> for (int split = 3; split < deck_size-2; ++split) {<br />> Deck new_deck;<br />> for (int i = 0; i < deck_size; ++i) {<br />> if (i < deck_size-split) {<br />> new_deck.card[i] = src_deck.card[i+split];<br />> } else {<br />> new_deck.card[i] = src_deck.card[i-(deck_size-split)];<br />> }<br />> }<br />> cont(new_importance, new_deck);<br />> }<br />> }<br /></pre>Overhand shuffle remaining cards in <tt>src_deck</tt> to <tt>dest_deck</tt>. Here's an attempt to represent what an overhand shuffle does. It reverses the order of a deck that has been split into segments. The order within each segment is left unchanged. 
<a href="http://1.bp.blogspot.com/-WDBfySFVWq0/UN-wow18WmI/AAAAAAAAB2U/0XCiy6OadgE/s1600/shuffle.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="179" src="http://1.bp.blogspot.com/-WDBfySFVWq0/UN-wow18WmI/AAAAAAAAB2U/0XCiy6OadgE/s400/shuffle.png" width="400" /></a> <br /><br /><pre>> template<class Cont><br />> void overhand_helper(double importance, const Deck &src_deck,<br />> int cards_left, Deck dest_deck, Cont cont) {<br />> if (cards_left <= 0) {<br />> cont(importance, dest_deck);<br />> } else {<br />> double new_importance = importance/cards_left;<br />> for (int ncards = 1; ncards <= cards_left; ++ncards) {<br />> //<br />> // Take ncards cards from the source and place them at the bottom of the<br />> // destination.<br />> //<br />> for (int j = 0; j < ncards; ++j) {<br />> dest_deck.card[cards_left-ncards+j] = src_deck.card[deck_size-cards_left+j];<br />> }<br />> overhand_helper(new_importance, src_deck, cards_left-ncards, dest_deck, cont);<br />> }<br />> }<br />> }<br /></pre>Iterate over all possible overhand shuffles of cards in <tt>src_deck</tt> calling <tt>cont</tt> on each result. In practice I often find overhand shuffles result in cards mysteriously jumping segments and messing up the algorithm, whereas poorly executed riffle shuffles still work fine. I'm also assuming that each time a pile of cards is transferred the size of the pile is chosen uniformly from the set of all possible segments at that stage. <br /><pre>> template<class Cont><br />> void overhand(double importance, const Deck &src_deck, Cont cont) {<br />> overhand_helper(importance, src_deck, deck_size, Deck(), cont);<br />> }<br /></pre>The final code doesn't bother computing the denominator from Bayes' theorem. The most likely initial state is given by the one that results in the highest score. If you normalise the scores to sum to one you'll get actual probabilities. 
<br /><pre>> int main() {<br /></pre>This is the array representation of the cards in the following picture: <br /><pre>> Deck target = {{ 10, 11, 6, 12, 1, 13, 8, 2, 9, 3, 5, 4, 7 }};<br /></pre><a href="http://1.bp.blogspot.com/-1uB8ofGbcn4/UN-no9o8zdI/AAAAAAAAB2A/M3hjS6ya0eg/s1600/photo.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://1.bp.blogspot.com/-1uB8ofGbcn4/UN-no9o8zdI/AAAAAAAAB2A/M3hjS6ya0eg/s400/photo.JPG" width="400" /></a> <br /><pre>> Deck deck;<br /></pre>Our known starting sequence is just 1, 2, 3, ..., J, Q, K. We iterate over all ways to pick a card out from this sequence and place it at the top. <br /><pre>> for (int k = 0; k < deck_size; ++k) {<br />> deck.card[0] = k+1;<br />> for (int i = 1; i < deck_size; ++i) {<br />> deck.card[i] = (i > k ? i : i-1)+1;<br />> }<br />> double likelihood = 0.0;<br /></pre>Here is where I use the lambdas. For this example I'm doing an overhand shuffle followed by a riffle shuffle. (The syntax is pretty bizarre and it's also weird that I kinda sorta specify the type of my lambda but that's not really what the type of the expression is. But having manually faked and lifted lambdas many times in C++ I can see why it's the way it is.) Note how I've made <tt>likelihood</tt> mutable and have given these lambda expressions write access to it. 
<br /><pre>> overhand(1.0, deck, [&likelihood, target](double importance, Deck &deck) -> void {<br />> riffle(importance, deck, [&likelihood, target](double importance, Deck &deck) -> void {<br />> if (deck == target) {<br /></pre>We sum the probabilities for all ways of generating the target deck: <br /><pre>> likelihood += importance;<br />> }<br />> }); });<br />> cout << "If top card = " << deck.card[0] << endl;<br />> cout << "then unnormalised probability = " << likelihood << endl;<br />> cout << endl;<br />> }<br /><br /><br />> }<br /><br /><br /></pre>Run the above code and you get unnormalised probabilities <pre>If top card = 4<br />then unnormalised probability = 5.7568e-12<br />If top card = 6<br />then unnormalised probability = 5.37301e-11<br />If top card = 7<br />then unnormalised probability = 1.791e-11<br /></pre>In fact, I had chosen 6. <b>Some discussion</b><br />Don't expect it to work perfectly! It can only give probabilities but it's often surprisingly good. But there is a lot of room for improvement. Some work looking at how people actually shuffle could give a better probabilistic model. <br /><b>Some exercises.</b><br />1. The code can be made orders of magnitude faster. The final shuffle is performed and then the result is compared to the target sequence. But you can start comparing cards with the target before the shuffle is finished. Most times you'll only need to look at the first card of the result of a shuffle before you know you haven't matched the target. Fixing this will give a big speed up. <br />2. The continuation passing style makes it easy to incorporate other sources of knowledge. For example if you 'accidentally' peek at the bottom card after the first shuffle you can incorporate that knowledge into the algorithm. Figure out how. <br />3. Write lots more kinds of shuffles and experiment. 
I'm hoping someone good with magic will come up with a sequence of operations that looks hopelessly random but allows a good probability of recovering the chosen card. You could also combine this with other techniques such as designing shuffles that maintain various kinds of invariant. <br />4. The code can be rewritten to work backwards from the final state to the initial states. Work out how to do this. (This parallels ray-tracing where you can work from eye to light or from light to eye.) <br />5. We're doing the same work over and over again. We don't need to compute all of the shuffles for each initial state. We can compute each shuffle once and reuse it on each initial state. Try implementing it.<br /><br />sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com3tag:blogger.com,1999:blog-11295132.post-58597078005838938992012-11-18T16:10:00.003-08:002012-11-18T17:20:48.383-08:00A pictorial proof of the hairy ball theorem<div class="separator" style="clear: both; text-align: left;">The hairy-ball theorem says that there is no continuous non-zero vector field on the surface of a sphere. There are lots of popular accounts that tell you what this means, giving great examples. 
Here's a YouTube video for example:</div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/B4UGZEjG02s?feature=player_embedded' frameborder='0' /></div><br /><br />My goal is to show why it's always true.<br /><br />A simply connected domain in the plane is one with the property that any loop in it can be shrunk down to a point. Here's an example of a domain D with an example loop L being shrunk down to a point P:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-9mg_veDP_28/UKlWcYqBqII/AAAAAAAABog/n8no7sctL0M/s1600/simple.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="187" src="http://4.bp.blogspot.com/-9mg_veDP_28/UKlWcYqBqII/AAAAAAAABog/n8no7sctL0M/s320/simple.png" width="320" /></a></div>Here's an example of a domain that's not simply connected. It has a hole in the middle. I've drawn a loop L around the hole. 
You can't shrink that loop to a point because the hole gets in the way:<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-4OxymO5spyo/UKlWzaq4xoI/AAAAAAAABoo/cZzX-fvEKus/s1600/Hole.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="192" src="http://4.bp.blogspot.com/-4OxymO5spyo/UKlWzaq4xoI/AAAAAAAABoo/cZzX-fvEKus/s320/Hole.png" width="320" /></a></div>Here's a simply connected domain with a vector field on it:<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-0Q_WOphBHaY/UKlXVYgxuVI/AAAAAAAABow/lk1z8ig_Oqw/s1600/vectors.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="192" src="http://4.bp.blogspot.com/-0Q_WOphBHaY/UKlXVYgxuVI/AAAAAAAABow/lk1z8ig_Oqw/s320/vectors.png" width="320" /></a></div>Think of the vectors as being drawn literally in the surface so that if we were to pick up the surface and stretch it like a piece of rubber the vectors would get stretched with it. Remember that a vector field is defined everywhere in the domain so the arrows are just a random sprinkling of examples to show what's going on. For this to be an accurate picture you want to imagine an infinity of arrows, one at every single point of the domain.<br /><br />Let's put a loop, starting and ending at P, in our simply-connected domain:<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-qpBTzqOxnic/UKlX3TG89yI/AAAAAAAABo4/l-ppnCXrh_g/s1600/loop.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="192" src="http://4.bp.blogspot.com/-qpBTzqOxnic/UKlX3TG89yI/AAAAAAAABo4/l-ppnCXrh_g/s320/loop.png" width="320" /></a></div>Now imagine travelling along the loop, starting at P and ending at P. As you move along there's an arrow at each point in your journey. 
Here's what the arrows look like as you travel from P to P anti-clockwise, plotted as a kind of graph:<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-6XAYbR1UPwk/UKlYjzLekRI/AAAAAAAABpA/fJVU0IJt8nI/s1600/graph.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="65" src="http://3.bp.blogspot.com/-6XAYbR1UPwk/UKlYjzLekRI/AAAAAAAABpA/fJVU0IJt8nI/s320/graph.png" width="320" /></a></div>The vectors start off pointing to the right. They swing anti-clockwise by about 45º and then swing back to where they started. As the journey is a loop they clearly must end where they started. A different, really swirly vector field, might have resulted in arrows that rotated around hundreds of times along your journey. But by the time you reach the end of the journey they must swing back to where they started. What's slightly less obvious is that they'd also have to rotate back to cancel out the hundreds of swings. You might think "the vectors could rotate round a hundred times but as long as they make exactly 100 turns they'll return to where they started and there's no need for them to unwind". But actually, every bit of rotation in the journey must be unwound. The total amount of rotation, adding all the positive rotations, and subtracting off the negative rotations, is called the winding number for the loop. We count anti-clockwise rotation as positive and clockwise as negative. So I'm claiming that the winding number for a closed loop in a simply-connected domain is always zero.<br /><br />(Note: in most books the winding number normally refers to how many times the loop itself winds around a point. I'm using it to refer to how many times the vector winds around itself as you follow the loop. To help with your intuition: the hour hand of a working clock normally accumulates a winding number of -2 in one day. 
If it ran forward for a day, but then ran backwards for half a day, the winding number would be -1.)<br /><br />Here's why the winding number for simply connected domains must be zero: firstly - it's pretty clear that the winding number for any loop must be an integer. If the winding number was a half, say, the arrow would end up pointing 180º from where it started, which makes no sense for a closed loop. Now the domain is simply connected, so the loop can be shrunk to a point. Now imagine doing the shrinking really slowly and keeping track of the winding number as the loop shrinks. As the loop shrinks, the graph of the vectors along the loop must vary slowly. The total winding number depends continuously on the vectors in the graph so the winding number must vary slowly as the loop shrinks. But the winding number is an integer. It can't change really slowly, it can only change by amounts of a whole integer. So the winding number can't change at all. Every loop in a simply-connected domain must have a winding number that's the same as the winding number of a loop that is just one point ie. zero.<br /><br />On to the sphere. Here's a sphere with a vector field where all of the vectors point along lines of longitude to the north pole:<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-uIbgVRNRkgs/UKlbgHPxQyI/AAAAAAAABpQ/CIzzbfBlg98/s1600/north.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://1.bp.blogspot.com/-uIbgVRNRkgs/UKlbgHPxQyI/AAAAAAAABpQ/CIzzbfBlg98/s320/north.png" width="199" /></a></div>(Sorry about my poor quality drawing but I'm sure you know what vectors pointing north look like.)<br /><br />At this point you may be tempted to say "aha! That's a continuous vector field on the sphere that's non-zero everywhere!" Alas, it's not defined everywhere. It's a vector field everywhere except at the north and south poles. 
If you're at the north pole, no non-zero vector can point north. And at the south pole every non-zero vector points north with no continuous way to pick just one.<br /><br />Given any vector field on the Earth we can imagine slicing the earth through the equator and flattening out the surfaces of the northern and southern hemispheres as two separate disks. Here's what you get if you do this with the north vector field (ignoring the problems at the poles for now):<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-T5sO_pDr2lU/UKlmO9fcXLI/AAAAAAAABqQ/F10iEtAUKt4/s1600/maps.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="278" src="http://4.bp.blogspot.com/-T5sO_pDr2lU/UKlmO9fcXLI/AAAAAAAABqQ/F10iEtAUKt4/s320/maps.png" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>You reconstruct the Earth again by gluing the two disks together according to the orange arrows, and then inflating. Any vector field on the surface of the Earth gives rise to a pair of vector fields on disks like this. But there will be a constraint. The vectors around the boundary of the two disks will match. In fact, vectors at the opposite ends of the orange arrow have to match. But they won't necessarily be equal as drawn in this diagram because the disk for the southern hemisphere corresponds to a view from below.<br /><br />Suppose we start at the point P and follow a loop eastwards along the equator. That's an anti-clockwise loop round the upper disk and simultaneously a clockwise loop round the lower disk. 
Here are the graphs:<br /><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-EpYQy9WyaUI/UKlmtFQd-nI/AAAAAAAABqY/y1gssmh5DSM/s1600/equator2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="302" src="http://1.bp.blogspot.com/-EpYQy9WyaUI/UKlmtFQd-nI/AAAAAAAABqY/y1gssmh5DSM/s320/equator2.png" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div>In the upper map the loop gives rise to winding number one. But in the lower map we get winding number minus one. So here's an important lesson: the winding number makes perfect sense for a flat domain in the plane. But on the surface of 3D objects it depends on how you flatten out your map. In this case, the winding number on the upper map is 2 more than the winding number for the lower map. (Remember, these fields aren't defined at the poles so we haven't contradicted the original theorem that the winding number is zero for any vector field defined in a simply-connected domain.)<br /><br />But here's the most important thing in this proof: the winding number for the upper hemisphere loop will be two more than the winding number for the lower hemisphere loop, no matter what vector field you have. This is because if you've travelled an angle θ around the equator, the vectors at opposite ends of the orange arrows will differ by an angle of 2θ. For example, once you're 90º around the earth, the north arrow is drawn as a down-arrow in the upper graph and as an up-arrow in the lower graph. They're already 180º apart. You can see this is true for north pointing vectors literally by tracing with your fingers around the loops. It's also true for vectors pointing east. 
I'll leave that as an exercise for you, but here's a picture of some eastward vectors to get you started:<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-iJfy5KqLQQI/UKlgWt-tajI/AAAAAAAABpw/HaupyQIHOvE/s1600/east.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://3.bp.blogspot.com/-iJfy5KqLQQI/UKlgWt-tajI/AAAAAAAABpw/HaupyQIHOvE/s320/east.png" width="199" /></a></div>Along the equator, every vector on the surface is a linear combination of north and east vectors. So if it's true for both the north and east vectors then it must be true for all vectors. But if the graph for one picture of the equatorial loop has vectors that are 2θ more than the vectors for another graph, the first one must complete two revolutions more than the second one. So the first has a winding number two more than the second.<br /><br />If you had a continuous vector field that really was non-zero over the entire sphere, cutting the sphere in half would give a pair of continuous vector fields defined on disks. As disks are simply-connected, the theorem we started with tells us they must both have winding number zero as you loop around them. But we've also just shown that looping round one has winding number two more than looping around the other. This is a contradiction. So there is no continuous vector field that is non-zero everywhere. ∎<br /><br />If you get stuck above I strongly recommend trying to draw some continuous non-zero vector fields on the sphere, transferring them to disks, and counting winding numbers.<br /><br />Notice how we've done more than prove the theorem. We now know that if we have a continuous vector field on a sphere we can find out whether to look for its zeros in the northern or southern hemisphere by computing the winding numbers as above. 
At least one of the two winding numbers must be non-zero and that tells us which hemisphere we can be sure contains a zero. The fact that the two winding numbers differ by two, and not by just one, also tells us a bit about the nature of the zeros. But that's another story. That two is also related to the fact that the <a href="http://en.wikipedia.org/wiki/Euler_characteristic">Euler characteristic</a> of the sphere is two. It's also related to the Lefschetz index.<br /><br />This proof is based on proofs I studied years ago relating to <a href="http://en.wikipedia.org/wiki/Chern_class">Chern classes</a>. I recently became interested in Chern classes again because they play an important role in understanding phenomena in solid state physics such as the <a href="http://en.wikipedia.org/wiki/Quantum_Hall_effect">quantum Hall effect</a>. That argument about slowly shrinking a loop leaving its winding number unchanged tells you a lot about <a href="http://en.wikipedia.org/wiki/Adiabatic_theorem">slowly changing</a> certain types of quantum system.<br /><br />It's possible I completely messed up. <a href="http://topologicalmusings.wordpress.com/2008/07/22/analyzing-the-hairy-ball-theorem/">Here</a>'s an "elementary" proof. It looks much harder than what I did. But I feel like I did faithfully capture, in pictures, an argument that's buried in <a href="http://www.amazon.com/Lectures-Riemann-surfaces-Princeton-mathematical/dp/B0007FHJES">Lectures on Riemann surfaces</a>. 
And it seems to correctly reproduce the <a href="http://en.wikipedia.org/wiki/Hairy_ball_theorem#Lefschetz_connection">Lefschetz number</a> of 2.<br /><br /><h2 style="text-align: left;">Generalised entropy</h2><div>Posted by sigfpe, 2012-04-07</div><b>Introduction</b><br />The <a href="http://en.wikipedia.org/wiki/Entropy">entropy</a> of a probability distribution can be seen as a measure of its uncertainty or a measure of the diversity of samples taken from it. Over the years I've talked lots about how <a href="http://blog.sigfpe.com/2007/03/monads-vector-spaces-and-quantum.html">probability theory</a> gives rise to a monad. This suggests that the notion of entropy might be generalised to monads other than probability. So here goes...<br /><br /><pre>> {-# LANGUAGE MultiParamTypeClasses, FlexibleInstances, GeneralizedNewtypeDeriving #-}<br />> {-# LANGUAGE FunctionalDependencies, TypeSynonymInstances #-}<br /><br />> import Control.Monad<br />> import Control.Monad.Writer hiding (lift)<br /><br /><br /></pre><b>Shannon entropy</b><br />I've talked in the past about how there is some trickiness with defining the probability monad in Haskell because a good implementation requires use of the <tt>Eq</tt> typeclass, and hence restricted monads. Restricted monads are possible through a bunch of methods, but this time I don't want them.<br /><br />It's common to represent probability distributions on finite sets as lists of pairs where each pair <tt>(x, p)</tt> means <tt>x</tt> has probability <tt>p</tt>. But I'm going to allow lists without the restriction that each <tt>x</tt> appears once and make my code work with these generalised distributions.
When I compute the entropy, say, it will only be the usual entropy in the case that each <tt>x</tt> in the list is unique.<br /><br />So here's our type and some instances for it:<br /><br /><pre>> data P a = P [(a, Float)] deriving Show<br /><br />> instance Functor P where<br />>     fmap f (P xs) = P [(f a, p) | (a, p) <- xs]<br /><br />> instance Monad P where<br />>     return x = P [(x, 1)]<br />>     P xss >>= f = P [(y, p*q) | (pxs, p) <- xss, let P ys = f pxs, (y, q) <- ys]<br /><br /></pre>We can easily compute the expected value of a distribution, and its entropy, like this:<br /><br /><pre>> expectation0 (P xs) = sum [x*p | (x, p) <- xs]<br />> entropy0 (P xs) = -sum [if p==0 then 0 else p*log p/log 2.0 | (_, p) <- xs]<br /><br /></pre>An important property of entropy is known as the grouping property which can be illustrated through an example tree like this:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-67NuhgTvHY4/T4BeuNAHpsI/AAAAAAAABBA/D4tg6ATA9CQ/s1600/tree.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="http://4.bp.blogspot.com/-67NuhgTvHY4/T4BeuNAHpsI/AAAAAAAABBA/D4tg6ATA9CQ/s320/tree.png" width="480" /></a></div><br />The entropy for the probability distribution of the final leaves is the sum of two components: (1) the entropy of the branch at the root of the tree and (2) the expected entropy of the subtrees. Here's some corresponding code.
First simple Bernoulli trials:<br /><br /><pre>> bernoulli p a b = P [(a, p), (b, 1-p)]<br /><br /></pre>Now the branch at the root of the tree:<br /><br /><pre>> root = bernoulli 0.3 False True<br /><br /></pre>We can compute the entropy for the distribution on the leaves:<br /><br /><pre>> test1 = entropy0 $ do<br />>     x <- root<br />>     if x<br />>         then bernoulli 0.2 3 4<br />>         else bernoulli 0.4 5 6<br /><br /></pre>Or the sum of the root entropy and the expected subtree entropy:<br /><br /><pre>> test2 = entropy0 root + (expectation0 $ do<br />>     x <- root<br />>     if x<br />>         then return $ entropy0 (bernoulli 0.2 3 4)<br />>         else return $ entropy0 (bernoulli 0.4 5 6))<br /><br /></pre>You can confirm for yourself that <tt>test1 == test2</tt> (up to floating-point rounding).<br /><br />We can rewrite that a little. We're drawing <tt>True</tt> or <tt>False</tt> from <tt>root</tt> only to decide which distribution to use at the next stage. But we may as well pick the distribution itself at random. So define:<br /><br /><pre>> dist = bernoulli 0.3 (bernoulli 0.4 5 6) (bernoulli 0.2 3 4)<br /><br /></pre>And now we expect the equality of <tt>test3</tt> and <tt>test4</tt>:<br /><br /><pre>> test3 = entropy0 $ do<br />>     x <- dist<br />>     x<br /><br />> test4 = entropy0 dist + (expectation0 $ do<br />>     x <- dist<br />>     return $ entropy0 x)<br /><br /></pre>There's a more elegant way of writing this. Define:<br /><br /><pre>> left0 dist = entropy0 (join dist)<br />> right0 dist = entropy0 dist+expectation0 (fmap entropy0 dist)<br /><br /></pre>Now we expect <tt>left0 dist</tt> and <tt>right0 dist</tt> to always be equal. We've almost generalised to something that makes sense in the context of monads other than probability.<br /><br /><b>The algebra of a monad</b><br />Here are a couple of important properties of <tt>expectation0</tt>:<br /><br />1. <tt>expectation0 (return d) = d</tt><br />2. 
<tt>expectation0 (join d) = expectation0 (fmap expectation0 d)</tt><br /><br />In English: the expectation of certainty is just the certain value, and the expectation of an expectation is just the expectation. But these rules are precisely the conditions that define an <img src="https://chart.googleapis.com/chart?cht=tx&chl=M" style="vertical-align:middle">-<a href="http://en.wikipedia.org/wiki/Monad_%28category_theory%29#Algebras_for_a_monad">algebra</a>, where <img src="https://chart.googleapis.com/chart?cht=tx&chl=M" style="vertical-align:middle"> is a monad.<br /><br />So let's define a type class:<br /><br /><pre>> class Algebra m a | m -> a where<br />>     expectation :: m a -> a<br /><br /></pre>We'll assume that when <tt>m</tt> is a monad, any instance satisfies the two laws above. Here's the instance for probability:<br /><br /><pre>> instance Algebra P Float where<br />>     expectation (P xs) = sum [x*p | (x, p) <- xs]<br /><br /></pre>In keeping with the notion that entropy measures diversity, let's also define:<br /><br /><pre>> class Diverse m r | m -> r where<br />>     entropy :: m x -> r<br /><br /></pre>with the instance:<br /><br /><pre>> instance Diverse P Float where<br />>     entropy (P xs) = -sum [if p==0 then 0 else p*log p/log 2.0 | (_, p) <- xs]<br /><br /></pre>It's not clear what laws we need but for now we'll assume a generalised entropy satisfies <tt>left dist == right dist</tt>:<br /><br /><pre>> left dist = entropy (join dist)<br />> right dist = entropy dist+expectation (fmap entropy dist)<br /><br /></pre>We'll call that the generalised grouping law.<br /><br /><b>Binary trees</b><br />It's not hard to find other structures that satisfy these laws if we cheat and use alternative structures to represent probabilities.
For example, we can make <tt>Tree</tt> an instance by assuming <tt>Fork</tt> represents a 50/50 chance of going one way or another:<br /><br /><pre>> data Tree a = Leaf a | Fork (Tree a) (Tree a) deriving Show<br /><br />> instance Functor Tree where<br />>     fmap f (Leaf a) = Leaf (f a)<br />>     fmap f (Fork l r) = Fork (fmap f l) (fmap f r)<br /><br />> instance Monad Tree where<br />>     return x = Leaf x<br />>     Leaf a >>= f = f a<br />>     Fork l r >>= f = Fork (l >>= f) (r >>= f)<br /><br />> instance Algebra Tree Float where<br />>     expectation (Leaf a) = a<br />>     expectation (Fork l r) = 0.5*expectation l+0.5*expectation r<br /><br />> instance Diverse Tree Float where<br />>     entropy (Leaf a) = 0<br />>     entropy (Fork l r) = 1+0.5*entropy l+0.5*entropy r<br /><br /></pre><b>Lists</b><br />We could make non-empty lists into an instance by assuming a uniform distribution on the list. But another way to measure the diversity is simply to count the elements. We subtract one so that <tt>[x]</tt> corresponds to diversity zero. This subtraction gives us a non-trivial instance:<br /><br /><pre>> newtype L a = L [a] deriving (Show, Monad, Functor)<br /><br />> instance Algebra L Int where<br />>     expectation (L xs) = sum xs<br /><br />> instance Diverse L Int where<br />>     entropy (L xs) = length xs-1<br /><br /></pre><b>Tsallis entropy</b><br />There are measures of diversity for probability distributions that are distinct from Shannon entropy. An example is <a href="http://en.wikipedia.org/wiki/Tsallis_entropy">Tsallis entropy</a>. At this point I'd like a family of types parametrised by reals but Haskell doesn't support dependent types.
So I'll just fix a real number <tt>q</tt> and we can define:<br /><br /><pre>> q = 2.5<br /><br />> data T a = T [(a, Float)] deriving Show<br /><br />> instance Functor T where<br />>     fmap f (T xs) = T [(f a, p) | (a, p) <- xs]<br /><br />> instance Monad T where<br />>     return x = T [(x, 1)]<br />>     T xss >>= f = T [(y, p*q) | (pxs, p) <- xss, let T ys = f pxs, (y, q) <- ys]<br /><br />> instance Algebra T Float where<br />>     expectation (T xs) = sum [x*p**q | (x, p) <- xs]<br /><br />> instance Diverse T Float where<br />>     entropy (T xs) = (1-sum [p**q | (_, p) <- xs])/(q-1)<br /><br /></pre>And again we find our generalised grouping rule for entropy holds.<br /><br /><b>Operads</b><br />This is all derived from Tom Leinster's <a href="http://golem.ph.utexas.edu/category/2011/05/an_operadic_introduction_to_en.html">post</a> last year at the <a href="http://golem.ph.utexas.edu/category/">n-category cafe</a>. As I talked about <a href="http://blog.sigfpe.com/2008/10/operads-and-their-monads.html">here</a>, there's a close relationship between monads and operads. Operads are a bit like container monads where the containers don't contain anything, but just have holes where contents could be placed. This makes operads a better place to work because you don't have the awkward issue I started with: having to disallow lists of value/probability pairs where the same value can appear more than once. Nonetheless, in (unrestricted) Haskell monads you don't have <tt>Eq</tt> available so you can't actually have definitions of <tt>return</tt> or <tt>>>=</tt> that can notice the equality of two elements. If such definitions were possible, the grouping law would no longer work as stated above.<br /><br /><b>Crossed homomorphisms</b><br />The generalised grouping law even makes sense for very different monads. For the Writer monad the law gives the definition of a <a href="http://en.wikipedia.org/wiki/Group_cohomology#H1">crossed homomorphism</a>.
It's pretty weird seeing a notion from group cohomology emerge like this and I recommend skipping to the final section unless you care about this sort of thing. But if you do, this is related to research I did a long time ago. The code in this section tests that the <a href="http://en.wikipedia.org/wiki/Schwarzian_derivative">Schwarzian derivative</a> really does give rise to a crossed homomorphism.<br /><br />Firstly let me set up some automatic differentiation code:<br /><br /><pre>> data D a = D { re::a, im::a } deriving (Show, Ord, Eq)<br /><br />> instance Num a => Num (D a) where<br />>     fromInteger n = D (fromInteger n) 0<br />>     D a a'+D b b' = D (a+b) (a'+b')<br />>     D a a'*D b b' = D (a*b) (a*b'+a'*b)<br />>     D a a'-D b b' = D (a-b) (a'-b')<br /><br />> instance Fractional a => Fractional (D a) where<br />>     fromRational n = D (fromRational n) 0<br />>     D a a'/D b b' = let q = 1/b in D (a*q) ((-a*b'+a'*b)*q*q)<br /><br />> lift x = D x 0<br /><br />> d f x = im (f (D x 1))<br /><br />> raised f = re . f . lift<br />> raised2 = raised . raised<br />> raised3 = raised2 . raised<br /><br /></pre>The <tt>Cn</tt> are the <tt>n</tt>-times (automatically) differentiable functions.
Unfortunately the <tt>Endo</tt> defined in <tt>Data.Monoid</tt> acts the wrong way round from what I want so I need a <tt>Dual</tt>:<br /><br /><pre>> type C1 = Dual (Endo (D Double))<br />> type C3 = Dual (Endo (D (D (D Double))))<br />> type C4 = Dual (Endo (D (D (D (D Double)))))<br /><br />> instance Eq (Endo (D Double))<br />> instance Ord (Endo (D Double))<br /><br /></pre>A silly <tt>Show</tt> instance that simply evaluates a function at a number I chose randomly: 1.234.<br /><br /><pre>> instance Show (Endo (D Double)) where<br />>     show (Endo f) = show (f 1.234)<br /><br />> instance Num C1 where<br />>     fromInteger n = Dual (Endo (\x -> fromInteger n))<br />>     Dual (Endo f)+Dual (Endo g) = Dual (Endo (\x -> f x + g x))<br />>     Dual (Endo f)-Dual (Endo g) = Dual (Endo (\x -> f x - g x))<br />>     Dual (Endo f)*Dual (Endo g) = Dual (Endo (\x -> f x * g x))<br /><br />> instance Fractional C1 where<br />>     fromRational n = Dual (Endo (\x -> fromRational n))<br />>     Dual (Endo f)/Dual (Endo g) = Dual (Endo (\x -> f x / g x))<br /><br />> newtype Q a = Q (Writer C4 a) deriving (Monad, Functor)<br /><br /></pre>We can give <tt>Q a</tt> a geometrical interpretation. The underlying type is a pair <tt>(a, C4)</tt>. If we think of elements of <tt>C4</tt> as <a href="http://en.wikipedia.org/wiki/Manifold#Charts">charts</a> on a piece of Riemann surface then for any <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle">, an element <img src="https://chart.googleapis.com/chart?cht=tx&chl=%28%5cpsi%2c+z%29" style="vertical-align:middle"> of <tt>(a, C4)</tt> represents a local piece of a section of the <img src="https://chart.googleapis.com/chart?cht=tx&chl=n" style="vertical-align:middle">th tensor power of the canonical bundle. Ie. we can think of it as representing <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5cpsi+dz%5en" style="vertical-align:middle">.
I'll concentrate on the case <img src="https://chart.googleapis.com/chart?cht=tx&chl=n%3d2" style="vertical-align:middle"> which gives quadratic differentials. We can think of an element of <tt>((a, C4), C4)</tt> as forms where we're composing two charts. We can collapse down to an ordinary chart by using the chain rule. Here's the code:<br /><br /><pre>> instance Algebra Q C1 where<br />>     expectation (Q ma) = let (Dual (Endo a), Dual (Endo f)) = runWriter ma<br />>                          in Dual (Endo (\x -> a (raised3 f x)*(raised2 (d f) x)^2))<br /><br /></pre>Now we can define the Schwarzian derivative:<br /><br /><pre>> schwarzian f x = let f0 = raised3 f x<br />>                      f1 = raised2 (d f) x<br />>                      f2 = raised (d $ d f) x<br />>                      f3 = (d $ d $ d f) x<br />>                  in f3/f1-1.5*(f2/f1)^2<br /><br /></pre>And somewhat bizarrely, we now have a generalised entropy:<br /><br /><pre>> instance Diverse Q C1 where<br />>     entropy (Q ma) = let (_, Dual (Endo f)) = runWriter ma<br />>                      in Dual (Endo (\x -> schwarzian f x))<br /><br /></pre>This is the construction that gives rise to the <a href="http://en.wikipedia.org/wiki/Virasoro_algebra">Virasoro algebra</a> which plays such an important role in String Theory.<br /><br /><b>Some tests</b><br />And here's a bunch of tests.
I'd have used <tt>QuickCheck</tt> but it won't install for me today...<br /><br /><pre>> test :: (Algebra m t, Diverse m t, Num t, Show t, Functor m, Monad m) => m (m x) -> IO ()<br />> test x = do<br />>     print (left x, right x)<br /><br />> main = do<br />>     test $ L [L [1, 2, 3], L [2, 3, 4], L [1], L [5], L [2, 7::Int]]<br />>     test $ P [(P [(0, 0.5), (1, 0.5)], 0.5), (P [(2, 0.5), (3::Int, 0.5)], 0.5::Float)]<br />>     test $ T [(T [(0, 0.5), (1, 0.5)], 0.5), (T [(2, 0.5), (3::Int, 0.5)], 0.5::Float)]<br />>     test $ Leaf (Leaf 1 `Fork` Leaf 2) `Fork` Leaf (Leaf 3 `Fork` (Leaf 4 `Fork` Leaf 5))<br />>     test $ (Q (writer<br />>             (Q (writer (Dual (Endo (\x -> x)),<br />>                         Dual (Endo (\x -> x^2+1)))),<br />>              Dual (Endo (\x -> (2+x)/(3+x*x))))) :: Q (Q C3))<br /></pre><h2 style="text-align: left;">Overloading Python list comprehension</h2><div>Posted by sigfpe, 2012-03-17</div><b>Introduction</b><br />Python is very flexible in the way it allows you to overload various features of its syntax. For example most of the binary operators can be overloaded. But one part of the syntax that can't be overloaded is list comprehension ie. expressions like <tt>[f(x) for x in y]</tt>.<br /><br />What might it mean to overload this notation? Let's consider something simpler first, overloading the binary operator <tt>+</tt>. The expression <tt>a+b</tt> is interpreted as <tt>a.__add__(b)</tt> if <tt>a</tt> is of class type. So overloading <tt>+</tt> means nothing more than writing a function. So if we can rewrite list comprehensions in terms of a function (or functions) then we can overload the notation by providing alternative definitions for those functions. Python doesn't provide a facility for doing this directly, but we can at least think about what it might mean to do this.
Later we'll see how to tweak the Python interpreter to make it possible.<br /><br /><b>map</b><br />Consider the expression<br /><pre>[a for x in y]<br /></pre>Here the single letter variables are 'metavariables' representing fragments of Python code. To a good approximation this is equal to:<br /><pre>map(lambda x: a, y)<br /></pre>(BTW Everything I say here is "to a good approximation". Python is an incredibly complex language and I'm not good enough at it to make any categorical statements about when one fragment of code is the same as another.)<br /><br />So it's tempting to see list comprehensions as syntactic sugar for <tt>map</tt>, in which case one approach to overloading comprehension is to consider interpreting it in terms of replacements for <tt>map</tt>. But this isn't a very powerful overloading. It just gives us a slightly different way to write something that's already straightforward.<br /><br /><b>concatMap</b><br />Another reason for not simply seeing list comprehension in terms of <tt>map</tt> is that nested list comprehensions need another operation. Consider<br /><pre>[(y, z) for y in [1, 2] for z in ['a', 'b']]<br /></pre>This isn't quite the same as<br /><pre>[[(y, z) for z in ['a', 'b']] for y in [1, 2]]<br /></pre>but it's close. The latter produces nested lists whereas the first gives one flat list. We can think of nested comprehensions as applying a flattening operation. 
Let's use list comprehension to implement flattening:<br /><pre>def concat(xs):<br />    return [y for x in xs for y in x]<br /></pre>We now write our nested comprehension as:<br /><pre>concat([[(y, z) for z in ['a', 'b']] for y in [1, 2]])<br /></pre>We know how to write non-nested comprehensions using <tt>map</tt> so we get:<br /><pre>concat(map(lambda y: [(y, z) for z in ['a', 'b']], [1, 2]))<br /></pre>And rewriting the inner comprehension we get:<br /><pre>concat(map(lambda y: map(lambda z: (y, z), ['a', 'b']), [1, 2]))<br /></pre>Every time we add another level of nesting we're going to need another <tt>concat</tt>. But the innermost <tt>map</tt> doesn't have a <tt>concat</tt>. Purely for reasons of symmetry we can ensure every <tt>map</tt> has a <tt>concat</tt> by enclosing the innermost element as a singleton list:<br /><pre>concat(map(lambda y: concat(map(lambda z: [(y, z)], ['a', 'b'])), [1, 2]))<br /></pre>Every <tt>map</tt> has a <tt>concat</tt> so we can simplify slightly. Let's define:<br /><pre>def concatMap(f, xs):<br />    return [y for x in xs for y in f(x)]<br /><br />def singleton(x):<br />    return [x]<br /></pre>Our expression becomes:<br /><pre>concatMap(lambda y: concatMap(lambda z: singleton((y, z)), ['a', 'b']), [1, 2])<br /></pre>Importantly we've completely rewritten the comprehension in terms of <tt>concatMap</tt> and <tt>singleton</tt>. By changing the meaning of these functions we can change the meaning of comprehension notation, or at least we could if the Python interpreter defined comprehension this way. It doesn't, but we can still reason about it. Although any comprehension that doesn't use <tt>if</tt>s can be rewritten to use these functions, I won't give a formal description of the procedure. Instead I'll provide code to perform the rewrite later. While I'm at it, I'll also handle the <tt>if</tt>s.<br /><br /><b>Laws</b><br />Freely redefining <tt>singleton</tt> and <tt>concatMap</tt> to redefine comprehension could get weird.
If we're going to redefine them we should at least try to define them so that list comprehension still has some familiar properties. For example, for <tt>y</tt> a list we usually expect:<br /><pre>y == [x for x in y]<br /></pre>In other words<br /><pre>y == concatMap(lambda x: singleton(x), y)<br /></pre>At this point I could give a whole bunch more laws but it's time to own up.<br /><br /><b>Monads</b><br />A pair of functions <tt>singleton</tt> and <tt>concatMap</tt>, along with a bunch of laws, are essentially the same thing as a <a href="http://en.wikipedia.org/wiki/Monad_%28functional_programming%29">monad</a>. In Haskell, <tt>concatMap</tt> is usually called <tt>bind</tt> and <tt>singleton</tt> is called <tt>return</tt>. What I've done here is show how Wadler's <a href="http://homepages.inf.ed.ac.uk/wadler/topics/monads.html">Comprehending Monads</a> paper might look in Python. Haskell has specialised monad notation built into its grammar. But what's less well known is that so does Python! The catch is that although the grammar is right, the semantics can't be generalised beyond lists.<br /><br /><b>Monad-Python</b><br />One great thing about Python is that there seem to be libraries for working with every aspect of Python internals. So it's fairly easy to write a simple Python interpreter that rewrites list comprehensions to use <tt>singleton</tt> and <tt>concatMap</tt>. I've placed the source on <a href="https://github.com/dpiponi/Monad-Python">github</a>. Use <tt>mpython.py</tt> instead of <tt>python</tt> as your interpreter. I've tested it with Python 2.6 and 2.7.<br /><br />When using <tt>mpython</tt>, list comprehension uses whatever definitions of <tt>__concatMap__</tt> and <tt>__singleton__</tt> are currently in scope.
By default they are the definitions I gave above so we get something close to the usual list comprehension.<br /><br />An example of the kind of code you can run with <tt>mpython.py</tt> is:<br /><pre>import math<br /><br />def __concatMap__(k, m):<br />    return lambda c:m(lambda a:k(a)(c))<br /><br />def __singleton__(x):<br />    return lambda f:f(x)<br /><br />def callCC(f):<br />    return lambda c:f(lambda a:lambda _:c(a))(c)<br /><br />def __fail__():<br />    raise Exception("Failure is not an option for continuations")<br /><br />def ret(x):<br />    return __singleton__(x)<br /><br />def id(x):<br />    return x<br /><br />def solve(a, b, c):<br />    return callCC(lambda throw: [((-b-d)/(2*a), (-b+d)/(2*a))<br />        for a0 in (throw("Not quadratic") if a==0 else ret(a))<br />        for d2 in ret(b*b-4*a*c)<br />        for d in (ret(math.sqrt(d2)) if d2>=0 else throw("No roots"))<br />    ])<br /><br />print solve(1, 0, -9)(id)<br />print solve(1, 1, 9)(id)<br />print solve(0, 1, 9)(id)<br /></pre>I have defined our functions so that comprehension syntax gives us the continuation monad. This makes continuation passing style relatively painless in Python. (At least easier than chaining many <tt>lambda</tt>s.) I have then defined <a href="http://en.wikibooks.org/wiki/Haskell/Continuation_passing_style#callCC">callCC</a> to be similar to its definition in Haskell. There are many uses for <tt>callCC</tt> including the implementation of <tt>goto</tt>. Above I use it in a trivial way to throw exceptions.<br /><br /><b>Conclusion</b><br />My script <tt>mpython.py</tt> is a long way from an industrial strength interpreter and I'm not proposing the above as an extension to Python. My goal was simply to show how Haskell-style monads are not as alien to Python as you might think. In fact, it's reasonable to say that Python already supports one flavour of specialised monad syntax.
Most users don't realise it as such because it has been hard-wired to work with just one monad, lists.<br /><br />BTW if you attempt to implement all of the other Haskell monads you'll find that Haskell behaves a little differently because of its laziness. You can recover some of that laziness by careful use of continuations in Python. But I've no time to go into that now.<br /><br /><h2 style="text-align: left;">Using Lawvere theories to combine effects</h2><div>Posted by sigfpe, 2012-02-11</div><pre>> {-# LANGUAGE MultiParamTypeClasses, ExplicitForAll, RankNTypes, FlexibleInstances, FlexibleContexts, TypeSynonymInstances #-}<br /><br />> import Data.Monoid<br />> import Data.Functor.Identity<br />> import Control.Monad.Writer<br /><br /></pre>In an <a href="http://blog.sigfpe.com/2009/12/where-do-monads-come-from.html">earlier post</a> I talked about how monads arise from free algebras. Let me recap a bit.<br /><br />In <a href="http://blog.sigfpe.com/2012/02/lawvere-theories-made-bit-easier.html">Part 1</a> I described algebras. They're sets with operations on them satisfying some laws. We can build new elements of an algebra from old ones by using its operations. Eg. if <tt>x</tt> and <tt>y</tt> are in an algebra then <tt>x `mappend` y</tt> must be in it too. Starting with a bunch of symbols, thought of as leaves, we can consider the set of all expression trees we can build from them. If we consider pairs of trees to be equivalent if the laws say the corresponding expressions are equal, then the set of trees itself forms an algebra known as a free algebra (for the given theory).<br /><br />Let's start with some code.
This type class says that the type <tt>b</tt> has leaves of type <tt>a</tt>:<br /><br /><pre>> class Free a b where<br />>     leaf :: a -> b<br /><br /></pre><b>Effects from monoids</b><br />Now we can make the type of all trees built from <tt>Monoid</tt> operations and including all leaves of type <tt>a</tt>:<br /><br /><pre>> data FreeMonoid a = FreeMonoid (forall b. (Monoid b, Free a b) => b)<br /><br /></pre>And we have:<br /><br /><pre>> instance Monoid (FreeMonoid a) where<br />>     mempty = FreeMonoid mempty<br />>     FreeMonoid a `mappend` FreeMonoid b = FreeMonoid (a `mappend` b)<br /><br /></pre>Unfortunately elements like <tt>e1</tt> and <tt>e2</tt> ought to be equal but Haskell doesn't know this:<br /><br /><pre>> e1, e2 :: FreeMonoid Char<br />> e1 = FreeMonoid (leaf 'a' `mappend` (leaf 'b' `mappend` leaf 'c'))<br />> e2 = FreeMonoid ((leaf 'a' `mappend` leaf 'b') `mappend` leaf 'c')<br /><br /></pre>Instead we can manually construct a type that does respect equality in monoids. Elements of <tt>FreeMonoid</tt> are binary trees with a <tt>mappend</tt> at each node. Associativity means that we can always replace a tree with an equivalent one where the left branch is a leaf. We can also use the laws to eliminate any occurrence of <tt>mempty</tt>. So every element of <tt>FreeMonoid a</tt> is equivalent to one of the form:<br /><pre>leaf x1 `mappend` (leaf x2 `mappend` (... mempty))<br /></pre><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-BagBCT39yFU/TzaGZRkhu3I/AAAAAAAAA7c/Aaq9o_2PduU/s1600/list1.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="180" width="400" src="http://3.bp.blogspot.com/-BagBCT39yFU/TzaGZRkhu3I/AAAAAAAAA7c/Aaq9o_2PduU/s400/list1.png" /></a></div><br />In other words, free monoids are lists. We can make this explicit.
The standard prelude already makes <tt>[]</tt> an instance of <tt>Monoid</tt> so we just need:<br /><br /><pre>> instance Free a [a] where<br />>     leaf x = [x]<br /><br /></pre>Here's the isomorphism (modulo tree equivalence):<br /><br /><pre>> iso1 :: FreeMonoid a -> [a]<br />> iso1 (FreeMonoid x) = x<br /><br />> iso1' :: [a] -> FreeMonoid a<br />> iso1' [] = FreeMonoid mempty<br />> iso1' (a : as) = let FreeMonoid r = iso1' as<br />>                  in FreeMonoid (leaf a `mappend` r)<br /><br /></pre>As I talked about in that <a href="http://blog.sigfpe.com/2009/12/where-do-monads-come-from.html">earlier article</a>, free algebras give monads and the trees representing expressions in the algebra can be thought of as abstract syntax trees for domain specific languages. In this case it's the usual list monad. So the <tt>Monoid</tt> type class gives us a language for talking about non-determinism. The operation <tt>mappend</tt> gives us a way to "fork" a process and <tt>mempty</tt> gives us a way to "kill a thread". Here's an example using non-determinism to search for some Pythagorean triples:<br /><br /><pre>> test1 :: [(Int, Int, Int)]<br />> test1 = do<br />>     a <- return 3 `mappend` return 4<br />>     b <- return 4 `mappend` return 5<br />>     c <- return 5 `mappend` return 6<br />>     if a*a+b*b==c*c then return (a, b, c) else mempty<br /><br /></pre><b>Effects from M-sets</b><br />We can do exactly the same for <img src="https://chart.googleapis.com/chart?cht=tx&chl=M" style="vertical-align:middle">-sets.<br /><br /><pre>> class Monoid m => MSet m s where<br />>     act :: m -> s -> s<br /><br />> data FreeMSet w a = FreeMSet (forall b. (MSet w b, Free a b) => b)<br /><br />> instance Monoid w => MSet w (FreeMSet w a) where<br />>     m `act` FreeMSet b = FreeMSet (m `act` b)<br /><br /></pre>Again we have the problem that <tt>FreeMSet</tt> doesn't automatically make equivalent elements equal.
But it's not hard to see that every element of <tt>FreeMSet</tt> is equivalent to one of the form:<br /><pre>m `act` (leaf x)<br /></pre>So the free <img src="https://chart.googleapis.com/chart?cht=tx&chl=M" style="vertical-align:middle">-set on the set of variables <img src="https://chart.googleapis.com/chart?cht=tx&chl=S" style="vertical-align:middle"> is simply the set of pairs <img src="https://chart.googleapis.com/chart?cht=tx&chl=M%5ctimes+S" style="vertical-align:middle">. This is the basis of Haskell's writer monad:<br /><br /><pre>> instance Monoid w => MSet w (Writer w a) where<br />>     act w1 m = let (a, w2) = runWriter m in WriterT (Identity (a, w1 `mappend` w2))<br /><br />> instance Monoid w => Free a (Writer w a) where<br />>     leaf x = return x<br /><br /></pre>Here's the isomorphism (again treating equivalent elements of <tt>FreeMSet</tt> as equal):<br /><br /><pre>> iso2 :: Monoid w => FreeMSet w a -> Writer w a<br />> iso2 (FreeMSet x) = x<br /><br />> iso2' :: Writer w a -> FreeMSet w a<br />> iso2' m = let (a, w) = runWriter m in FreeMSet (act w (leaf a))<br /><br /></pre>And now the <img src="https://chart.googleapis.com/chart?cht=tx&chl=M" style="vertical-align:middle">-set operation gives us an interface to an effect. This time the side effect of accumulating in a monoid:<br /><br /><pre>> test2 :: Writer String Int<br />> test2 = do<br />>     act "foo" (return ())<br />>     a <- return 2<br />>     act "bar" (return ())<br />>     b <- return (10*a)<br />>     return b<br /><br /></pre><b>Combining effects</b><br /><br />And now we can finally combine the two effects of non-determinism and accumulation. We make the free algebra that is both a monoid and an <img src="https://chart.googleapis.com/chart?cht=tx&chl=M" style="vertical-align:middle">-set:<br /><br /><pre>> data FreeMMonoid w a = FreeMMonoid (forall b. 
(Monoid b, MSet w b, Free a b) => b)<br /><br />> instance Monoid w => Monoid (FreeMMonoid w a) where<br />>     mempty = FreeMMonoid mempty<br />>     FreeMMonoid a `mappend` FreeMMonoid b = FreeMMonoid (a `mappend` b)<br /><br />> instance Monoid w => MSet w (FreeMMonoid w a) where<br />>     m `act` FreeMMonoid b = FreeMMonoid (m `act` b)<br /><br /></pre>Again we have the problem that equivalent elements aren't recognised as equal so we have to manually find a suitable type. For this we need to use the compatibility notion I introduced in Part 1. We can take 2 variables <img src="https://chart.googleapis.com/chart?cht=tx&chl=x" style="vertical-align:middle"> and <img src="https://chart.googleapis.com/chart?cht=tx&chl=y" style="vertical-align:middle"> and write them in a 1 by 2 array:<br /><blockquote><img src="https://chart.googleapis.com/chart?cht=tx&chl=%0a%5cbegin%7barray%7d%7bcc%7d%0ax+%26+y%0a%5cend%7barray%7d%0a" style="vertical-align:middle"><br /></blockquote>Apply <tt>mappend</tt> horizontally and <tt>act</tt> vertically to get:<br /><pre>m `act` (x `mappend` y)<br /></pre>Now apply <tt>act</tt> vertically and then <tt>mappend</tt> horizontally to get:<br /><pre>(m `act` x) `mappend` (m `act` y)<br /></pre>The law we want is:<br /><pre>m `act` (x `mappend` y) == (m `act` x) `mappend` (m `act` y)<br /></pre>Given an arbitrary tree in <tt>FreeMMonoid</tt> we can use this law to "push" all occurrences of <tt>act</tt> inwards. Ultimately every element can be written uniquely in the form:<br /><pre>act m1 (leaf x1) `mappend` (act m2 (leaf x2) `mappend` (... 
mempty)<br /></pre><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-NFKm6QlR144/TzaH0NtHeeI/AAAAAAAAA7s/B0gRXbgBk6E/s1600/list2.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="207" width="400" src="http://4.bp.blogspot.com/-NFKm6QlR144/TzaH0NtHeeI/AAAAAAAAA7s/B0gRXbgBk6E/s400/list2.png" /></a></div><br />We can then use the same argument as above to show that we end up with a list of pairs, each an element of <img src="https://chart.googleapis.com/chart?cht=tx&chl=M%5ctimes+S" style="vertical-align:middle">. This is exactly what we get if we apply the <tt>WriterT</tt> monad transformer to <tt>[]</tt>. Here are the relevant instances:<br /><br /><pre>> instance Monoid w => Monoid (WriterT w [] a) where<br />>     mempty = WriterT []<br />>     WriterT xs `mappend` WriterT ys = WriterT (xs ++ ys)<br /><br />> instance Monoid w => MSet w (WriterT w [] a) where<br />>     m `act` WriterT xs = WriterT $ map (\(x, w) -> (x, m `mappend` w)) xs<br /><br />> instance Monoid w => Free a (WriterT w [] a) where<br />>     leaf x = return x<br /><br /></pre>Here's the isomorphism, though we won't use it:<br /><br /><pre>> iso3 :: Monoid w => FreeMMonoid w a -> WriterT w [] a<br />> iso3 (FreeMMonoid x) = x<br /><br />> iso3' :: Monoid w => WriterT w [] a -> FreeMMonoid w a<br />> iso3' m = let xws = runWriterT m in FreeMMonoid $<br />>     foldr mappend mempty $ map (\(x, w) -> act w (leaf x)) xws<br /><br /></pre>The monad <tt>WriterT (Product Float) []</tt> is in fact the <a href="http://blog.sigfpe.com/2007/02/monads-for-vector-spaces-probability.html">probability monad</a>. 
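Concretely, a value of this monad is a list of (outcome, weight) pairs, <tt>act</tt> rescales every weight, and <tt>mappend</tt> pools alternatives. As a rough illustration of why the compatibility law holds for this representation, here is a minimal standalone sketch on plain lists. It is not part of this post's literate code, and the names <tt>Dist</tt>, <tt>scale</tt> and <tt>choose</tt> are hypothetical stand-ins for the type, <tt>act</tt> and <tt>mappend</tt>:<br /><br /><pre>-- Hypothetical standalone sketch: Dist, scale and choose are illustrative<br />-- names, not definitions from this post or from any library.<br />type Dist a = [(a, Float)]<br /><br />-- Plays the role of act: multiply every weight by p.<br />scale :: Float -> Dist a -> Dist a<br />scale p xs = [(x, p * w) | (x, w) <- xs]<br /><br />-- Plays the role of mappend: pool the alternatives.<br />choose :: Dist a -> Dist a -> Dist a<br />choose = (++)<br /><br />main :: IO ()<br />main = do<br />    let x = [(False, 1)] :: Dist Bool<br />        y = [(True, 1)] :: Dist Bool<br />    -- Compare both sides of the compatibility law on one example.<br />    print (scale 0.5 (choose x y) == choose (scale 0.5 x) (scale 0.5 y))<br /></pre>Scaling after pooling and pooling after scaling touch each pair exactly once, so the two sides agree. The probability monad packages this same representation behind the <tt>Monoid</tt>, <tt>MSet</tt> and <tt>Free</tt> interfaces. 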
Here's an example of its use:<br /><br /><pre>> coin :: (Monoid a, MSet (Product Float) a, Free Bool a) => a<br />> coin = act (Product 0.5 :: Product Float) (leaf False)<br />>            `mappend`<br />>        act (Product 0.5 :: Product Float) (leaf True)<br /><br /></pre>Compute the unnormalised conditional probability distribution on a pair of coin tosses, given that the first coin can't be <tt>True</tt> unless the second one is:<br /><br /><pre>> test3 :: WriterT (Product Float) [] (Bool, Bool)<br />> test3 = do<br />>     coin1 <- coin<br />>     coin2 <- coin<br />>     if coin1>coin2 then mempty else return (coin1, coin2)<br /><br /></pre>(Compare with Eric Kidd's <a href="http://www.randomhacks.net/articles/2007/02/21/refactoring-probability-distributions">article</a> that also 'refactors' probability theory.)<br /><br /><b>What just happened?</b><br />Something miraculous just happened, though it may have been lost in the details. We combined the list monad and the writer monad to get a new monad. We did it without using monad transformers and without specifying an order for the two monads. It just so happens in this case that the result was the same as using a monad transformer.<br /><br /><b>M-set with M-set</b><br />We can try other products of theories. It's tricky to deal with a theory combined with itself because repeating a type class in a context doesn't do anything. We need to make another type class that looks exactly like <tt>MSet</tt> but with different names. The result is that the product of the theory of <img src="https://chart.googleapis.com/chart?cht=tx&chl=M" style="vertical-align:middle">-sets and the theory of <img src="https://chart.googleapis.com/chart?cht=tx&chl=N" style="vertical-align:middle">-sets is the theory of <img src="https://chart.googleapis.com/chart?cht=tx&chl=M%5ctimes+N" style="vertical-align:middle">-sets. This agrees with what we'd get from using monad transformers. It also agrees with intuition. 
<img src="https://chart.googleapis.com/chart?cht=tx&chl=M" style="vertical-align:middle">-sets correspond to the effect of accumulating data in a monoid. The product theory corresponds to using two accumulators simultaneously.<br /><br />(This makes me think type classes should take as arguments the names of the operations within them. That way a type can be an instance of the same type class in multiple ways. Compare with Agda modules.)<br /><br /><b>Monoid with monoid</b><br />This example illustrates why we can't expect a programming language to use the above method to combine theories. If an algebra has two multiplication operators with identities on it, and the two operators are compatible, then something <a href="http://en.wikipedia.org/wiki/Eckmann%E2%80%93Hilton_argument">surprising</a> happens. The multiplications turn out to be the same operation. What's more, the operation is commutative. So the product of the theory of monoids with itself is the theory of commutative monoids. The elements of a free commutative monoid are finite multisets. Multisets require a very different implementation to lists and I doubt any automatic algebra combiner in the near future could discover one. (The Eckmann-Hilton argument also appears <a href="http://homotopytypetheory.org/2011/03/26/higher-fundamental-groups-are-abelian/">here</a>.)<br /><br /><b>The compatibility condition</b><br />To form the product of two theories we add in extra laws to ensure commutativity. If we don't add in such laws we get the sum of two theories. For the example theories I used here, these sums can lead to quite complex types. 
For example, the sum of the theory of <img src="https://chart.googleapis.com/chart?cht=tx&chl=M" style="vertical-align:middle">-sets and <img src="https://chart.googleapis.com/chart?cht=tx&chl=N" style="vertical-align:middle">-sets is, I think, the theory of <img src="https://chart.googleapis.com/chart?cht=tx&chl=M%5coplus+N" style="vertical-align:middle">-sets where <img src="https://chart.googleapis.com/chart?cht=tx&chl=M%5coplus+N" style="vertical-align:middle"> is the "free product" of monoids. I think this is a bit of a messy object from the perspective of types. Other effects, however, may behave nicely with respect to <img src="https://chart.googleapis.com/chart?cht=tx&chl=%5coplus" style="vertical-align:middle">. I haven't yet investigated.<br /><br /><b>Conclusion</b><br />If you don't mind computing the relevant types by hand, there are perfectly good alternatives to monad transformers for combining effects. But it seems very difficult to automatically combine theories. In fact, I expect finding canonical forms for the elements of free algebras for a product theory isn't even computable. So this approach isn't going to replace monad transformers any time soon.<br /><br /><b>Exercise</b><br />Make a multiplication table showing the result of forming the product of algebras for lots of useful effects.sigfpehttp://www.blogger.com/profile/08096190433222340957noreply@blogger.com8