
Sunday, October 30, 2011

Quick and dirty reinversion of control

It's taken for granted by many people that Haskell and static types are incompatible with prototyping and quick-and-dirty hacks.

I wanted to put together some OpenGL code driven by a script for how a bunch of graphics should be displayed. It was essentially an imperative specification for my program. For quick and dirty hacks, GLUT is tolerable. But even when programming in C/C++ it doesn't support a straightforward imperative style, because it uses inversion of control. Many graphics applications are written in a state machine style where the state machine gets to tick once each time there is an event callback. This really doesn't fit the imperative script style.

But it is possible to reinvert control in any language that supports continuations. And that includes languages like Python that support linear continuations in the form of generators. But I'm using Haskell here.

Continuations reify the remainder of a computation. Or in more down to earth language: they allow you to grab the stuff you're about to do as a function, put it on ice for a while, and then carry on doing it later. So imagine we had a block of imperative code and that we'd like, at each GLUT callback, to make some progress through this block. We can use continuations like this: each time we want to yield control back to the main loop we simply grab the remainder of our 'script' as a continuation and make it the callback to be executed next time GLUT is ready.
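
Here's a minimal, GLUT-free sketch of the trick, assuming only mtl. The IORef stands in for GLUT's callback slot, and the names (runDemo, yield') are just for this illustration:

```haskell
import Control.Monad.Cont
import Data.IORef
import Control.Monad (join)

-- A minimal GLUT-free sketch of reinversion. The IORef stands in for
-- GLUT's callback slot: each yield stores the rest of the script there,
-- and the "main loop" resumes it by running what was stored.
runDemo :: IO [String]
runDemo = do
  logRef  <- newIORef []                  -- record of what the script did
  nextRef <- newIORef (return ())         -- the parked continuation
  let say s  = liftIO $ modifyIORef logRef (++ [s])
      yield' = ContT $ \k -> writeIORef nextRef (k ())
  flip runContT return $ do
    say "step 1"
    yield'                                -- stop here, parking the rest
    say "step 2"
  join (readIORef nextRef)                -- resume the parked script
  readIORef logRef

main :: IO ()
main = runDemo >>= print                  -- ["step 1","step 2"]
```

Running the script executes "step 1" and then parks; the explicit `join` plays the role of GLUT calling the stored callback.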

The slight wrinkle is that OpenGL/GLUT calls use IO. To combine IO and continuations we need the ContT monad transformer.

I'll do everything except the yield function first and get back to that at the end.

Some standard library imports:

> import Graphics.UI.GLUT
> import Control.Monad.Cont

Some simple code to draw a line from left to right:

> display :: GLdouble -> IO ()
> display y = do
>   clear [ColorBuffer]
>  
>   renderPrimitive LineStrip $ do
>     vertex (Vertex2 (-1) (-y))
>     vertex (Vertex2 1 y)

>   swapBuffers
>   postRedisplay Nothing

Some standard OpenGL/GLUT setup:

> main = do
>   (progname, _) <- getArgsAndInitialize
>   initialWindowSize  $= Size 500 500
>   initialDisplayMode $= [DoubleBuffered, RGBMode]
>   createWindow "Bounce!"

>   matrixMode $= Modelview 0
>   loadIdentity

>   matrixMode $= Projection
>   loadIdentity
>   ortho (-1) 1 (-1) 1 (-1) 1

Our script is called before the main loop.

>   imperative
>   mainLoop

And now comes the actual script. Apart from the liftIO calls this should be almost as easy to read as BASIC programming from the days of yore:

> imperative = flip runContT return $ do

>   liftIO $ print "Start!"

>   forever $ do

>     forM_ [-1, -0.992 .. 1.0] $ \y -> do
>       render $ display y
>       yield

>     liftIO $ print "Bounce!"

>     forM_ [-1, -0.992 .. 1.0] $ \y -> do
>       render $ display (-y)
>       yield

>     liftIO $ print "Bounce!"
>     yield

The first thing to note is that render doesn't actually do any rendering. At the end of the day we can't tell GLUT when to render; it only calls us. So instead render tells GLUT what to do next time it's in the mood for a bit of rendering:

> render f = liftIO $ displayCallback $= f

That leaves one thing to explain: yield. It needs to grab the remainder of the script and package it up in a form suitable for installation as an idle callback. But there's a catch: continuations are notorious for making your head explode. If you're throwing together a quick and dirty hack, that's the last thing you need. Here's where static types come to the rescue. As Conor McBride points out, we want to just do the easy thing and follow gravity downwards.

So first we try to guess the type of yield. We know we're working with a ContT r IO monad. So its type is going to be ContT r IO a for some r and a. There's no particular type of data we want to get out of yield, it's just a thing we want executed. So we can guess that a is (), the type () being the generic filler type when we don't actually have any data.

Let's look at the definition of ContT:

newtype ContT r m a = ContT {
    runContT :: (a -> m r) -> m r
}

The type r is the final return type from our continuation. We're not interested in a return value, we just want to *do* stuff. So we expect r to be () as well.

So yield must essentially be of type (() -> IO ()) -> IO ().

So we want to concoct something of this type using GLUT's idleCallback function. As yield must take a function as argument it must look something like:

yield = ContT $ \f -> ...

We know that f is of type () -> IO (). So there's only one thing we can do with it: apply it to (). That gives us something of type IO (). That's precisely the type of GLUT's idleCallback. So we put it all together:

> yield = ContT $ \f -> idleCallback $= Just (f ())

The code now works. I didn't have to spend even a moment thinking about the meaning of a continuation. Implementing yield was about as hard as putting together a jigsaw puzzle with three pieces. There are only so many ways you can put the pieces together.

And that's a simple example of why I often like to write quick-and-dirty code with a statically typed language.

(Oh, and I'm not trying to take part in a war. I like to prototype in many different languages, some of which are dynamically typed.)

PS Note also that the above code illustrates one way to avoid IORefs in GLUT code.

Saturday, August 13, 2011

Computing errors with square roots of infinitesimals.

Abstract
Automatic differentiation (AD) gives a way to carry out uncertainty propagation. But used in the obvious way it leads to bias. This article introduces "square roots of infinitesimals" that can be used to give more accurate results.

Introduction
In the real world measurements have errors and we often want to know how much our final answers are affected by those errors. One tool for measuring the sensitivity to errors of our results is calculus. In fact, we can use automatic differentiation to give us a nice way to model error. Here's an implementation:

> import Control.Monad
> import Control.Monad.State
> import Control.Applicative
> import qualified Data.IntMap as I

> infixl 6 .+.
> infixl 7 .*
> infixl 7 *.

> data D a = D a a deriving (Eq, Show)

> instance Num a => Num (D a) where
>   D x a+D x' a' = D (x + x') (a + a')
>   D x a*D x' a' = D (x*x') (a*x' + x*a')
>   negate (D x a) = D (negate x) (negate a)
>   fromInteger n = D (fromInteger n) 0
>   abs _ = error "No abs"
>   signum _ = error "No signum"

> d = D 0 1

As I've talked about before, the value d can be thought of as an infinitesimal number whose square is zero. However, to first order we can replace d with a small number and use it to compute errors. Here's a function to perform such a substitution:

> approx :: Num a => D a -> a -> a
> approx (D x d) e = x+d*e

Suppose we have a square whose side we've measured as 1m to an accuracy of 1cm. We can represent this as:

> sq_side = D 1 0.01

We can now compute the area:

> sq_area = sq_side^2

We get D 1.0 2.0e-2. We can interpret this as meaning the area is 1m² with an accuracy of 0.02m².

We can make "to an accuracy of" more precise. Differentiation models a function locally as an approximate linear function: (1m+δ)² = 1m² + 2mδ + O(δ²). So at 1m, the function to compute area locally scales lengths by 2m. So if the length measurement is actually a sample from a distribution with mean 1m and standard deviation 0.01m, the area is approximately a sample from a distribution with mean 1m² and SD 0.02m².
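
To see the numbers come out concretely, here's a self-contained restatement of the D type above applied to the square example (dd and area are just names for this check):

```haskell
-- Standalone restatement of the D type above, to check the numbers.
data D a = D a a deriving (Eq, Show)

instance Num a => Num (D a) where
  D x a + D x' a' = D (x + x') (a + a')
  D x a * D x' a' = D (x*x') (a*x' + x*a')
  negate (D x a)  = D (negate x) (negate a)
  fromInteger n   = D (fromInteger n) 0
  abs _ = error "No abs"
  signum _ = error "No signum"

-- The infinitesimal d squares to zero:
dd :: D Double
dd = let d = D 0 1 in d * d          -- D 0.0 0.0

-- A 1m side measured to 1cm accuracy, squared:
area :: D Double
area = (D 1 0.01)^2                  -- D 1.0 2.0e-2, i.e. SD 0.02m²
```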

This approach is nice, but sometimes we want a little more accuracy in our estimates. In particular, if you square a sample from a normal distribution with small variance and positive mean, then the nonlinearity of the squaring operation means that samples that are larger than the mean move further away from the mean, when squared, than samples less than the mean. So we should actually expect our area computations to be slightly biased upwards from 1m². Unfortunately, this is a second order effect that isn't visible from looking only at first derivatives.

That's not a problem, we can easily compute second derivatives using automatic differentiation. However, that can complicate things. What happens if we use multiple measurements to compute a quantity? Each one is a different sample from a different distribution and we don't want these measurements to be correlated. If we approach this in the obvious way, when we want to use n measurements we'll need to compute n² partial second derivatives. However, by tweaking AD slightly we'll only need n derivatives.

Square roots of infinitesimals
In addition to the usual infinitesimal d we want to introduce quantities, w_i, that represent independent random "noise" variables that are infinitesimal in size. We'll be interested in expectation values so we'll also need an expectation function, e. We want e(w_i) = 0. But w_i² is always positive so its expectation is always greater than or equal to zero. We want our random variables to be infinitesimal so we pick e(w_i²) = d. We also want e(w_i w_j) = 0 for i ≠ j because of independence. And if the w_i are already infinitesimal, the products d w_i should be zero. So let's define an algebraic structure that captures these relationships. We extend D a by introducing the w_i so that:

w_i² = d,   w_i w_j = 0 for i ≠ j,   d w_i = 0,   d² = 0

Any element of this algebra can be written as x + a d + Σᵢ bᵢ w_i. We represent the bᵢ sparsely by using an IntMap. Here's an implementation:

> data S a = S a a (I.IntMap a) deriving (Eq, Show)

> (.+.) :: Num a => I.IntMap a -> I.IntMap a -> I.IntMap a
> ( *.) :: Num a => a -> I.IntMap a -> I.IntMap a
> (.*)  :: Num a => I.IntMap a -> a -> I.IntMap a
> (.*.) :: Num a => I.IntMap a -> I.IntMap a -> a

> (.+.) = I.unionWith (+)
> a *. v = I.map (a *) v
> v .* b = I.map (* b) v
> a .*. b = I.fold (+) 0 $ I.intersectionWith (*) a b

> instance Num a => Num (S a) where
>   S x a b+S x' a' b' = S (x + x') (a + a') (b .+. b')
>   S x a b*S x' a' b' = S (x*x') (a*x' + x*a' + b.*.b') (x*.b' .+. b.*x')
>   negate (S x a b) = S (negate x) (negate a) (I.map negate b)
>   fromInteger n = S (fromInteger n) 0 I.empty
>   abs _ = error "No abs"
>   signum _ = error "No signum"

Here are the individual w_i:

> w :: Num a => Int -> S a
> w i = S 0 0 (I.fromList [(i, 1)])

We compute expectation values linearly by mapping the w_i to zero:

> e :: Num a => S a -> D a
> e (S x a _) = D x a

We can also represent numbers whose values we know precisely:

> sure x = S x 0 I.empty

Example
Let's revisit the area example. This time we can represent the length of the side of our square as

> sq_side' = 1+0.01*w 0
> sq_area' = sq_side'^2

We get S 1.0 1.0e-4 (fromList [(0,2.0e-2)]). We can directly read off that we have a bias of 10⁻⁴m², which is 1cm². We can encapsulate this as:

> mean f = approx (e f) 1

We can directly read off the variance from the element of the algebra. However, we can also compute the variance using mean. It's just:

> var f = mean ((f-sure (mean f))^2)

(Note that this gives a very slightly different result from the value you can read off directly from the S object. It depends on whether we're measuring the deviation around the unbiased or biased mean. To the order we're considering here the difference is small. Here's var' anyway:

> var' (S _ _ v) = I.fold (+) 0 $ I.map (\x -> x^2) v

)

We can also define covariance:

> cov f g = mean ((f-sure (mean f))*(g-sure (mean g)))
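
As a sanity check on those numbers, here is a condensed, standalone restatement of the S algebra above, specialised to Double, with mean inlined at d = 1 (sqArea and the helper dot are names introduced just for this check):

```haskell
import qualified Data.IntMap as I

-- Condensed restatement of the S algebra above, specialised to Double.
data S = S Double Double (I.IntMap Double) deriving (Eq, Show)

dot :: I.IntMap Double -> I.IntMap Double -> Double
dot a b = sum (I.elems (I.intersectionWith (*) a b))

instance Num S where
  S x a b + S x' a' b' = S (x + x') (a + a') (I.unionWith (+) b b')
  S x a b * S x' a' b' =
    S (x*x') (a*x' + x*a' + dot b b')
      (I.unionWith (+) (I.map (x *) b') (I.map (x' *) b))
  negate (S x a b) = S (negate x) (negate a) (I.map negate b)
  fromInteger n = S (fromInteger n) 0 I.empty
  abs _ = error "No abs"
  signum _ = error "No signum"

w :: Int -> S
w i = S 0 0 (I.singleton i 1)

sure :: Double -> S
sure x = S x 0 I.empty

-- mean with the infinitesimal d set to 1, as in the text
mean :: S -> Double
mean (S x a _) = x + a

var :: S -> Double
var f = mean ((f - sure (mean f))^2)

sqArea :: S
sqArea = (1 + sure 0.01 * w 0)^2
-- mean sqArea ≈ 1.0001: a bias of 1e-4, i.e. 1cm²
-- var sqArea ≈ 3.9999e-4: 4e-4 up to higher-order terms
```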

More functions
We can now follow through just like with automatic differentiation to compute lots more functions. We use the fact that, to the order we're working at,

f(x + a d + Σᵢ bᵢ wᵢ) = f(x) + (a f′(x) + ½ f″(x) Σᵢ bᵢ²) d + f′(x) Σᵢ bᵢ wᵢ

> instance Fractional a => Fractional (S a) where
>   fromRational x = S (fromRational x) 0 I.empty
>   recip (S x a b) = let r = recip x
>                     in S r (-a*r*r+r*r*r*(b.*.b)) ((-r*r)*.b)

> instance Floating a => Floating (S a) where
>   pi = sure pi
>   sin (S x a b) = let s = sin x
>                       c = cos x
>                   in S s (a*c - s/2*(b.*.b)) (c*.b)
>   cos (S x a b) = let s = sin x
>                       c = cos x
>                   in S c (-a*s - c/2*(b.*.b)) ((-s)*.b)
>   exp (S x a b) = let e = exp x
>                   in S e (a*e + e/2*(b.*.b)) (e*.b)
>   sqrt (S x a b) = let s = sqrt x
>                   in S s (a/(2*s)-1/(4*s*s*s)*(b.*.b)) (1/(2*s)*.b)
>   log (S x a b) = let r = 1/x
>                   in S (log x) (r*a-r*r/2*(b.*.b)) (r*.b)
>   asin = undefined
>   acos = undefined
>   atan = undefined
>   sinh = undefined
>   cosh = undefined
>   tanh = undefined
>   asinh = undefined
>   acosh = undefined
>   atanh = undefined

A real example
Let's make this effort worthwhile. We'll compute errors for a computation that uses the errors in a messy nonlinear way. Suppose we're in the lab measuring radioactive decay. We measure the Geiger counter reading at times t = 0hr, 1hr, 2hr, 3hr, 4hr, at which point we compute an estimate for when the decay will drop to one tenth of its original value. We'll assume the decay fits a model counts/sec = a exp(-λt) and that the counts have an error with SD 0.05. We're going to compute the error in the estimated time to hit one tenth radioactivity in the case when λ = 0.5 per hour and a = 2:

> t = [0..4]
> counts = map (\i-> 2*exp(-0.5*fromIntegral i)+0.05*w i) t




We'll be fitting a curve using logarithmic regression so we'll need the following function. Given a pair of lists x and y it returns (m, c) where y=mx+c is the standard least squares fit.

> regress :: Fractional a => [a] -> [a] -> (a, a)
> regress x y =
>     let sx = sum x
>         sy = sum y
>         sxx = sum $ map (^2) x
>         sxy = sum $ zipWith (*) x y
>         n = fromIntegral (length x)
>         s = 1/(sx*sx-n*sxx)
>     in (s*sx*sy-s*n*sxy, -s*sxx*sy+s*sx*sxy)
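
It's worth a quick check that regress recovers a perfect line. This standalone snippet restates the definition above and fits y = 2x + 1 exactly:

```haskell
-- Standalone restatement of regress above, sanity-checked on the
-- perfect line y = 2x + 1.
regress :: Fractional a => [a] -> [a] -> (a, a)
regress x y =
    let sx = sum x
        sy = sum y
        sxx = sum $ map (^2) x
        sxy = sum $ zipWith (*) x y
        n = fromIntegral (length x)
        s = 1/(sx*sx - n*sxx)
    in (s*sx*sy - s*n*sxy, -s*sxx*sy + s*sx*sxy)

checkFit :: (Double, Double)
checkFit = regress [0,1,2] [1,3,5]   -- ≈ (2.0, 1.0)
```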

Logarithmic regression:

> (m, c) = regress (map fromIntegral t) (map log counts)
> lambda = -m
> a = exp c
> t_tenth = -log (0.1/a)/lambda

We can now go ahead and compute the mean and variance of our estimate:
*Main> mean t_tenth 
5.98036172868899
*Main> var t_tenth
0.15583537298560224
The correct time is about 5.991 so the regression method above is biased by about 0.01. If we repeated the same experiment over and over again and averaged the estimates we got from logarithmic regression the process would not converge to the correct result. In fact, we can compute "ground truth" by simulating the experiment a million times in Octave and estimating the mean and variance from that. The code is in the appendix. Obviously this is a much slower process but it clearly demonstrates the bias of using regression this way.
GNU Octave, version 3.4.0
Copyright (C) 2011 John W. Eaton and others.
ans =  5.9798
ans =  0.15948

Final thoughts
This is yet another example of extending automatic differentiation. We have variants for single variable differentiation, multivariate differentiation, multiple differentiation, divided differences, splitting a function into odd and even parts and now automatic error propagation.

This stuff was very loosely inspired by reading An Introduction to Stochastic Processes in Physics. I'm attempting to capture the semi-formal rules used in that book to reason about differentials and you can think of the algebra above as representing stochastic differentials. I made a guess that the algebra is called the Itō algebra. Sure enough, you'll get a few hits.

The most similar published work I can find is Automatic Propagation of Uncertainties but it seems to just use ordinary AD.

This technique may be useful for Extended Kalman Filtering.

I haven't done the work to make precise statements about how accurate you can expect my estimates of expectations to be.

It's possible to implement a monad with syntax similar to other probability monads by using state to bump up the i in w i each time you generate a new random variable. But bear in mind, these are always intended to be used as *infinitesimal* random variables.

Appendix: Octave code
m = 5;
n = 1000000;

x = repmat([0:m-1]',1,n);
y = repmat([2*exp(-0.5*[0:m-1]')],1,n)+0.05*normrnd(0,1,m,n);

sx = sum(x);
sxx = sum(x.*x);
p = sum(log(y));
q = sum(x.*log(y));

s = 1./(sx.*sx-m*sxx);
m = s.*sx.*p-m*s.*q; # Redefined
c = -s.*sxx.*p+s.*sx.*q;

lambda = -m;
a = exp(c);
x_tenth = -log(0.1./a)./lambda;

mean(x_tenth)
var(x_tenth)

Saturday, July 23, 2011

Profunctors in Haskell

> {-# LANGUAGE TypeSynonymInstances, RankNTypes, ExistentialQuantification #-}

Introduction
When I wrote about coends a while back I made up a term 'difunctor'. More recently it was pointed out to me that the correct word for this concept is 'profunctor', but unfortunately my knowledge came from Mac Lane, which mentions that word nowhere.

Profunctors are ubiquitous in Haskell programming. Probably the most natural definition of Hughes Arrows is via profunctors. Profunctors also play a role a little like tensors leading to a use of the terms 'covariant' and 'contravariant' that looks remarkably like the way those terms are used in tensor calculus.

For categories C and D, a profunctor is a functor Dᵒᵖ × C → Set and is written C ↛ D. (I hope that arrow between C and D is in your font. It's missing on iOS.)

I'll reuse my Haskell approximation to that definition:

> class Profunctor h where
>   lmap :: (d' -> d) -> h d c -> h d' c
>   rmap :: (c -> c') -> h d c -> h d c'

We need cofunctoriality for the first argument and functoriality for the second:

lmap (f . g) == lmap g . lmap f
rmap (f . g) == rmap f . rmap g

(Strictly we probably ought to call these 'endoprofunctors' as we're only really dealing with the category of Haskell types and functions.)

There are lots of analogies for thinking about profunctors. For example, some people think of them as generalising functors in the same way that relations generalise functions. More specifically, given a function f:A→B, f associates to each element of A a single element of B. But if we want f to associate elements of A with elements of B more freely, for example 'mapping' elements of A to multiple elements of B, then we instead use a relation, which can be written as a function f:A×B→{0,1} where we say xfy iff f(x,y)=1. In this case, profunctors map to Set rather than {0,1}.

A good example is the type constructor (->):

> instance Profunctor (->) where
>   lmap f g = g . f
>   rmap f g = f . g

It's common for the first argument of a profunctor to describe how an element related to a type is sucked in, and for the second to describe what is spat out. a -> b sucks in an a and spits out a b.

Given a function f we can turn it into a relation by saying that xfy iff y=f(x). Similarly we can turn a functor into a profunctor. Given a functor F:C→D we can define a profunctor F*:C↛D by

> data UpStar f d c = UpStar (d -> f c)
> instance Functor f => Profunctor (UpStar f) where
>   lmap k (UpStar f) = UpStar (f . k)
>   rmap k (UpStar f) = UpStar (fmap k . f)

You may be able to see how the second argument to a profunctor plays a role similar to the return value of a functor, just as the second argument to a relation sometimes plays a role similar to the return value of a function.

There is also an opposing way to make a profunctor from a functor, just as there is with functions and relations:

> data DownStar f d c = DownStar (f d -> c)
> instance Functor f => Profunctor (DownStar f) where
>   lmap k (DownStar f) = DownStar (f . fmap k)
>   rmap k (DownStar f) = DownStar (k . f)

Note that the identity functor gives us something isomorphic to (->) whether you use UpStar or DownStar.
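
To spell out that last claim for UpStar, here's a standalone sketch witnessing the isomorphism (the names to and from are just for illustration; DownStar is entirely analogous):

```haskell
-- Witnessing UpStar Identity d c ≅ (d -> c), a sketch.
newtype Identity a = Identity { runIdentity :: a }

instance Functor Identity where
  fmap f (Identity x) = Identity (f x)

data UpStar f d c = UpStar (d -> f c)

to :: (d -> c) -> UpStar Identity d c
to f = UpStar (Identity . f)

from :: UpStar Identity d c -> d -> c
from (UpStar f) = runIdentity . f

-- from . to and to . from are both the identity, e.g.:
-- from (to (+1)) 41  ==  42
```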

Dinatural transformations
Just as we have natural transformations between functors, we have dinatural transformations between profunctors. My previous definition of dinatural was specialised to a particular case - dinaturals between a profunctor and the constant profunctor.

Firstly, let's think about natural transformations. If F and G are functors, and h is a natural transformation h:F⇒G, then we have that
h . fmap f = fmap f . h
If we think of F and G as containers, then this rule says that a natural transformation relates the structures of the containers, not the contents. So using f to replace the elements with other elements should be invisible to h and hence commute with it.

Something similar happens with dinatural transformations. But this time, instead of relating the argument to a natural transformation to its return result, it instead relates the two arguments to a profunctor.

Given two profunctors, F and G, a dinatural transformation is a polymorphic function of type:

> type Dinatural f g = forall a. f a a -> g a a

but we also want something analogous to the case of natural transformations. We want to express the fact that if phi :: Dinatural F G, then phi doesn't see the elements of F a a or G a a. Here's a way to achieve this. Suppose we have a dinatural transformation:
phi :: Dinatural F G
and a function f :: X -> X' then we can use lmap to apply f on the left or right of F and G. The definition of dinaturals demands that:
rmap f . phi . lmap f = lmap f . phi . rmap f
ie. that we can apply f on the left before applying phi, and then do f on the right, or vice versa, and still get the same result.

I'm not sure but I think that we don't need to check this condition and that just like the case of naturals it just comes as a free theorem.

Composing profunctors
It's easy to see how to compose functors. A functor is a polymorphic function from one type to another. It's not so straightforward to compose profunctors. It's tempting to say that a profunctor maps a pair of types to a type so they can be composed like functions. But the original definition says a profunctor is a functor Dᵒᵖ × C → Set. So as a function it doesn't map back to the category but to Set. For Haskell we replace Set with Hask, the category of Haskell functions and types. So we have Haskᵒᵖ × Hask → Hask. It's easy to invent a scheme to compose these because Hask appears 3 times. But it'd be wrong to exploit this in a general definition applying to many categories because in the proper definition of profunctor we can't assume that a profunctor maps back to the spaces you started with.

We can try composing profunctors by analogy with composing relations. Suppose R and S are relations. If T = S○R is the composition of R and S then xTz if and only if there exists a y such that xRy and ySz. If our relations are on finite sets then we can define T(x,z) = Σ_y R(x,y)S(y,z), where we work in the semiring on {0,1} with 0+0=0, 0+1=1+0=1+1=1 but with the usual product.
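
As a concrete reference point, composing finite relations looks like this (a small illustrative sketch; the names compose and divides are not from the post):

```haskell
-- Relations on finite sets, composed by searching for a witness y
-- such that x R y and y S z.
type Rel a b = a -> b -> Bool

compose :: [b] -> Rel a b -> Rel b c -> Rel a c
compose ys r s x z = or [r x y && s y z | y <- ys]

-- Example relation: a divides b.
divides :: Rel Int Int
divides a b = b `mod` a == 0

-- Is there a y in [1..10] with 2 dividing y and y dividing 12?
-- Yes: y = 4 (or 2, or 6).
test1 :: Bool
test1 = compose [1..10] divides divides 2 12
```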

There is an analogue of "there exists" in Haskell - the existential type. Remembering that we write Haskell existential types using forall we can define:

> data Compose f g d c = forall a.Compose (f d a) (g a c)

As mentioned above, functors give rise to profunctors. It'd be good if composition of functors were compatible with composition of profunctors. So consider
Compose (UpStar F) (UpStar G)
for some F and G. This is essentially the same as
exists a. (d -> F a, a -> G c)
What can we discover about an element of such a type? It consists of a pair of functions (f, g), but we can't ever extract the individual functions because the type of a has been erased. To get anything meaningful out of g we need to apply it to an a, but we don't have one immediately to hand, after all, we can't even know what a is. But we do have an F a if we can make a d. So we can use fmap to apply g to the result of f. So we can construct fmap g . f :: d -> F (G c). There is no other information we can obtain. So the composition is isomorphic to UpStar of the functorial composition of F and G. Again, we can probably make this a rigorous proof by making use of free theorems, but I haven't figured that out yet.

But there's a catch: I said I wanted a definition that applies to more categories than just Hask. Well, we can replace exists a with the coend operator. We also implicitly used the product operation in the constructor Compose, so this definition will work in categories with suitable products. Symmetric monoidal categories, in fact.

Under composition of profunctors, (->) is the identity. At least up to isomorphism. This composition of profunctors is also associative up to isomorphism. Unfortunately the "up to isomorphism" means that we can't make a category out of profunctors in the obvious way. But we can make a bicategory - essentially a category where we have to explicitly track the isomorphisms between things that are equal in ordinary categories.
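
We can witness half of that identity claim in code. This standalone sketch restates the Profunctor class and Compose from above, and uses lmap/rmap to absorb an (->) on either side (absorbL/absorbR are names for this illustration; the inverse direction is just Compose id):

```haskell
{-# LANGUAGE ExistentialQuantification #-}

class Profunctor h where
  lmap :: (d' -> d) -> h d c -> h d' c
  rmap :: (c -> c') -> h d c -> h d c'

instance Profunctor (->) where
  lmap f g = g . f
  rmap f g = f . g

data Compose f g d c = forall a. Compose (f d a) (g a c)

-- Absorb an (->) on the left or right using lmap/rmap:
absorbL :: Profunctor g => Compose (->) g d c -> g d c
absorbL (Compose f g) = lmap f g

absorbR :: Profunctor f => Compose f (->) d c -> f d c
absorbR (Compose f g) = rmap g f

-- e.g. with g = (->): \x -> (x+1)*2
ex :: Int -> Int
ex = absorbL (Compose (+1) (*2))
```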

Profunctors as tensors
Given a profunctor F we can write F i j suggestively as Fᵢʲ. Let's write the composition of F and G as ∃k. Fᵢᵏ Gₖʲ. We can use the Einstein summation convention to automatically 'contract' on pairs of upper and lower indices and write the composition as Fᵢᵏ Gₖʲ. The analogy is even more intriguing when we remember that in tensor notation, the upper indices are covariant indices and the lower ones are contravariant indices. In the case of profunctors, the two arguments act like the arguments to covariant and contravariant functors respectively. Note also that because (->) is essentially the identity, we have (→)ᵢʲ Fⱼᵏ = Fᵢᵏ. So (->) acts like the Kronecker delta. You can read more about this at mathoverflow where it is hinted that this analogy is not yet well understood. Note that we're naturally led to the trace of a profunctor: exists a. F a a.

Arrows as profunctors
The last thing I want to mention is that Hughes' Arrows are profunctors. There is an intuition that fits. If A is an Arrow, we often think of A d c as consuming something related to type d and emitting something related to type c. The same goes for profunctors. The full paper explaining this is Asada and Hasuo's Categorifying Computations into Components via Arrows as Profunctors with the profunctorial definition of Arrows given as Definition 3.2 (though that definition also appears in some earlier papers.)

Update: One very last thing
When I wrote this article I didn't know how useful this type class would be. So it's neat to see, 7 years later, that profunctors appear in the source code to Minecraft.

Saturday, July 16, 2011

The Infinitude of the Primes

Introduction

> import Data.List hiding (intersect)

A colleague at work reminded me of Fürstenberg's topological proof of the infinitude of primes. But as I tried to argue a while back, topology is really a kind of logic. So Fürstenberg's proof should unpack into a sort of logical proof. In fact, I'm going to unpack it into what I'll call the PLT proof of the infinitude of the primes. I apologise in advance that I'm just going to present the unpacked proof, and not how I got there from Fürstenberg's.

A small formal language
We're going to start with a little language. Propositions of this language are of type Prop:

> data Prop = Modulo Integer Integer

The intention is that Modulo k n is the property of an integer being equal to k modulo n. More precisely, it represents the property of being writable in the form sn+k for some s. (We disallow n=0.) But I also want to allow people to combine properties using "and" and "or". So we extend the language with:

>           | Or [Prop] | And [Prop] deriving Show

The intention now is that And ... holds when all of the properties in the list hold and similarly for Or .... We can write an interpreter to test whether integers have the specified property:

> eval (Modulo k n) x = (x-k) `mod` n == 0
> eval (Or ps)      x = or $ map (\p -> eval p x) ps
> eval (And ps)     x = and $ map (\p -> eval p x) ps

(Note we limit ourselves to *finite* compositions of And and Or, otherwise eval wouldn't actually define a property due to non-termination.)

There are lots of things we can say in our language. For example we can give the 'extreme' properties that are never true or always true:

> never = Or []
> always = And []

We can say that one number is divisible by another:

> divisibleBy k = Modulo 0 k

We can test it with expressions like:
*Main> eval (divisibleBy 3) 9
True
*Main> eval (divisibleBy 5) 11
False

We can also express non-divisibility. We say that n isn't divisible by k by saying that n is either 1, 2, ..., or k-1 modulo k:

> notDivisibleBy n =
>     let n' = abs n
>     in Or (map (\i -> Modulo i n') [1..(n'-1)])

(Disallowing n=0.)

*Main> eval (notDivisibleBy 3) 9
False

Eliminating And
It's not obvious at first sight, but there is a big redundancy in our language. There is no need for And. Consider And [Modulo k1 n1, Modulo k2 n2]. This asserts, for the number x, that x = s*n1+k1 and x = t*n2+k2. The Chinese remainder theorem tells us that either these have no solution, or that this pair of propositions is equivalent to one of the form x = k3 mod n3 for some k3 and n3. So every time we And a pair of propositions we can eliminate the And by using the theorem. Solving for k3 and n3 is straightforward. I use the extended Euclidean algorithm and the proof of the Chinese remainder theorem given at Cut the Knot.

> egcd n1 n2 | n2 == 0 = (1, 0, n1)
>            | n1 == 0 = (0, 1, n2)
>            | otherwise = (y, x-y*(n1 `quot` n2), g)
>            where (x, y, g) = egcd n2 (n1 `mod` n2)

> intersect (Modulo k1 n1) (Modulo k2 n2) =
>     let (s, _, g) = egcd n1 n2
>         (q, r) = (k2-k1) `quotRem` g
>     in if r == 0
>         then Modulo (q*s*n1+k1) (n1*n2 `quot` g)
>         else never
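
A quick sanity check of intersect, restating the definitions above in a standalone snippet (with `never` replaced by Nothing so the snippet doesn't need the rest of Prop):

```haskell
-- Standalone restatement of egcd/intersect above, plus a check.
data Prop = Modulo Integer Integer deriving (Eq, Show)

egcd :: Integer -> Integer -> (Integer, Integer, Integer)
egcd n1 n2 | n2 == 0 = (1, 0, n1)
           | n1 == 0 = (0, 1, n2)
           | otherwise = (y, x - y*(n1 `quot` n2), g)
           where (x, y, g) = egcd n2 (n1 `mod` n2)

intersect :: Prop -> Prop -> Maybe Prop
intersect (Modulo k1 n1) (Modulo k2 n2) =
    let (s, _, g) = egcd n1 n2
        (q, r) = (k2 - k1) `quotRem` g
    in if r == 0
        then Just (Modulo (q*s*n1 + k1) (n1*n2 `quot` g))
        else Nothing                 -- `never` in the text

-- x = 2 mod 3 and x = 3 mod 5 combine to x = 8 mod 15:
check :: Maybe Prop
check = intersect (Modulo 2 3) (Modulo 3 5)   -- Just (Modulo 8 15)
```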

So now we can repeatedly use intersect pairwise on our properties to eliminate all uses of And. Here is some code to do so. Firstly, it's convenient to sometimes write any property as if it is a list of "subproperties", all Orred together:

> subproperties (Or ps) = ps
> subproperties p = [p]

Now we can go ahead and remove all of the Ands:

> removeAnd (Or ps) = Or (map removeAnd ps)

The property always can be rewritten as:

> removeAnd (And []) = Modulo 0 1

Remove And from the head of the list, remove it from the tail of the list, and then form all possible intersections of these two parts:

> removeAnd (And (p:ps)) = Or [q `intersect` q' |
>     q <- subproperties (removeAnd p),
>     q' <- subproperties (removeAnd (And ps))]
> removeAnd p = p

By induction, the return value from removeAnd can no longer contain an And. Note that the properties can grow in size considerably. Here is the proposition that x isn't divisible by 5 or 7 written out in full:
*Main> removeAnd (And [notDivisibleBy 5, notDivisibleBy 7])
Or [Modulo 1 35,Modulo 16 35,Modulo 31 35,Modulo 46 35,Modulo 61
35,Modulo 76 35,Modulo (-13) 35,Modulo 2 35,Modulo 17 35,Modulo 32
35,Modulo 47 35,Modulo 62 35,Modulo (-27) 35,Modulo (-12) 35,Modulo
3 35,Modulo 18 35,Modulo 33 35,Modulo 48 35,Modulo (-41) 35,Modulo
(-26) 35,Modulo (-11) 35,Modulo 4 35,Modulo 19 35,Modulo 34 35]

Now to the primes. Here's a standard way to make the list of primes in Haskell:

> isPrime primes n = foldr (\p r -> p*p > n || (rem n p /= 0 && r))
>                          True primes
> primes = 2 : filter (isPrime primes) [3..]

The Proof
Now we can give the proof this set is infinite. Suppose it were finite. Then we could form this property:

> prop = removeAnd $ And (map notDivisibleBy primes)

It contains no Ands, and so must simply be the Or of a bunch of Modulos. This Or isn't empty: prop holds at 1, since 1 isn't divisible by any prime. But each Modulo defines an infinite set, so prop must define an infinite set.

But prop is the property of not being divisible by any prime. So prop can only eval to True on -1 or 1, a finite set. Contradiction. Therefore primes is infinite.

We can look at approximations to prop like this:

> prop' n = removeAnd $ And (map notDivisibleBy (take n primes))

You can see that the proposition grows in size rapidly:
*Main> removeAnd (prop' 3)
Or [Modulo 1 30,Modulo (-83) 30,Modulo (-167) 30,Modulo (-251)
30,Modulo 71 30,Modulo (-13) 30,Modulo (-97) 30,Modulo (-181) 30]
*Main> removeAnd (prop' 4)
Or [Modulo 1 210,Modulo (-56159) 210,Modulo (-112319) 210,Modulo...]
Nonetheless, it would always be finite if there were only finitely many primes. As primes is infinite, you can think of the sequence prop' n as somehow trying to creep up on the set {-1, 1}, never quite getting there.

Unfortunately I have no time to explain why a topological proof should lead to one about a simple DSL beyond mentioning that there's a deeper story relating to the computability of eval for possibly infinite expressions of type Prop.

Addendum
I'll just say a little on the computability connection. Suppose we have a really dumb algorithm to test whether an integer x equals k mod n by doing a brute force search for s such that x=s*n+k. Suppose this is the only kind of test on x that we have available to us. The test will only terminate if it finds a solution. So with such an algorithm, testing for equality mod N is only semi-decidable. Now suppose we are allowed multi-threaded code. The infinitude of the primes implies that with our dumb tests, membership of {-1, 1} is also semi-decidable. So we can turn the problem of proving the infinitude of the primes into one about computability. You can see roughly how: we can "semi-test" Or ps by launching a process to test the first element of ps. Then launch a process to check the next, and so on. If any of these processes terminates, we have our answer. The argument presented above gives the details of how to construct a suitable Or ps.
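Here's a minimal sketch of that construction with a cut-down version of the Prop type (hypothetical names, and only the two constructors needed). Rather than real threads, it interleaves by growing the search bound across all the disjuncts at once; semiDecide terminates precisely when some disjunct holds, which is what semi-decidability means here:

```haskell
-- A cut-down Prop: just residue tests and disjunction.
data MiniProp = Modulo Integer Integer | Or [MiniProp]

-- The only primitive: the "dumb" bounded search for a witness s
-- with x == s*n + k, trying |s| <= bound.
dumbTest :: Integer -> Integer -> MiniProp -> Bool
dumbTest x bound (Modulo k n) = any (\s -> x == s*n + k) [-bound..bound]
dumbTest x bound (Or ps)      = any (dumbTest x bound) ps

-- Interleave by growing the bound forever. This returns True as soon as
-- some disjunct finds a witness, and loops otherwise: a semi-decision.
semiDecide :: Integer -> MiniProp -> Bool
semiDecide x p = any (\bound -> dumbTest x bound p) [0..]
```

For example, semiDecide 7 (Or [Modulo 1 3, Modulo 2 5]) terminates with True because 7 = 1*5 + 2; on a false proposition it would search forever, as a semi-decision procedure is allowed to.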

Saturday, June 25, 2011

An elementary way to approach Fourier transforms

Why another introduction to the Fourier transform?
There are many elementary introductions to the discrete Fourier transform on the web. But this one is going to be a little different. I hope to motivate and demonstrate an application of the Fourier transform starting from no knowledge of the subject at all. But apart from this sentence, I'll make no mention of complex numbers or trigonometric functions. That may seem weird - the standard definitions seem to make it clear that these concepts are crucial parts of the definition. But I claim that in some ways these definitions miss the point. An analogy from software engineering is appropriate: most definitions of the Fourier transform are a bit like explaining an interface to an API by reference to the implementation. That might tell you how it works at a nuts-and-bolts level, but that can obscure what the API is actually for.

There's code all the way through so if anything I've said is ambiguous, that should help resolve it. I've chosen to write the following in 'literate Octave'. I can't say I love the language but it seems like the easiest way to express what I want here computationally.

Convolution
Suppose your camera has a hexagonal iris and it's out of focus. When you take a picture of a point light source (towards your lower left) you'll end up with a result like this:


The effect is known as bokeh and the wikipedia page has a nice example with an octagonal iris.

(If you want to use the Octave code in this article, save the image above as hexagon.png.)

Suppose we take a picture of an ordinary scene instead. We can think of every visible point in the scene as a point light source and so the out-of-focus photograph will be the sum of many hexagons, one for each point in the image, with each hexagon's brightness determined by the brightness of the point it comes from. You can also flip this around and think about the out-of-focus image as being a sum of lots of copies of the original image, one for each point in the hexagon. We can go ahead and code this up directly in Octave.

Here's the picture that we'll apply bokeh to:


(Background: I used to work for a division of Kodak and that was the test image we frequently used. Save the image as marcie1.png.)

Let's read in our image and iris shape, converting the image to grayscale:

> I = double(imread('marcie1.png'))/256;
> I = (I(:,:,1)+I(:,:,2)+I(:,:,3))/3;
> J = double(imread('hexagon.png'))/256;


> h = size(I)(1);
> w = size(I)(2);

Now we'll apply the bokeh by looping over the entire iris, accumulating shifted copies of the original image. We're optimising a little bit. As many of the pixels in J are black we can skip over them. I'm using circshift so that we'll get wraparound at the edge. That turns out to be very convenient later.

> total = 0;
> K = zeros(h, w);
> for i = 1:h
>     i
>     for j = 1:w
>         if J(i, j) != 0
>             K = K + J(i, j)*circshift(I, [i, j]);
>             total = total + J(i, j);
>         endif
>     endfor
> endfor

We use total to scale the overall brightness back into a reasonable range:

> imshow(K/total)
> pause(5)

Make sure you understand what that code is doing because the rest of this article depends on it. The central line in the double loop is repeatedly adding copies of the original image, shifted by [i, j], and scaled by J(i, j).

There's just one problem with that code. It's incredibly slow. It's only tolerable because I know that most of J is zero so I could optimize it with a conditional. More general higher resolution images will leave you waiting for a long time.

The image we have computed above is known as the convolution of I and J. My goal is to show how we can use the Fourier transform to look at this in a very different way. As a side effect we will also get a much faster convolution algorithm - but the reason it runs faster is a story for another time. In this article I just want to show what Fourier transforms are and why they're relevant at all.

Fourier transforms
Central to the code above is the fact that we're shifting and adding the same image over and over again. If we could avoid that we might find a way to speed the code up. So let me define some shorthand. I'll use R to represent the function that shifts an image right one pixel. In Octave that's given by circshift(I,[0,1]). I'll use U to mean a shift up by one pixel. Rather than define *the* Fourier transform I'm going to define a family of operations, all called Fourier transforms. (More precisely, these are two-dimensional discrete Fourier transforms.)

(1) A Fourier transform is a linear operation that converts an image to another one of the same dimensions. This means that if you apply it to sums and multiples of images you get sums and multiples of the Fourier transforms. In the language of Octave, if F is such an operation, then F(A+B) == F(A)+F(B) and F(s*A) == s*F(A), for s a scalar.

(2) A Fourier transform has this property: there is a pair of images, A and B, such that for any image I, F(U(I)) = AF(I) and F(R(I)) = BF(I). (AF(I) means the pixelwise product of A and F(I)). In Octave notation:
F(circshift(I, [1,0])) == A .* F(I)
F(circshift(I, [0,1])) == B .* F(I)
Fourier transforms convert shifts to multiplications. (If you only learn one thing from this article, this should be it.)

(3) Fourier transforms are invertible so there must be another linear transform F⁻¹ such that F⁻¹(F(I)) = I and F(F⁻¹(I)) = I.

Anything with these properties is an example of a Fourier transform. It's the second property that is crucially important. In jargon it's said to diagonalise translation.

From (2) it follows that we can compute the Fourier transform of an image shifted by [i, j] by multiplying the original Fourier transform by A.^i .* B.^j. (.^ is the pixelwise power function.)

If h is the image height, then shifting up h times should wrap us around to the beginning again. Similarly for w. So from (2) we know that A^h = 1 and B^w = 1. (That's shorthand for saying that each of the individual elements of A raised to the power of h give 1 and so on.)
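These properties can be checked numerically without trusting anything about fft2. Here's a one-dimensional sketch in Haskell (naive and slow on purpose, with hypothetical names) showing that with the usual DFT convention, shifting before transforming is the same as multiplying afterwards by a fixed vector of phases - the 1-D version of the image B in property (2):

```haskell
import Data.Complex

-- A naive one-dimensional DFT, just big enough to test property (2).
dft :: [Complex Double] -> [Complex Double]
dft xs = [ sum [ x * exp (0 :+ (-2*pi*fromIntegral (j*k)/n))
               | (j, x) <- zip [0..] xs ]
         | k <- [0 .. length xs - 1] ]
  where n = fromIntegral (length xs)

-- Circular shift right by one, the 1-D analogue of circshift(I, [0,1]).
shiftR :: [a] -> [a]
shiftR xs = last xs : init xs

-- The candidate "B" of property (2): with this convention it is the
-- vector of powers of exp(-2*pi*i/n), independent of the input.
bImage :: Int -> [Complex Double]
bImage n = [ exp (0 :+ (-2*pi*fromIntegral k / fromIntegral n))
           | k <- [0 .. n-1] ]
```

Note that each element of bImage raised to the power n is 1, which is exactly the wraparound observation above.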

It just so happens that Octave comes supplied with a suitable function that satisfies the three conditions I listed. It's called fft2. Let's find out what the corresponding images A and B are:

> r = rand(h, w);
> A = fft2(circshift(r, [1, 0])) ./ fft2(r);
> B = fft2(circshift(r, [0, 1])) ./ fft2(r);

(Note that ./ is the pixelwise division operator. If you're unlucky your random numbers will lead to a division by zero. Just try again.)

Let's try again for another random image:

> s = rand(h, w);
> A0 = fft2(circshift(s, [1, 0])) ./ fft2(s);
> B0 = fft2(circshift(s, [0, 1])) ./ fft2(s);

We can see that A is almost exactly A0 and B is almost B0:

> abs(A-A0)
> abs(B-B0)

So now we can go back to our original convolution algorithm. Let's define

> II = fft2(I);

And now we can use the first and second properties to compute the Fourier transform of the image with bokeh applied to it. We're applying properties (1) and (2) to the central line of the double loop above:

> KK = zeros(h, w);
> for i = 1:h
>     i
>     for j = 1:w
>         if J(i, j) != 0
>             KK = KK + J(i, j).*A.^i.*B.^j.*II;
>         endif
>     endfor
> endfor

I said that the definition of Fourier transform requires the existence of an inverse, and claimed that fft2 was a Fourier transform. So we must have an inverse. Octave conveniently provides us with the name ifft2:

> K = ifft2(KK/total);
> imshow(K)
> pause(5)

We've eliminated all that shifting, but at the end of the day this code is slower. But did you notice something curious about the innermost line of the double loop? It's always adding the same multiple of II to KK. We can completely factor it out. So we can rewrite the code as:

> JJ = zeros(h, w);
> for i = 1:h
>     i
>     for j = 1:w
>         if J(i, j) != 0
>             JJ = JJ + J(i, j).*A.^i.*B.^j;
>         endif
>     endfor
> endfor

We can leave the multiplication by II all the way to the end:

> KK = JJ .* II;
> K = ifft2(KK/total);
> imshow(K)
> pause(5)

This is pretty cool. We can precompute JJ. Any time we want to convolve an image with the hexagon we apply fft2 to the image, multiply by JJ and then apply ifft2. But there's something even better going on. Let's look more closely at that double loop above involving powers of the elements of A and B. Let's write out the function it computes in standard mathematical notation:

f(J) = Σᵢ,ⱼ Aⁱ Bʲ Jᵢ,ⱼ

What is f(U(J))?

f(U(J)) = Σ Aⁱ Bʲ Jᵢ₋₁,ⱼ (by definition of shifting up)
f(U(J)) = Σ Aⁱ⁺¹ Bʲ Jᵢ,ⱼ (by summing over i+1 instead of i and using wraparound)
f(U(J)) = A f(J)

Similarly,

f(R(J)) = B f(J)

In other words, f satisfies the second property of Fourier transforms. It obviously satisfies the first property. Our transform f looks like a Fourier transform. In fact, it is, and Octave's fft2 is defined this way. So now we have this procedure for convolving:

> II = fft2(I);
> JJ = fft2(J);
> KK = II .* JJ;
> K = ifft2(KK/total);

> imshow(K)
> pause(5)

We now have a fast convolution algorithm. Of course I've left out an important point: fft2 and ifft2 are fast. They don't use the obvious summation algorithm suggested by my definition of f. But that's an implementation detail. We're able to reason successfully about important properties of fft2 using the properties I listed above.
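The whole pipeline can be verified end to end in one dimension with the naive transform (a sketch with hypothetical names; real code would of course use a fast library transform): direct circular convolution agrees with transform, pointwise multiply, inverse transform.

```haskell
import Data.Complex

-- Naive one-dimensional DFT and its inverse.
dft :: [Complex Double] -> [Complex Double]
dft xs = [ sum [ x * exp (0 :+ (-2*pi*fromIntegral (j*k)/n))
               | (j, x) <- zip [0..] xs ]
         | k <- [0 .. length xs - 1] ]
  where n = fromIntegral (length xs)

idft :: [Complex Double] -> [Complex Double]
idft xs = [ sum [ x * exp (0 :+ (2*pi*fromIntegral (j*k)/n))
                | (j, x) <- zip [0..] xs ] / n'
          | k <- [0 .. length xs - 1] ]
  where n  = fromIntegral (length xs)
        n' = fromIntegral (length xs)

-- Direct circular convolution: the 1-D analogue of the shift-and-add loop.
convDirect :: [Complex Double] -> [Complex Double] -> [Complex Double]
convDirect xs ys = [ sum [ xs !! j * ys !! ((k - j) `mod` n) | j <- [0..n-1] ]
                   | k <- [0..n-1] ]
  where n = length xs

-- Convolution via the transform: transform both, multiply pointwise, invert.
convViaDFT :: [Complex Double] -> [Complex Double] -> [Complex Double]
convViaDFT xs ys = idft (zipWith (*) (dft xs) (dft ys))
```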

Conclusion
Let me recapitulate so you can see the shape of the argument:

1. I defined Fourier transforms
2. I showed how convolution could be rewritten using one
3. I told you that fft2 was a Fourier transform, giving an alternative algorithm for convolution
4. I showed that another part of the convolution algorithm looks like a Fourier transform. (I didn't prove it had an inverse.)
5. I told you that (by amazing coincidence!) this other part is also fft2.

I haven't shown you how to implement a Fourier transform. But I have shown you how you can reason about many of their properties. Enough to get much of the way to the convolution theorem.

Approaching Fourier transforms through the properties I listed is common in the more advanced mathematical literature. But for some reason, in the elementary literature people often choose to describe things in a more complicated way. This is true of many things in mathematics.

I hope the above serves as useful motivation when you come to check out a more standard and more complete introduction to Fourier transforms. In particular, now's a good time to try to understand why the usual definition of the discrete Fourier transform satisfies the properties above and so fill in the steps I missed out.

Note
You may need to set up Octave for graphics. On MacOSX I started the X server and used "export GNUTERM=x11" before running Octave.

Saturday, June 04, 2011

Simulating visual artifacts with Fourier optics

The problem

One of the fun things about working in the visual effects industry is the large number of disciplines involved, especially when working for a smaller company where everyone has to do everything. At one point many years ago we did some work on simulating camera artifacts including diffraction spikes. Although the wikipedia page is in the context of astronomy you can get diffraction artifacts with any kind of camera, and you'll get spikes whenever the iris has a polygonal shape. I hope to sketch why later.

I noticed an intriguing question about a diffraction pattern on Quora. (For those not following the link: this photo is a picture of a reflection of the photographer and camera in an LCD screen.) My immediate thought was "how could I have faked this in a render?".



My main interest was in the 'X' shape. And the first question is this: how do I know this is a diffraction effect? It might be that there are specular reflections off periodic structures in the LCD display orthogonal to the arms of the 'X'. This would explain everything with only geometric optics.

There are two big clues I know of that diffraction is at work: we have separation of white light into separate colours. This is typical of diffraction effects. We also see a discrete structure: rays directed along a discrete set of directions in addition to the arms of the 'X'. This is typical of diffraction from periodic microstructures, and again I hope to sketch why later.

So how can we explain the appearance of this photograph?

A simple model

If we consider diffraction effects we need to model light as waves. Let's start by working with just one frequency of light from the camera flash. Let's also suppose that the flash is far enough away from the monitor that the spherical wavefronts can be modelled as plane waves when they meet the monitor. Here's a picture:
The thick black vertical line is the monitor screen. The other vertical lines are the incoming plane waves. And the red arrow shows the direction of motion. Huygens' principle tells us that when the waves reflect off the monitor surface we can treat the entire monitor surface as a collection of points, each of which is emitting spherical waves equally in all directions. If you're unfamiliar with Huygens' principle it may seem wrong to you. When you shine a light at a mirror we get reflection along one particular direction, not in all directions. That's true, but we can view this fact as a consequence of interference between all of the little points on the surface emitting in all directions. Again, that's another thing that will emerge from the calculation below.

The question I now want to answer is this: how much light gets reflected in direction k? I'll assume we're observing things from far enough away that we can neglect the details of the geometry. We'll just compute the contribution in direction k (a unit vector (l, m, n)) from all of the points on our surface. We'll assume that because of variations on the surface, the intensity of the emission varies along the surface. We'll model the emission as a function I, so I(y, t) is the emission from position y at time t. We want to collect the light going in direction k through some light collector at time t. Call that J(k,t). Here's a picture:
Remember: there are waves going in all directions, but I've just drawn the waves going in direction k.

Fourier optics for beginners

Now comes the interesting bit: the light arriving at the collector is the integral of all the light emitted from the points on the monitor surface. But the time of flight from each point on the monitor surface is different. So when we perform our integral we need to build in a suitable delay. It's straightforward geometry to show that the time delay between waves arriving from y₁ and y₂ is m(y₁−y₂)/c where c is the speed of light and m, as defined above, is the y-component of k. So, up to a constant of proportionality, and some absolute choice of time origin, we want this integral

J(k,t) = ∫ dy I(y,t-my/c)

We assumed that the monitor surface was being struck by plane waves. Assuming they were orthogonal to the surface and coherent this means that I is proportional to exp(iωt). The time dependence in J is just the sinusoid, so we drop that from further calculations. So we can write, ignoring another constant of proportionality,

J(k) = ∫ dy I(y)exp(-iωmy/c) = ∫ dy I(y)exp(-i2πmy/λ)

Where I used ω = 2πc/λ and λ is the wavelength of the light. In other words, J is the Fourier transform of I. More generally, if we modelled the z-axis we'd find that J is given by the 2D Fourier transform of I as a function of the 2D surface of the monitor. The actual power striking the sensor is given by the absolute value of the square of J.

Ordinary reflections

So now I can pop the first of my pending explanations off the stack. Suppose that the surface is completely uniform. Then I(y) = constant. The Fourier transform of a constant function is the Dirac delta function. In other words - despite the fact that we're modelling every point on the surface as its own emitter, the resulting reflection from a perfectly smooth surface is a wave going straight back where it came from. (Exercise: modify the above treatment to show that waves striking a mirror at an angle obey the usual reflection law for mirrors.)

Periodic structure

Now suppose that the surface of our monitor has a periodic structure. The Fourier transform of a periodic function is concentrated at spikes corresponding to the fundamental frequency of the structure and its overtones. So we expect to see a grid-like structure in the result. I believe that's what we're seeing in the spray of quantised vectors coming from the center of the image and the fact that the arms of the 'X' look like discrete blobs of colour arranged in a line rather than the white spikes in the astronomical example I linked to above. That tells us we're probably seeing artifacts caused by the microstructure of the LCD pixels.

An edge

Consider the function I(y,z) = -1 for y<0 and 1 for y>0. In other words, the sgn function along the y-axis and constant along the z-axis. Up to multiplication by a constant, its Fourier transform is given by the Dirac delta along the z-axis and 1/ω along the y-axis. In other words, it's a spike, starting at the origin, extending along the y-axis, but fading away as we approach infinity. This is the source of diffraction spikes. Any time we see such a spike it's likely there was a perpendicular edge somewhere in the function I. For telescopes it often comes from the struts supporting the secondary mirror. For cameras it comes from the polygonal shape of the iris. In this case, it looks like we must have pixels whose shape has edges perpendicular to the 'X'. I have to admit, when I came to this conclusion it sounded very implausible. But then someone posted this image from here:
There are those edges at the predicted angles.

Simulating the artifact

Now we have enough information to simulate the artifact. You can think of the remainder of this article as written in "literate octave".

Save the above close-up of the pixels in a file and read it in.

I = imread('lcd-pattern.jpg');

I crop out one quarter as it simply repeats. This also removes the text.

I1 = I(251:500,251:500,1);

Now clean up some stray (image) pixels and pick out just one (LCD) pixel.

I1(1:250,100:250) = 0;

The photo is blurry, probably has compression artifacts and seems quite noisy to me. So I threshold it at intensity 200 to get a hard shape. I then compute the discrete Fourier transform and scale the intensity to something nice.

I2 = 0.0001*abs(fft2(I1>200).^2);

The Fourier transform puts the origin in the corner so I shift it to the centre.

I2 = circshift(I2,[125,125]);

Now we need to do the same thing for red and green. We reuse the red image but it needs scaling in proportion to the wavelength. The numbers 1.2 and 1.4 are intended to be the ratio of the wavelengths to that of blue light. They're just ballpark figures: I chose them to make the image sizes convenient when rescaled. I also crop out a piece from the centre of the rescaled images so they line up nicely. I use imresize2 from here.

Ib = I2;
Ig = imresize2(Ib,1.2,'nearest')(26:275, 26:275);
Ir = imresize2(Ib,1.4,'nearest')(51:300, 51:300);

Now I assemble the final RGB image.

J = zeros(250, 250, 3);
J(:, :, 1) = Ir;
J(:, :, 2) = Ig;
J(:, :, 3) = Ib;
imwrite(J,'cross.jpg')

Here's the result:


Analysis

Not bad, but it's not perfect. That's unsurprising. Have a look at the image straight after thresholding. It's a very poor representation of the pixel shape. It would be better to redraw the shape correctly at a higher resolution (or even work analytically, which is not infeasible for simple shapes like these pixels). Stray (image) pixels can make a big difference in Fourier space. That explains the 'dirtiness' of my image. Nonetheless, it has the big X and it has the rows of dots near the middle. Qualitatively it's doing the right thing.

Note that the big X in the middle seems to have ghost reflections at the left and right. These are, I think, aliasing and purely artifacts of the digitisation.

It'd look slightly better with some multi-spectral rendering. I've treated the pixels as reflecting one wavelength but actually each one reflects a band.

There are also, almost certainly, lots more effects going on in that picture. The close-up photograph is only revealing one part of the structure. I'm sure there is 3D structure to those pixels, not just a flat surface. I suspect that's where the horizontal white line is coming from. Because it's white it suggests an artifact due to something interacting equally with all visible wavelengths. Maybe a vertical edge between pixels.

But overall I think I've shown that Fourier optics is a step in the right direction for simulating this kind of effect.

Note

Everything I know about Fourier optics I learnt from my ex-colleague Oliver James.

All the errors are mine of course.

Saturday, May 28, 2011

Fast forwarding lrand48()

A break from abstract nonsense to answer a question I've seen asked online a number of times. It requires nothing more than elementary modular arithmetic and it ends in some exercises.

Given a pseudo-random number generator, say BSD Unix lrand48(), is there a quick way to jump forward a billion numbers in the sequence, say, without having to work through all of the intermediate numbers? The method is no secret, but I couldn't find explicit code online so I thought I'd put some here. Literate Haskell of course.

> {-# LANGUAGE ForeignFunctionInterface #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}

On MacOSX, if you type 'man lrand48', you'll see the function lrand48() returns a sequence of 31-bit non-negative integers defined using the recurrence rₙ₊₁ = a·rₙ + c mod m where

> a = 25214903917
> c = 11
> m = 2^48

The actual returned value is the floor of rₙ/2¹⁷ and r₀ = 20017429951246.

We can compute the nth element in the sequence the hard way by importing lrand48 and looping n times:

> foreign import ccall "lrand48" lrand48 :: IO Int

> nthrand 1 = lrand48
> nthrand n = lrand48 >> nthrand (n-1)

But there is a better way. If we iterate twice we get rₙ₊₂ = a(a·rₙ+c)+c mod m = a²rₙ+ac+c mod m. Note how two applications of the iteration give you back another iteration of the same form: a multiplication followed by an addition modulo m. We can abstract this a bit. Given two functions f(x) = ax+c mod m and g(x) = a'x+c' mod m we get g(f(x)) = (a'*a)*x + a'*c+c' mod m. We can represent functions of this type using a simple Haskell type:

> data Affine = Affine { multiply :: Integer, add :: Integer } deriving (Show, Eq, Ord)

We can now write a function to compose these functions. I'm going to use the operator * to represent composition:

> instance Num Affine where
>    Affine a' c' * Affine a c = Affine (a'*a `mod` m) ((a'*c+c') `mod` m)

To skip forward n steps we just need to multiply n of these together, ie. raise Affine a c to the power of n using ^. We then need to apply this function to r0:

> initial = Affine 0 20017429951246

> nthrand' n = (add $ Affine a c ^ n * initial) `div` (2^17)

Now try firing up ghci and comparing the outputs of nthrand 1000000 and nthrand' 1000000. Don't run nthrand more than once without resetting the seed, eg. by restarting ghci. (I know someone will post a reply below that it doesn't work...)

There are lots of papers on how to do this with other kinds of random number generator. My example is probably the easiest. The main application I can see is for jumping straight to that annoying regression test failure without going through all of the intermediates.
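Since calling lrand48 itself needs the FFI, here's a pure Haskell sketch of the same idea that you can check without resetting any seeds. It duplicates the constants above, writes out explicitly the repeated squaring that ^ performs behind the scenes, and compares internal states rather than the floor of rₙ/2¹⁷ that lrand48 actually returns (the helper names are just for this illustration):

```haskell
-- Constants from the man page quoted above.
a, c, m, r0 :: Integer
a  = 25214903917
c  = 11
m  = 2^48
r0 = 20017429951246

-- One step of the generator's internal state.
step :: Integer -> Integer
step r = (a*r + c) `mod` m

-- State after n steps, the slow way.
naive :: Int -> Integer
naive n = iterate step r0 !! n

-- Fast-forward: compose n copies of x -> a*x + c by repeated squaring,
-- exactly the trick that ^ performs for the Affine Num instance.
affinePow :: Int -> (Integer, Integer)
affinePow 0 = (1, 0)
affinePow n
  | even n    = let (a', c') = affinePow (n `div` 2)
                in ((a'*a') `mod` m, (a'*c' + c') `mod` m)
  | otherwise = let (a', c') = affinePow (n - 1)
                in ((a*a') `mod` m, (a*c' + c) `mod` m)

fastForward :: Int -> Integer
fastForward n = let (an, cn) = affinePow n in (an*r0 + cn) `mod` m
```

Fast-forwarding a million steps this way takes about 20 multiplications instead of a million.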

Exercises.
1. Read the corresponding man page for Linux. Port the above code to work there. Or any other OS you feel like. Or any other random number generator.
2. Can you split lrand48() into two? I.e. can you make two random generators that produce sequences sᵢ and tᵢ so that s₀, t₀, s₁, t₁, ... form the sequence given by lrand48().
3. I've neglected to mention some special sauce in the code above. Why does it actually run so fast? (Clue: why did I use Num?)

Saturday, April 30, 2011

Perturbation confusion confusion

Update: I'm making a bit of a turnabout here.

Firstly, I have to point out that at no point have I disagreed with S&P on any purely technical issue. We're all working with the same code and agree on what it does. The issue is a human one: is it possible to get wrong results by writing code that *looks* correct. I was sent this example:

d (\x -> (d (x*) 2)) 1

Read naively it produces a different result to what you might expect. We can see why it fails by looking at this:

d (\x -> (d (x*) 2)) (1::Integer)

It fails to typecheck! That "1" is actually of type D Integer. So the semantics of that code are entirely different to what you might expect if you read it like an ordinary mathematical expression and ignored the types of all the subexpressions.

So I agree that it is possible for a programmer to misread that code. I still don't consider the algorithm to have failed in this case. It is standard in Haskell that the appearance of an integer in code doesn't necessarily mean it's of Integer type, so that when we write Haskell code we always need to be aware of the types of all of our terms. When we write d (\x -> (d (x*) 2)) 1, we're asking for the wrong thing before the code has started executing. But I have been convinced that there are dangers here for people reading the code naively.

However I am now completely convinced the situation is more complex than this and I'll have to address it again.

Anyway, I suggest using AD regardless. It's an awesomely powerful technique. But don't capture variables in a lambda or function without immediately wrapping them in a lift, and if you multiply nest, you need to multiply lift. Essentially lift is a signal to the compiler that you desire the variable to be held constant with respect to the derivative. It also makes the code more readable. It's analogous to using fmap the correct number of times in order to push a function down into nested lists.

So I'll leave the following in place because removing it would be bad form and some of it still stands.

Introduction
In a recent paper, Siskind and Pearlmutter warn how the use of differentiation operators in functional programming languages is "fraught with danger" and discuss a problem common to "all attempts to integrate a forward-mode AD operator into Haskell".

This is curious. I have had great success using automatic differentiation code both in Haskell and functional-style C++ and failed to notice the danger. Clearly I needed to take better notice of what I was doing.

So let's go ahead and implement AD and try to reproduce the problem they point out.

Automatic Differentiation

> data D a = D { real :: a, infinitesimal :: a } deriving (Eq, Show)

> instance Num a => Num (D a) where
>   fromInteger n = D (fromInteger n) 0
>   D a a'+D b b' = D (a+b) (a'+b')
>   D a a'*D b b' = D (a*b) (a*b'+a'*b)

We can now define a differentiation operator:

> d f x = infinitesimal (f (D x 1))

We can use d to differentiate a function like f(x) = x³+2x²+x+1 at 2 to get:

> example0 =  d (\x -> x^3+2*x^2+x+1) 2

Imagine you were confused enough by AD to write Siskind and Pearlmutter's example exactly as described in equation (2) of the paper:
example1 =  d (\x -> x*(d (\y -> x+y) 1)) 1
We don't get an incorrect result. Instead, we get this error message:
Occurs check: cannot construct the infinite type: a0 = D a0
Expected type: D (D a0)
  Actual type: D a0
In the first argument of `(+)', namely `x'
In the expression: x + y
The Haskell type checker identifies precisely where the problem is. x and y don't have the same types so they can't be added. Rather than immediately analyse how to fix this, let's try a completely different approach to calculus: symbolic differentiation. We'll define an expression type and write code to differentiate it.

Symbolic Differentiation

> data E a = X | Const a | E a :+: E a | E a :*: E a deriving (Eq, Show)
> diff X = 1
> diff (Const _) = 0
> diff (a :+: b) = diff a + diff b
> diff (a :*: b) = a:*: diff b + diff a :*: b

We want to be able to evaluate these expressions:

> eval X x = x
> eval (Const a) x = a
> eval (a :+: b) x = eval a x + eval b x
> eval (a :*: b) x = eval a x * eval b x

We can make this easier to use by making E a an instance of Num:

> instance Num a => Num (E a) where
>   fromInteger n = Const (fromInteger n)
>   a + b = a :+: b
>   a * b = a :*: b

And now we can write an alternative to d:

> d' f x = eval (diff (f X)) x
> example1 =  d' (\x -> x^3+2*x^2+x+1) 2

We of course get the same result.

So let's try the example from the paper again:
example1 =  d' (\x -> x*(d' (\y -> x+y) 1)) 1
We don't get an incorrect result. Instead, we get this error message:
Occurs check: cannot construct the infinite type: t0 = E t0
Expected type: E (E t0)
  Actual type: E t0
In the first argument of `(+)', namely `x'
In the expression: x + y
An almost identical error message. So what's going on?

Look at the type signature to d':
d' :: Num t => (E a -> E t) -> t -> t
When we evaluate
d' (\x -> x*(d' (\y -> x+y) 1)) 1
the outer d' differentiates symbolically so internally it uses an expression type. This means that x is an expression, not a numerical value. But this means that the inner d' is being asked to evaluate its result at a value that is itself an expression, not a value. So internally it uses expressions of expressions. So x is of type E a and y is of type E (E a). It's no surprise we can't add them. Our bug is easily fixed. We define a lifting function:

> lift' x = Const x

and now we can correctly evaluate:

> example2 =  d' (\x -> x*(d' (\y -> lift' x+y) 1)) 1

> lift x = D x 0
> example3 =  d (\x -> x*(d (\y -> lift x+y) 1)) 1

Much the same discussion applies to the AD code. In that case, D a is the type of a's that have been infinitesimally perturbed. The type D (D a) is a value whose perturbation is itself perturbed. But Siskind and Pearlmutter say that we might be subject to perturbation confusion and could get 2. Curious. There is no way we can 'confuse' these distinct perturbations. Not no how. Not no way. They correspond to data of entirely different types. Far from being fraught with danger, the Haskell type system keeps us safe. Perturbation confusion in Haskell is about as likely as expression tree confusion.
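To make the point about types concrete, here's a standalone copy of the definitions (so it runs on its own, with negate added to quiet the Num instance) together with the corrected nested example and, as a bonus, a second derivative computed by nesting d directly:

```haskell
-- Standalone dual-number AD, as in the post, plus a lifting function.
data D a = D { real :: a, infinitesimal :: a } deriving (Eq, Show)

instance Num a => Num (D a) where
  fromInteger n   = D (fromInteger n) 0
  D a a' + D b b' = D (a + b) (a' + b')
  D a a' * D b b' = D (a*b) (a*b' + a'*b)
  negate (D a a') = D (negate a) (negate a')

d :: Num a => (D a -> D a) -> a -> a
d f x = infinitesimal (f (D x 1))

lift :: Num a => a -> D a
lift x = D x 0

-- The nested example with lift in place: the inner derivative is 1,
-- so the outer d differentiates x*1 and gives 1.
nested :: Integer
nested = d (\x -> x * d (\y -> lift x + y) 1) 1

-- Nesting d directly gives higher derivatives: d²/dx² of x³ at 2 is 12.
secondDeriv :: Integer
secondDeriv = d (d (\x -> x*x*x)) 2
```

Note that no lift is needed in secondDeriv: the lambda captures nothing from an enclosing scope, so there is nothing to hold constant.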

Let's take a look at the analysis in section 2. They introduce a perturbation ε and show how we might have computed 2 instead of 1. But this derivation doesn't correspond to any computation we could possibly have made using the function d in Haskell. The only issue we have identified is that we have to write our code using correct types. This has nothing to do with automatic differentiation or perturbations. It applies equally well to the symbolic differentiation code. In fact, it has nothing to do with differentiation as it applies equally well to my code to compute the even and odd parts of a function.

We can press on with the paper. Section 3 tries to sell us a remedy - tagging. But we have no need for tagging. The Haskell compiler already deduced that the perturbation inside the inner d is of a different type to that in the outer d. The only explanation I can come up with for this section is that the authors have some experience with functional programming languages that are dynamically typed and are trying to apply that experience to Haskell.

Section 4 reiterates the point in the abstract that implementations of AD "fail to preserve referential transparency". I think we can safely ignore this claim. The AD code above isn't using any unsafe Haskell operations. It clearly is referentially transparent.

Confusion lifted
So now to Section 6. They talk about the lift function and how correctly inserting it requires "sophisticated non-local analysis". Now everything becomes clear. Siskind and Pearlmutter don't consider this algorithm to be "automatic differentiation" unless they can leave out the lift operations. But it is not common practice to write Haskell code by deliberately writing a significant body of code that doesn't type check and then expecting the missing pieces to be inserted automatically. You simply write the code correctly to start with. In fact, when I first found myself inserting a lift in my own code I didn't even think of myself as having solved a problem - I just wrote the code that was analogous to code that Haskell programmers write every day. If I had forgotten the lift, the compiler would have set me straight anyway.

Conclusion
"Perturbation confusion" is a non-existent problem in Haskell AD. Siskind and Pearlmutter correctly point out that Haskell code using AD is not exactly analogous to mathematical notation for derivatives. This is unsurprising: differentiation is not computable (for deep reasons that are similar to the reason that intermediate values are not computable). But functions like lift are an everyday part of typed functional programming. This is not a feature of AD; it applies equally well to symbolic differentiation (and indeed many other parts of Haskell, such as using monad transformers and nested functors). The algorithm is no less a form of AD because of the use of lift.

I have been advocating AD for many applications over recent years. It has been an uphill struggle. It is often incorrectly assumed that I am talking about numerical or symbolic differentiation. Programmers will often assert that what I am describing is completely impossible. But when they grasp what I am explaining, the response often changes to "why didn't anyone tell me about this years ago?" It is disappointing to see two major contributors to computer science whose work I greatly respect scaring away potential users with FUD about a non-existent problem.

(And if you got here but skipped the update at the top you really need to go to the top as I now think Siskind and Pearlmutter have a good point! I'm leaving the above just for the sake of posterity.)

Saturday, April 09, 2011

Image-based rendering and some ancient history

Introduction
I've not said much about my work in visual effects in this blog. This is mainly because I try very carefully to avoid any kind of intellectual property conflict with my employer. But now that I've left the world of visual effects to work at Google, I think I might throw in the occasional article. So here's something about my own work from many years ago. In particular, back in the year 2000 my colleagues George and Kim and I received an Academy Scientific and Technical Achievement Award for "the development of a system for image-based rendering allowing choreographed camera movements through computer graphic reconstructed sets." I rarely tell people what this work was for. But now I'll have a convenient web page to which I can direct anyone who asks. I'm hoping to aim this at people who know a little mathematics and geometry, but not necessarily anything about computer graphics.

Many years ago I worked at a company called Manex Visual Effects. One day the story of that company needs to be told. Here are some things that have been alleged about Manex that I neither confirm nor deny: its stock was traded in one of the largest ever pump-and-dump scams in the Far East, it extracted millions of dollars from Trenton, New Jersey for a fake scheme to create an East Coast "Hollywood", and it spent a couple of years making press releases about how it was doing work on the Matrix sequels despite the fact that there was nobody at the company who was actually doing work on the movies, including making press releases quoting me describing the ongoing work long after I had left. At one point I was apparently being trailed by a private detective who even came rowing a boat past the back of my house in order to get pictures of me. I haven't checked the fine print, but I think the contracts that prevented me from speaking about any of this expired long ago.

But back around 1998-2000, when Manex was doing real effects work, we developed a pipeline for a technique known as image based rendering. We became adept at taking large numbers of photographs of locations and then reproducing those locations as photorealistic renders. When I say photorealistic here I don't mean something that's merely supposed to look real, but actually looks fake. I mean renders that were indistinguishable from photography. That's commonplace now but back in 2000 it was a challenge, especially on a large scale.

The last ten seconds of this video clip from MI:2 should give some idea of what I'm talking about. The city around Tom Cruise as he jumps from the building is entirely CGI, but using real photography:



Texturing Cities
Although MI:2 isn't set in Sydney, that is the location that was used. A team returned with many hundreds, if not thousands of photographs of the central business district. We also managed to construct 3D models of these buildings by various means. We started by using Façade to construct the geometry, but ultimately we used a variety of methods including buying 3D models from, I think, the city itself. In later years I completely replaced the way we built geometry so we didn't need to use any other source.

The goal was to map the photography onto the models. Here's a picture illustrating three points being mapped onto a building. To render correctly we needed every visible point on every 3D object to be coloured (correct terminology: textured) using a pixel from one of the photographs.

The Matrix
We're ready for a theorem. Let P be a photograph of a scene S. Let (u, v) be ordinary rectilinear coordinates in P. Let (x, y, z) be 3D coordinates in the scene S. Define proj(u, v, w) = (u/w, v/w). Then there is a 3×4 matrix M such that for every point (x, y, z) visible in S, its colour is given by the colour of the point with coordinates proj(M(x, y, z, 1)^T) in P.

(This assumes a pinhole projection model for the camera. In fact, real cameras have lens distortion. I wrote software to measure and fix this.)
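To make the theorem concrete, here's a small Haskell sketch of proj and the matrix application. The matrix is represented naively as three rows of four numbers; a real M would come out of the camera solve described below, not be written by hand:

```haskell
-- proj from the theorem: divide through by the third homogeneous coordinate.
proj :: (Double, Double, Double) -> (Double, Double)
proj (u, v, w) = (u / w, v / w)

-- Apply a 3x4 matrix, given as three rows, to the homogeneous
-- point (x, y, z, 1).
apply :: [[Double]] -> (Double, Double, Double) -> (Double, Double, Double)
apply [r0, r1, r2] (x, y, z) = (row r0, row r1, row r2)
  where
    row [a, b, c, d] = a*x + b*y + c*z + d
    row _            = error "each row must have 4 entries"
apply _ _ = error "the matrix must have 3 rows"

-- Texture coordinate in the photograph of a scene point, under camera
-- matrix m: exactly proj(M(x, y, z, 1)^T) from the theorem.
texCoord :: [[Double]] -> (Double, Double, Double) -> (Double, Double)
texCoord m p = proj (apply m p)
```

Everything about a camera that matters for texturing is packed into those twelve numbers.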

The important point is that for each camera we just needed one matrix. We generated it by a very simple scheme: we had artists mark correspondences between points on our 3D models and points in the photography. I implemented a least squares solver to find the matrix that best fit each view. Pity the artists. They came fresh from college with well developed 3d modelling skills and I was responsible for reducing their careers to this simple point and click operation. But don't feel too bad. Most of these artists have since gone on to very successful careers in visual effects. They were a really great team.

But just mapping a single photograph is no good. The theorem tells us what to do for each point visible in a photograph, but how do we select which photograph to use? When producing the final render for production we had the luxury of being able to allow hours to render a single frame. We could make the decision on a pixel by pixel basis by performing a test for visibility from each camera, and then using heuristics to make the selection. (My colleague George had already proved this worked for the backgrounds to most of the bullet-time shots from The Matrix.)

But that was no good for the artists. They needed to see their work as they were doing it. If we could get the render time down to seconds it would be good. In fact, we got it down to a fraction of a second.

The Splitter
The first thing to note is that if we render a single triangle then the mapping defined by the theorem above is provided by modern graphics hardware. Back in 2000 it wasn't universally available, but it was in the SGI hardware we used. So we only needed to solve the camera selection problem fast. The solution was simple: we'd preprocess the geometry of the scene by splitting it into pieces (reducible to triangles), each piece corresponding to a part visible from one camera. Debevec et al. had already given demos of a technique for approximately partitioning a scene between cameras like this, but it didn't have the quality we wanted.

(By the way, my career in graphics started when I figured out an algorithm for eliminating the division from the proj() function and my new boss-to-be noticed I'd posted it online.)

There were three things needed:

1. For each camera, we needed to compute the subscene that was the intersection of the scene S with the volume viewed by the camera, its frustum.

2. For each subscene we needed to remove any parts occluded from view by any other part of the scene.

3. We needed to ensure that no two subscenes overlapped - i.e. that no part of any subscene was visible in more than one camera.

It seemed like a messy computational geometry problem. But after a little thought it dawned on me that most of the geometry work, at least as measured by computation time, could be factored into one tiny geometry routine and the rest was logic. Here's a picture of the algorithm:

It takes a convex polygon and slices it using a plane (indicated in grey). Now everything else follows. For example, step 1 above can be achieved like this: the frustum associated with a camera is a 6-sided convex polyhedron (or a 4- or 5-sided polyhedron if you want it to extend all the way to infinity and/or all the way to the point of projection in the camera.) We can decide which points are in the frustum by slicing using the 6 planes and keeping the pieces that fall inside.
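The production splitter is long gone, but that one tiny geometry routine can be sketched in a few lines of Haskell. A plane is a normal n and an offset d, points with n·p <= d count as inside, and slicing a convex polygon returns the piece on each side (degenerate slivers and on-plane vertices are not handled carefully here):

```haskell
type V3 = (Double, Double, Double)

-- A plane with normal n and offset d: points p with dot3 n p <= d
-- are on the "inside".
data Plane = Plane V3 Double

dot3 :: V3 -> V3 -> Double
dot3 (a, b, c) (x, y, z) = a*x + b*y + c*z

-- Linear interpolation between two points.
lerp :: Double -> V3 -> V3 -> V3
lerp t (a, b, c) (x, y, z) = (a + t*(x-a), b + t*(y-b), c + t*(z-c))

-- Slice a convex polygon (a non-empty, ordered vertex list) with a
-- plane, returning the inside and outside pieces.
slice :: Plane -> [V3] -> ([V3], [V3])
slice (Plane n d) poly = (concatMap toIns edges, concatMap toOuts edges)
  where
    edges  = zip poly (tail poly ++ [head poly])
    side p = dot3 n p - d
    -- The point where the edge p-q crosses the plane.
    cross p q = lerp (side p / (side p - side q)) p q
    toIns (p, q)
      | side p <= 0 = if side q <= 0 then [p] else [p, cross p q]
      | side q <= 0 = [cross p q]
      | otherwise   = []
    toOuts (p, q)
      | side p > 0 = if side q > 0 then [p] else [p, cross p q]
      | side q > 0 = [cross p q]
      | otherwise  = []
```

Clipping to a frustum is then six applications of slice, keeping the inside pieces each time; carving away a polygon's shadow volume uses the same routine but keeps the outside pieces.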

Step 2 works a little like this:
There is a 5-sided polygon lying in front of a box casting a shadow. The shadow volume is itself a 6-sided frustum (with a seventh side "at infinity", so to speak). So to remove the parts shadowed by a polygon we use the same slicing algorithm and collect up all the pieces that fall outside of the shadow volume. To remove all of the scene that is occluded from view by this camera we simply remove the shadow volume corresponding to every single polygon in the scene.

One of the virtues of image based rendering is that the detail in the photography makes up for the simplicity of the geometry, keeping the polygon count low. So the total number of operations might not seem too bad. Unfortunately, every time you slice through the scene you risk doubling the total number of polygons, so the preprocessing could have taken worse than exponential time in the number of polygons. But I took the risk, and surprisingly the time to preprocess was measured in minutes for typical scenes. The individual pieces of geometry were very intricate due to the fact that occlusion from any object could carve into a building. But like a jigsaw, the union of all the pieces gave something that looked just like a real city.

The last stage was ensuring that each part was assigned to exactly one camera. This was straightforward. Because every polygon had been processed with respect to every camera, I could associate with every polygon a list of cameras in which it was visible. At the end I could just sweep through and pick the best camera based on a heuristic. Typically we wanted to use the camera with the most undistorted detail.

Along the way I had to apply a few more heuristics to make things manageable. Many thin slivers would appear in the slicing. If they were small enough I threw them away. I'd also sweep through from time to time and fuse together neighbouring polygons that had been sliced but ended up still being visible in the same cameras, and whose union was still convex. That would reduce the total polygon count and speed up further processing.

It worked. The artists would mark their correspondences, run the optimiser to extract the per-camera matrices, kick off the 'splitter', have a coffee, and then return to a fully interactive view of Sydney, or wherever. They could now examine it from all angles and quickly ascertain what further work was needed, e.g. if there were holes in one of the buildings. That allowed us to develop an efficient workflow and carry out the work on the scale needed to complete movie shots. We could interactively visualise how other objects interacted with our cities. And we could use this tool to plan the photography and ensure we had all the coverage we needed.

In retrospect it all seems trivial. But at the time I don't think any other company could churn out entire city blocks the way we could. Today, Google do similar work on a much larger scale, and much more cleverly, with Street View.

And that was just one part of the technique that we marketed as "virtual cinematography" and which won George, Kim and me our award. But it's really important to remember that movie making is a big team effort. It took the work of many dozens of people to create our photorealistic cities, and the fact that I've chosen to write about my contribution doesn't mean that it was the only part that was important. It wouldn't have happened if George hadn't taught us the method originally, and if Kim hadn't believed in us and formed part of the company around our pipeline. And of course nothing would have happened if the artists hadn't politely put up with our crudely engineered tools.

Acknowledgement
The example city image above was derived from a picture by Calvin Teo on wikipedia. I don't have the original photography we used.
