Saturday, July 31, 2010

Automatic Divided Differences

Divided Differences
I've previously talked about automatic differentiation here a few times. One of the standard arguments for using automatic differentiation is that it is more accurate than numeric differentiation implemented via divided differences. We can approximate f'(x) by using (f(x)-f(y))/(x-y) with a value of y near x. Accuracy requires y to be close to x, and that requires computing the difference between two numbers that are very close. But subtracting close numbers is itself a source of numerical error when working with finite precision. So you're doomed to error no matter how close you choose x and y to be.

However, the accuracy problem with computing divided differences can itself be fixed. In fact, we can adapt the methods behind automatic differentiation to work with divided differences too.

(This paragraph can be skipped. I just want to draw a parallel with what I said here. Firstly I need to correct the title of that article. I should have said it was about *divided differences*, not *finite differences*. The idea in that article was that the notion of a divided difference makes sense for types because for a large class of function you can define divided differences without using either differencing or division. You just need addition and multiplication. That's the same technique I'll be using here. I think it's neat to see the same trick being used in entirely different contexts.)

The Direct Approach
Firstly, here's a first attempt at divided differencing:

> diff0 f x y = (f x - f y)/(x - y)

We can try it on the function f:

> f x = (3*x+1/x)/(x-2/x)

diff0 f 1 1.000001 gives -14.0000350000029. Repeating the calculation with an arbitrary precision package (I used CReal) gives -14.000035000084000. We are getting nowhere near the precision we'd like when working with double precision floating point.

The Indirect Approach
Automatic differentiation used a bunch of properties of differentiation: linearity, the product rule and the chain rule. Similar rules hold for divided differences. First let me introduce some notation. If f is a function then I'll use f(x) for normal function application. But I'll use f[x,y] to mean the divided difference (f(x)-f(y))/(x-y). We have

(f+g)[x,y] = f[x,y]+g[x,y]
(fg)[x,y] = f(x)g[x,y]+f[x,y]g(y)
h[x,y] = f[g(x),g(y)]g[x,y] when h(x)=f(g(x))

We can modify the product rule to make it more symmetrical though it's not strictly necessary:

(fg)[x,y] = 0.5(f(x)+f(y))g[x,y]+0.5f[x,y] (g(x)+g(y))

(I got that from this paper by Kahan.)

In each case, given f evaluated at x and y, and its divided difference at [x, y], and the same for g, we can compute the corresponding quantities for the sum and product of f and g. So we can store f(x), f(y) and f[x,y] together in a single structure:

> data D a = D { fx :: a, fy :: a, fxy :: a } deriving (Eq, Show, Ord)

And now we can implement arithmetic on these structures using the rules above:

> instance Fractional a => Num (D a) where
>    fromInteger n = let m = fromInteger n in D m m 0
>    D fx fy fxy + D gx gy gxy = D (fx+gx) (fy+gy) (fxy+gxy)
>    D fx fy fxy * D gx gy gxy = D (fx*gx) (fy*gy) (0.5*(fxy*(gx+gy) + (fx+fy)*gxy))
>    negate (D fx fy fxy) = D (negate fx) (negate fy) (negate fxy)

I'll leave as an exercise the proof that this formula for division works:

> instance Fractional a => Fractional (D a) where
>    fromRational n = let m = fromRational n in D m m 0
>    D fx fy fxy / D gx gy gxy = D (fx/gx) (fy/gy) (0.5*(fxy*(gx+gy) - (fx+fy)*gxy)/(gx*gy))

For the identity function, i, we have i(x)=x, i(y)=y and i[x,y]=1. So for any x and y, the evaluation of the identity function at x, y and [x,y] is represented as D x y 1. To compute divided differences for any function f making use of addition, subtraction and division we need to simply apply f to D x y 1. We pick off the divided difference from the fxy element of the structure. Here's our replacement for diff0.

> diff1 f x y = fxy $ f (D x y 1)

This is all mimicking the construction for automatic differentiation.

Evaluating diff0 f 1 1.000001 gives -14.000035000083997. Much closer to the result derived using CReal. One neat thing about this is that we have a function that's well defined even in the limit as x tends to y. When we evaluate diff1 f 1 1 we get the derivative of f at 1.

I thought that this was a novel approach but I found it sketched at the end of this paper by Reps and Rall. (Though their sketch is a bit vague so it's not entirely clear what they intend.)

Both the Kahan paper and the Reps and Rall papers give some applications of computing divided diferences this way.

It's not clear how to deal with the standard transcendental functions. They have divided differences that are very complex compared to their derivatives.

There is a sense in which divided differences are uncomputable(!) and that what we've had to do is switch from an extensional description of functions to an intensional description to compute them. I'll write about this some day.

Note that the ideas here can be extended to higher order divided differences and that there are some really nice connections with type theory. I'll try to write about these too.

Update: I found another paper by Reps and Rall that uses precisely the method described here.

Saturday, July 03, 2010

Death to Hydrae (or the operational semantics of ordinals)

Unprovable Propositions
Among other things, Godel's first incompleteness theorem allows us to construct a statement in the language of Peano arithmetic that can't be proved using the axioms of Peano arithmetic. Unfortunately, this statement is a highly contrived proposition whose sole purpose is to be unprovable. People who learn of Godel's theorems often ask if there are other more natural and uncontrived mathematical statements that can't be proved from the Peano axioms.

My goal in this post will be to describe one of these propositions. Not just uncontrived, but actually very useful. I only intend to tell half of the story here because I feel like there are many good treatments already out there that tell the rest. I'm just going to get to the point where I can state the unprovable proposition, and then sketch how it can be proved if you allow yourself a little Set Theory.

> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> import Prelude hiding ((^))
> infixr 8 ^
> type Natural = Integer

Suppose we implement a function to compute the Fibonacci numbers like so:

> fib 0 = 0
> fib 1 = 1
> fib n = fib (n-2) + fib (n-1)

How do we know that fib terminates for all natural number arguments? One approach is this: if we pass in the argument n it clearly never recurses more than n levels. Each time it recurses it calls itself at most twice. So it must terminate in O(2n) steps (assuming that the primitive operations such as addition take constant time). We can think of this code in a kind of imperative way. It's a bit like n nested loops, each loop going round up to two times.

Suppose instead that we have some kind of recursive function g that goes n levels deep but for which the number of calls of g to itself is no longer two. In fact, suppose the number of self-calls is very large. Even worse, suppose that each time g is called, it calls itself many more times than it did previously, maybe keeping track of this ever growing number through a global variable. Or instead of a global variable, maybe an evil demon decides how many times g calls itself at each stage. Can you still be sure of termination?

A Simple Machine
In order to look at this question, we'll strip a computer right down to the bare minimum. It will have an input (that the evil demon could use) for natural numbers and will output only one symbol. Here's a design for such a machine:

> data Machine = Done | Output Machine | Input (Natural -> Machine)

A value of type Machine represents the state of the machine. Done means it has finished running. Output s means output a symbol and continue in state s. Input f means stop to input a number from the demon (or elsewhere), call it i, and then continue from state f i. This is very much in the style discussed by apfelmus and I in recent blog posts.

Here's an interpreter for one of these machines:

> run1 Done = return ()
> run1 (Output x) = print "*" >> run1 x
> run1 (Input f) = readLn >>= (run1 . f)

For any n we can easily build a machine to output n stars. This is such a natural machine to want to build it seems only right to give it the name n. If we want to do this then we need to make Machine an instance of Num and define fromInteger for it:

> instance Num Machine where
> fromInteger 0 = Done
> fromInteger n = Output (fromInteger (n-1))

Typing run1 8, say, will output 8 stars.

Now given two of these machines there is a natural notion of adding them. a + b is the machine that does everything b does followed by everything a does. (Remember, that's b then a.) To do this we need to dig into b and replace every occurrence of Done in it with a. That way, instead of finishing like b, it leads directly into a. In the case of a + Input f, for each number i we need to dig into f i replacing each Done with a:

> a + Done = a
> a + Output b = Output (a + b)
> a + Input f = Input (\i -> a + f i)

There's a natural way to multiply these machines too. The idea is that in a * b we run machine b. But each time the Output command is run, instead of printing a star it executes a. You can think of this as a control structure. If n is a natural number then a * n means running machine a n times. In the case of a * Input f, instead of multiplying by a fixed natural number, we get an input from the user and multiply by f i instead:

> _ * Done = Done
> a * Output b = a*b + a
> a * Input f = Input (\i -> a * f i)

We can make a machine to input a number and then output that many stars. Here it is:

> w = Input fromInteger

Try running run1 w.

Can you guess what the machine w * w does? Your first guess might be that it inputs two numbers and outputs as many stars as the product of the two numbers. Try it. What actually happens is that we're computing w * Input fromInteger. Immediately from the definition of * we get Input (\i -> w*i). In other words, the first input gives us an input i, and then w is run i times. So if we initially input i, we are then asked for i more inputs and after each input, the corresponding number of stars is output. Although the original expression contains just two occurrences of w, we are required to enter i+1 numbers.

Given the definitions of + and * it seems natural to define the power operation too:

> (^) :: Machine -> Machine -> Machine
> a ^ Done = Output Done
> a ^ Output b = a^b * a
> a ^ Input f = Input (\i -> a ^ f i)

The power operation corresponds to the nesting of loops. So, for example, w ^ n can be thought of loops nested n deep.

Try working out what w ^ w does when executed with run1.

Consider the set M of all machines built using just a finite number of applications of the three operators +, * and ^ to w and the non-zero naturals. (The non-zero condition means we exclude machines like 0*w that accept an input and do nothing with it.) Any such expression can be written as f w, where the definition of f makes no mention of w.

Suppose we use run1 and we always enter the same natural n. Then each occurrence of w acts like n. So if we start with some expression in w, say f w, then always inputting n results in f n stars. We could test this with run1 (w^w^w^w), always entering 2, but it would require a lot of typing. Instead we can write another intepreter that consumes its inputs from a list rather from the user (or demon). And instead of printing stars it simply prints out the total number of stars at the end:

> run2 Done _ = 0
> run2 (Output x) as = 1 + run2 x as
> run2 (Input f) (a:as) = run2 (f a) as

Now you can try run2 (w^w^w^w) [2,2..] and see that we (eventually) get 2222.

Termination Again
If we run a machine in M there's a pattern that occurs again and again. We input a number, and then as a result we go into a loop requesting more numbers. These inputs may in turn request more inputs. Like the mythological hydra, every input we give may spawn many more requests for inputs. As the number of inputs required may depend on our previous inputs, and we may input numbers as large as we like, these machines may run for a long time. Suppose our machine terminates after requesting n inputs. Then there must be some highest number that we entered. Call it m. Then if the original machine was f w (with f defined in terms of the 3 operators and non-zero naturals), the machine must have terminated outputting no more than f m stars. So if our machine terminates, we can bound how many steps it took.

But do our machines always terminate? The input we give to the machine might not be bounded. If we run run2 (w^w) [4,5..], say, the inputs grow and grow. If these inputs grow faster than we can chop off the heads of our hydra, we might never reach termination.

Consider a program to input n and then output fib n. It accepts an input, recurses to a depth of at most n, and calls itself at most twice in each recursion. Compare with the machine 2 ^ w. This accepts an input n, recurses to a depth n, calling itself exactly twice each time. So if 2 ^ w terminates, so does fib. The more complex example above where I introduced the evil demon will terminate if w ^ w does, as long as the demon doesn't stop inputting numbers. So if we can show in one proof that every machine of type Machine terminates, then there are many programs whose termination we could easily prove.

Let's consider an example like run1 (w ^ w) with inputs 2, 3, 4, ...

We start with w ^ w. Examining the definition of the operator ^ we see that this proceeds by requesting an input. The first input is 2. Now we're left with w ^ 2. This is w * w. Again it accepts an input. This time 3. Now we go to state w * 3. This is w*2 + w. Again we accept an input. This time 4. We are led to w*2 + 4. This now outputs 4 stars and we are left with w * 2 which is w + w. We accept an input 5, output 5 stars and are left with w. After a further input of 6, it outputs 6 stars and terminates. Or we could just run run2 (w ^ w) [2,3..] and get 15(=4+5+6) as output.

The transitions are:
w ^ w
-> w ^ 2
-> w * w
-> w * 3
-> w*2 + w
-> w*2 + 4 -> ... -> w * 2
-> w + w
-> w + 5 -> ... w
-> 6 -> ... -> 0.

Now for some Set Theory. Rewrite the above sequence using the transfinite ordinal ω instead of w. The sequence becomes a sequence of ordinals. Any time we accept an input, the rightmost ω becomes a finite ordinal. So we have a descending sequence of ordinals. This is true whatever ordinal we start with. The execution of either Input or Output always strictly decreases our ordinal, and any descending sequence of ordinals must eventually terminate. Therefore every machine in M eventually terminates.

But here's the important fact: to show termination we used the ordinal ω, and this required the axiom of infinity and some Set Theory. Instead we could encode the termination question, via Godel numbering, as a proposition of Peano arithmetic. If we do this, then we hit against an amazing fact. It can't be proved using the axioms of Peano arithmetic. So we have here a useful fact, not a contrived self-referential one, that can't be proved with Peano arithmetic.

Why can't it be proved using just the Peano axioms?
A few years back, Jim Apple made a post about constructing (some) countable ordinals in Haskell. His construction nicely reflects the definitions a set theorist might make, but the code doesn't actually do anything. Later I learned from Hyland and Power how you can interpret algebraic structures as computational effects. apfelmus illustrates nicely how an abstract datatype can be made to do things with the help of an interpreter. Roughly speaking, doing this is what is known as operational semantics. So I thought, why not apply this approach to the algebraic rules for defining and combining ordinals. The result are the interpreters run1 and run2 above.

run1 gives an example of a Hydra game. In fact, its precisely the hydra game described in this paper because it always chops off the rightmost head. The Kirby-Paris theorem tells us we can't prove this game terminates using just the Peano axioms. A web search on Goodstein's theorem will reveal many great articles with the details.

A well-ordered quantity that you can keep decreasing as a program runs, and that can be used to prove termination, is an example of a loop variant. Loop variants are often natural numbers but the above shows that transfinite ordinals make fine loop variants. But in the interest of being fair and balanced, here's a dissenting view. The author has a point. If you are forced to use transfinite ordinals to show your program terminates, the age of the universe will probably be but the briefest flicker compared to your program's execution. On the other hand, if you don't want an actual bound on the execution time, ordinals can provide very short proofs of termination for useful programs.

(By the way, this article is a sequel to my own article.)

Algebraic structures give rise to monads. Can you see how to generalise the definition of Machine to make it a monad? If you pick the right definition then the substitution of a for Done in the definition of + should give you a particularly simple definition of ordinal addition. (See this for a hint on how substitution works in monads.)

> instance Show Machine
> instance Eq Machine
> instance Ord Machine