# A Neighborhood of Infinity

## Saturday, August 13, 2011

### Computing errors with square roots of infinitesimals.

Abstract
Automatic differentiation (AD) gives a way to carry out uncertainty propagation. But used in the obvious way it leads to bias. This article introduces "square roots of infinitesimals" that can be used to give more accurate results.

Introduction
In the real world measurements have errors and we often want to know how much our final answers are affected by those errors. One tool for measuring the sensitivity to errors of our results is calculus. In fact, we can use automatic differentiation to give us a nice way to model error. Here's an implementation:

> import Control.Monad
> import Control.Applicative
> import qualified Data.IntMap as I

> infixl 6 .+.
> infixl 7 .*
> infixl 7 *.

> data D a = D a a deriving (Eq, Show)

> instance Num a => Num (D a) where
>   D x a+D x' a' = D (x + x') (a + a')
>   D x a*D x' a' = D (x*x') (a*x' + x*a')
>   negate (D x a) = D (negate x) (negate a)
>   fromInteger n = D (fromInteger n) 0
>   abs _ = error "No abs"
>   signum _ = error "No signum"

> d = D 0 1


As I've talked about before, the value d can be thought of as an infinitesimal number whose square is zero. However, to first order we can replace d with a small number and use it to compute errors. Here's a function to perform such a substitution:

> approx :: Num a => D a -> a -> a
> approx (D x d) e = x+d*e


Suppose we have a square whose side we've measured as 1m to an accuracy of 1cm. We can represent this as:

> sq_side = D 1 0.01


We can now compute the area:

> sq_area = sq_side^2


We get D 1.0 2.0e-2. We can interpret this as meaning the area is 1m2 with an accuracy of 0.02m2.

We can make "to an accuracy of" more precise. Differentiation models a function locally as an approximate linear function. (1m+δ)2=1m2+2mδ+O(δ2). So at 1m, the function to compute area locally scales lengths by 2m. So if the length measurement is actually a sample from a distribution with given mean 1m and standard deviation 0.01m, the area is approximately a sample from a distribution with mean 1m2 and SD 0.02m2.

This approach is nice, but sometimes we want a little more accuracy in our estimates. In particular, if you square a sample from a normal distribution with small variance and positive mean, then the nonlinearity of the squaring operation means that samples that are larger than the mean move further away from the mean, when squared, than samples less than the mean. So we should actually expect our area computations to be slightly biased upwards from 1m2. Unfortunately, this is a second order effect that isn't visible from looking only at first derivatives.

That's not a problem, we can easily compute second derivatives using automatic differentiation. However, that can complicate things. What happens if we use multiple measurements to compute a quantity? Each one is a different sample from a different distribution and we don't want these measurements to be correlated. If we approach this in the obvious way, when we want to use n measurements we'll need to compute n2 partial second derivatives. However, by tweaking AD slightly we'll only need n derivatives.

Square roots of infinitesimals
In addition to the usual infinitesimal d we want to introduce quantities, w_i, that represent independent random "noise" variables that are infinitesimal in size. We'll be interested in expectation values so we'll also need an expectation function, e. We want e(w_i)=0. But wi2 is always positive so its expectation is always greater than or equal to zero. We want our random variables to be infinitesimal so we pick e(wi2)=d. We also want e(wiwj)=0 because of independence. If the wi are already infinitesimal, the dwi should be zero. So let's define an algebraic structure that captures these relationships. So we extend D a by introducing w i so that:

Any element of this algebra can be written as x+ad+Σbizi. We represent b sparsely by using an IntMap. Here's an implementation:

> data S a = S a a (I.IntMap a) deriving (Eq, Show)

> (.+.) :: Num a => I.IntMap a -> I.IntMap a -> I.IntMap a
> ( *.) :: Num a => a -> I.IntMap a -> I.IntMap a
> (.*)  :: Num a => I.IntMap a -> a -> I.IntMap a
> (.*.) :: Num a => I.IntMap a -> I.IntMap a -> a

> (.+.) = I.unionWith (+)
> a *. v = I.map (a *) v
> v .* b = I.map (* b) v
> a .*. b = I.fold (+) 0 $I.intersectionWith (*) a b > instance Num a => Num (S a) where > S x a b+S x' a' b' = S (x + x') (a + a') (b .+. b') > S x a b*S x' a' b' = S (x*x') (a*x' + x*a' + b.*.b') (x*.b' .+. b.*x') > negate (S x a b) = S (negate x) (negate a) (I.map negate b) > fromInteger n = S (fromInteger n) 0 I.empty > abs _ = error "No abs" > signum _ = error "No signum"  Here are the individual w i: > w :: Num a => Int -> S a > w i = S 0 0 (I.fromList [(i, 1)])  We compute expectation values linearly by mapping the w i to zero: > e :: Num a => S a -> D a > e (S x a _) = D x a  We can also represent numbers whose values we know precisely: > sure x = S x 0 I.empty  Example Let's revisit the area example. This time we can represent the length of the side of our square as > sq_side' = 1+0.01*w 0 > sq_area' = sq_side'^2  We get S 1.0 1.0e-4 (fromList [(0,2.0e-2)]). We can directly read off that we have a bias of 10-4m2 which is 1cm^2. We can encapsulate this as: > mean f = approx (e f) 1  We can directly read off the variance from the element of the algebra. However, we can also compute the variance using mean. It's just: > var f = mean ((f-sure (mean f))^2)  (Note that this gives a very slightly different result from the value you can read off directly from the S object. It depends on whether we're measuring the deviation around the unbiased or biased mean. To the order we're considering here the difference is small. Here's var' anyway: > var' (S _ _ v) = I.fold (+) 0$ I.map (\x -> x^2) v


)

We can also define covariance:

> cov f g = mean ((f-sure (mean f))*(g-sure (mean g)))


More functions
We can now follow through just like with automatic differentiation to compute lots more functions. We use the fact that:

> instance Fractional a => Fractional (S a) where
>   fromRational x = S (fromRational x) 0 I.empty
>   recip (S x a b) = let r = recip x
>                     in S r (-a*r*r+r*r*r*(b.*.b)) ((-r*r)*.b)

> instance Floating a => Floating (S a) where
>   pi = sure pi
>   sin (S x a b) = let s = sin x
>                       c = cos x
>                   in S s (a*c - s/2*(b.*.b)) (c*.b)
>   cos (S x a b) = let s = sin x
>                       c = cos x
>                   in S c (-a*s - c/2*(b.*.b)) ((-s)*.b)
>   exp (S x a b) = let e = exp x
>                   in S e (a*e + e/2*(b.*.b)) (e*.b)
>   sqrt (S x a b) = let s = sqrt x
>                   in S s (a/(2*s)-1/(4*s*s*s)*(b.*.b)) (1/(2*s)*.b)
>   log (S x a b) = let r = 1/x
>                   in S (log x) (r*a-r*r/2*(b.*.b)) (r*.b)
>   asin = undefined
>   acos = undefined
>   atan = undefined
>   sinh = undefined
>   cosh = undefined
>   tanh = undefined
>   asinh = undefined
>   acosh = undefined
>   atanh = undefined


A real example
Let's make this effort worthwhile. We'll compute errors for a computation that uses the errors in a messy nonlinear way. Suppose we're in the lab measuring radioactive decay. We measure the geiger counter reading at times t = 0hr, 1hr, 2hr, 3hr, 4hr at which point we compute an estimate for when the decay will drop to one tenth of its original value. We'll assume the decay fits a model counts/sec = a exp(-λt) and that the counts have an error with SD 0.05. We're going to compute the error in the estimated time to hit one tenth radioactivity in the case when the half life is 30 minutes and a=2:

> t = [0..4]
> counts = map (\i-> 2*exp(-0.5*fromIntegral i)+0.05*w i) t



We'll be fitting a curve using logarithmic regression so we'll need the following function. Given a pair of lists x and y it returns (m, c) where y=mx+c is the standard least squares fit.

> regress :: Fractional a => [a] -> [a] -> (a, a)
> regress x y =
>     let sx = sum x
>         sy = sum y
>         sxx = sum $map (^2) x > sxy = sum$ zipWith (*) x y
>         n = fromIntegral (length x)
>         s = 1/(sx*sx-n*sxx)
>     in (s*sx*sy-s*n*sxy, -s*sxx*sy+s*sx*sxy)


Logarithmic regression:

> (m, c) = regress (map fromIntegral t) (map log counts)
> lambda = -m
> a = exp c
> t_tenth = -log (0.1/a)/lambda


We can now go ahead and compute the mean and variance of our estimate:
*Main> mean t_tenth
5.98036172868899
*Main> var t_tenth
0.15583537298560224

The correct time is about 5.991 so the regression method above is biased by about 0.01. If we repeated the same experiment over and over again and averaged the estimates we got from logarithmic regression the process would not converge to the correct result. In fact, we can compute "ground truth" by simulating the experiment a million times in Octave and estimate the mean and variance from that. The code is in the appendix. Obviously this is a much slower process but it clearly demonstrates the biasedness of using regression this way.
GNU Octave, version 3.4.0
Copyright (C) 2011 John W. Eaton and others.
ans =  5.9798
ans =  0.15948


Final thoughts
This is yet another example of extending automatic differentiation. We have variants for single variable differentiation, multivariate differentiation, multiple differentiation, divided differences, splitting a function into odd and even parts and now automatic error propagation.

This stuff was very loosely inspired by reading An Introduction to Stochastic Processes in Physics. I'm attempting to capture the semi-formal rules used in that book to reason about differentials and you can think of the algebra above as representing stochastic differentials. I made a guess that the algebra is called the Itō algebra. Sure enough, you'll get a few hits.

The most similar published work I can find is Automatic Propagation of Uncertainties but it seems to just use ordinary AD.

This technique may be useful for Extended Kalman Filtering.

I haven't done the work to make precise statements about how accurate you can expect my estimates of expectations to be.

It's possible to implement a monad with syntax similar to other probability monads by using state to bump up the i in w i each time you generate a new random variable. But bear in mind, these are always intended to be used as *infinitesimal* random variables.

Appendix: Octave code
m = 5;
n = 1000000;

x = repmat([0:m-1]',1,n);
y = repmat([2*exp(-0.5*[0:m-1]')],1,n)+0.05*normrnd(0,1,m,n);

sx = sum(x);
sxx = sum(x.*x);
p = sum(log(y));
q = sum(x.*log(y));

s = 1./(sx.*sx-m*sxx);
m = s.*sx.*p-m*s.*q; # Redefined
c = -s.*sxx.*p+s.*sx.*q;

lambda = -m;
a = exp(c);
x_tenth = -log(0.1./a)./lambda;

mean(x_tenth)
var(x_tenth)


Aaron Denney said...

Stochastic calculus comes in two forms, Itō as you noticed, and Stratonovich. They're both defined with a limiting process similar to the Riemann integral. They basically differ in that the Itō formulation is explicitly "causal" and function evaluation happens at the beginning of a segment. The Stratonovich formulation uses a balanced "midpoint" evaluation strategy. Unlike the Riemann integral case, these two converge differently. The big draw for the Itō formulation is that the calculations are much easier, as many expectation values are easily seen to be zero. The big draw for the Stratonovich formulation is that corresponds more directly to physical differential equations, and is the only one that has the right transformation properties to be used on manifolds.

Orwin said...

It is fair to point out that rounding errors in computation do not match errors of estimate in physics, which may run to three decimal places. Conversion to binary makes the problem worse: to think otherwise is just structuralist myth-making.

Only in the ancient continued fractions are rational and irrational numbers comprehensively distinguished by termination. But an error of estimate encompasses noise as well as finite accuracy, so the problem is generic, and comparable rather to heat of computation!

Polymath said...

Abraham Robinson's Non-Standard Analysis introduces infinitesimals into the number-line. It would be interesting (and challenging) to construct an algorithmic 'library' implementing this rigorous, but relatively simple theory. I expect that implementation of your ideas in NS Analysis would be result in very simple programs.
The library would have application to many areas of mathematics. That said, I am not volunteering to map the theory into any existing language!