### Computing errors with square roots of infinitesimals.

**Abstract**

Automatic differentiation (AD) gives a way to carry out uncertainty propagation. But used in the obvious way it leads to bias. This article introduces "square roots of infinitesimals" that can be used to give more accurate results.

**Introduction**

In the real world measurements have errors and we often want to know how much our final answers are affected by those errors. One tool for measuring the sensitivity to errors of our results is calculus. In fact, we can use automatic differentiation to give us a nice way to model error. Here's an implementation:

> import Control.Monad > import Control.Monad.State > import Control.Applicative > import qualified Data.IntMap as I > infixl 6 .+. > infixl 7 .* > infixl 7 *. > data D a = D a a deriving (Eq, Show) > instance Num a => Num (D a) where > D x a+D x' a' = D (x + x') (a + a') > D x a*D x' a' = D (x*x') (a*x' + x*a') > negate (D x a) = D (negate x) (negate a) > fromInteger n = D (fromInteger n) 0 > abs _ = error "No abs" > signum _ = error "No signum" > d = D 0 1As I've talked about before, the value

`d`can be thought of as an infinitesimal number whose square is zero. However, to first order we can replace

`d`with a small number and use it to compute errors. Here's a function to perform such a substitution:

> approx :: Num a => D a -> a -> a > approx (D x d) e = x+d*eSuppose we have a square whose side we've measured as 1m to an accuracy of 1cm. We can represent this as:

> sq_side = D 1 0.01We can now compute the area:

> sq_area = sq_side^2We get

`D 1.0 2.0e-2`. We can interpret this as meaning the area is 1m

^{2}with an accuracy of 0.02m

^{2}.

We can make "to an accuracy of" more precise. Differentiation models a function locally as an approximate linear function. (1m+δ)

^{2}=1m

^{2}+2mδ+O(δ

^{2}). So at 1m, the function to compute area locally scales lengths by 2m. So if the length measurement is actually a sample from a distribution with given mean 1m and standard deviation 0.01m, the area is approximately a sample from a distribution with mean 1m

^{2}and SD 0.02m

^{2}.

This approach is nice, but sometimes we want a little more accuracy in our estimates. In particular, if you square a sample from a normal distribution with small variance and positive mean, then the nonlinearity of the squaring operation means that samples that are larger than the mean move further away from the mean, when squared, than samples less than the mean. So we should actually expect our area computations to be slightly biased upwards from 1m

^{2}. Unfortunately, this is a second order effect that isn't visible from looking only at first derivatives.

That's not a problem, we can easily compute second derivatives using automatic differentiation. However, that can complicate things. What happens if we use multiple measurements to compute a quantity? Each one is a different sample from a different distribution and we don't want these measurements to be correlated. If we approach this in the obvious way, when we want to use n measurements we'll need to compute n

^{2}partial second derivatives. However, by tweaking AD slightly we'll only need n derivatives.

**Square roots of infinitesimals**

In addition to the usual infinitesimal d we want to introduce quantities, w_i, that represent independent random "noise" variables that are infinitesimal in size. We'll be interested in expectation values so we'll also need an expectation function, e. We want e(w_i)=0. But w

_{i}

^{2}is always positive so its expectation is always greater than or equal to zero. We want our random variables to be infinitesimal so we pick e(w

_{i}

^{2})=d. We also want e(w

_{i}w

_{j})=0 because of independence. If the w

_{i}are already infinitesimal, the dw

_{i}should be zero. So let's define an algebraic structure that captures these relationships. So we extend

`D a`by introducing

`w i`so that:

Any element of this algebra can be written as x+ad+Σb

_{i}z

_{i}. We represent b sparsely by using an

`IntMap`. Here's an implementation:

> data S a = S a a (I.IntMap a) deriving (Eq, Show) > (.+.) :: Num a => I.IntMap a -> I.IntMap a -> I.IntMap a > ( *.) :: Num a => a -> I.IntMap a -> I.IntMap a > (.*) :: Num a => I.IntMap a -> a -> I.IntMap a > (.*.) :: Num a => I.IntMap a -> I.IntMap a -> a > (.+.) = I.unionWith (+) > a *. v = I.map (a *) v > v .* b = I.map (* b) v > a .*. b = I.fold (+) 0 $ I.intersectionWith (*) a b > instance Num a => Num (S a) where > S x a b+S x' a' b' = S (x + x') (a + a') (b .+. b') > S x a b*S x' a' b' = S (x*x') (a*x' + x*a' + b.*.b') (x*.b' .+. b.*x') > negate (S x a b) = S (negate x) (negate a) (I.map negate b) > fromInteger n = S (fromInteger n) 0 I.empty > abs _ = error "No abs" > signum _ = error "No signum"Here are the individual

`w i`:

> w :: Num a => Int -> S a > w i = S 0 0 (I.fromList [(i, 1)])We compute expectation values linearly by mapping the

`w i`to zero:

> e :: Num a => S a -> D a > e (S x a _) = D x aWe can also represent numbers whose values we know precisely:

> sure x = S x 0 I.empty

**Example**

Let's revisit the area example. This time we can represent the length of the side of our square as

> sq_side' = 1+0.01*w 0 > sq_area' = sq_side'^2We get

`S 1.0 1.0e-4 (fromList [(0,2.0e-2)])`. We can directly read off that we have a bias of 10

^{-4}m

^{2}which is 1cm^2. We can encapsulate this as:

> mean f = approx (e f) 1We can directly read off the variance from the element of the algebra. However, we can also compute the variance using

`mean`. It's just:

> var f = mean ((f-sure (mean f))^2)(Note that this gives a very slightly different result from the value you can read off directly from the

`S`object. It depends on whether we're measuring the deviation around the unbiased or biased mean. To the order we're considering here the difference is small. Here's

`var'`anyway:

> var' (S _ _ v) = I.fold (+) 0 $ I.map (\x -> x^2) v)

We can also define covariance:

> cov f g = mean ((f-sure (mean f))*(g-sure (mean g)))

**More functions**

We can now follow through just like with automatic differentiation to compute lots more functions. We use the fact that:

> instance Fractional a => Fractional (S a) where > fromRational x = S (fromRational x) 0 I.empty > recip (S x a b) = let r = recip x > in S r (-a*r*r+r*r*r*(b.*.b)) ((-r*r)*.b) > instance Floating a => Floating (S a) where > pi = sure pi > sin (S x a b) = let s = sin x > c = cos x > in S s (a*c - s/2*(b.*.b)) (c*.b) > cos (S x a b) = let s = sin x > c = cos x > in S c (-a*s - c/2*(b.*.b)) ((-s)*.b) > exp (S x a b) = let e = exp x > in S e (a*e + e/2*(b.*.b)) (e*.b) > sqrt (S x a b) = let s = sqrt x > in S s (a/(2*s)-1/(4*s*s*s)*(b.*.b)) (1/(2*s)*.b) > log (S x a b) = let r = 1/x > in S (log x) (r*a-r*r/2*(b.*.b)) (r*.b) > asin = undefined > acos = undefined > atan = undefined > sinh = undefined > cosh = undefined > tanh = undefined > asinh = undefined > acosh = undefined > atanh = undefined

**A real example**

Let's make this effort worthwhile. We'll compute errors for a computation that uses the errors in a messy nonlinear way. Suppose we're in the lab measuring radioactive decay. We measure the geiger counter reading at times t = 0hr, 1hr, 2hr, 3hr, 4hr at which point we compute an estimate for when the decay will drop to one tenth of its original value. We'll assume the decay fits a model counts/sec = a exp(-λt) and that the counts have an error with SD 0.05. We're going to compute the error in the estimated time to hit one tenth radioactivity in the case when the half life is 30 minutes and a=2:

> t = [0..4] > counts = map (\i-> 2*exp(-0.5*fromIntegral i)+0.05*w i) t

We'll be fitting a curve using logarithmic regression so we'll need the following function. Given a pair of lists

`x`and

`y`it returns

`(m, c)`where y=mx+c is the standard least squares fit.

> regress :: Fractional a => [a] -> [a] -> (a, a) > regress x y = > let sx = sum x > sy = sum y > sxx = sum $ map (^2) x > sxy = sum $ zipWith (*) x y > n = fromIntegral (length x) > s = 1/(sx*sx-n*sxx) > in (s*sx*sy-s*n*sxy, -s*sxx*sy+s*sx*sxy)Logarithmic regression:

> (m, c) = regress (map fromIntegral t) (map log counts) > lambda = -m > a = exp c > t_tenth = -log (0.1/a)/lambdaWe can now go ahead and compute the mean and variance of our estimate:

*Main> mean t_tenth 5.98036172868899 *Main> var t_tenth 0.15583537298560224The correct time is about 5.991 so the regression method above is biased by about 0.01. If we repeated the same experiment over and over again and averaged the estimates we got from logarithmic regression the process would not converge to the correct result. In fact, we can compute "ground truth" by simulating the experiment a million times in Octave and estimate the mean and variance from that. The code is in the appendix. Obviously this is a much slower process but it clearly demonstrates the biasedness of using regression this way.

GNU Octave, version 3.4.0 Copyright (C) 2011 John W. Eaton and others. ans = 5.9798 ans = 0.15948

**Final thoughts**

This is yet another example of extending automatic differentiation. We have variants for single variable differentiation, multivariate differentiation, multiple differentiation, divided differences, splitting a function into odd and even parts and now automatic error propagation.

This stuff was very loosely inspired by reading An Introduction to Stochastic Processes in Physics. I'm attempting to capture the semi-formal rules used in that book to reason about differentials and you can think of the algebra above as representing stochastic differentials. I made a guess that the algebra is called the Itō algebra. Sure enough, you'll get a few hits.

The most similar published work I can find is Automatic Propagation of Uncertainties but it seems to just use ordinary AD.

This technique may be useful for Extended Kalman Filtering.

I haven't done the work to make precise statements about how accurate you can expect my estimates of expectations to be.

It's possible to implement a monad with syntax similar to other probability monads by using state to bump up the

`i`in

`w i`each time you generate a new random variable. But bear in mind, these are always intended to be used as *infinitesimal* random variables.

**Appendix: Octave code**

m = 5; n = 1000000; x = repmat([0:m-1]',1,n); y = repmat([2*exp(-0.5*[0:m-1]')],1,n)+0.05*normrnd(0,1,m,n); sx = sum(x); sxx = sum(x.*x); p = sum(log(y)); q = sum(x.*log(y)); s = 1./(sx.*sx-m*sxx); m = s.*sx.*p-m*s.*q; # Redefined c = -s.*sxx.*p+s.*sx.*q; lambda = -m; a = exp(c); x_tenth = -log(0.1./a)./lambda; mean(x_tenth) var(x_tenth)