Tuesday, November 09, 2010

Statistical Fingertrees

I have no time to post a proper article. But I have time to post a mini-article with details in the links.

> {-# LANGUAGE MultiParamTypeClasses #-}
> import Data.FingerTree
> import Data.Monoid

I'm a bad person. I often use this method to compute the mean and variance of some samples. It's not robust. It performs badly if the variance is small compared to the size of the samples. I shouldn't use it. As penance, this is a quick article on dynamically computing variances robustly and efficiently for datasets that are frequently manipulated.

For convenience I'll talk about the unscaled variance: the sum, not the average of the square deviation from the mean. You can easily compute the variance from this if you know the dataset size.

If we know the size, mean and unscaled variance of two sets (more properly, multisets) we can find the mean and unscaled variance of the their union using the formula here. This method is much more robust that the naive algorithm.

The rule for combining two datasets gives us a monoid:

> data Stats = Stats { n :: Float, mean :: Float, unscaledVariance :: Float }
>              deriving Show

> instance Monoid Stats where
>     mempty = Stats 0 0 undefined
>     Stats n m v `mappend` Stats 0 _ _ = Stats n m v
>     Stats 0 _ _ `mappend` Stats n m v = Stats n m v
>     Stats n m v `mappend` Stats n' m' v' = 
>       let delta = m' - m
>       in Stats (n + n') ((n*m+n'*m')/(n+n'))
>                (v + v' + delta*delta*n*n'/(n+n'))

Given a single sample, we can compute its stats:

> instance Measured Stats Float where
>     measure x = Stats 1 x 0

Now we need just one more line of code:

> type StatsTree = FingerTree Stats Float

We now have a data structure that allows us to freely split, join, delete elements from and add elements to sequences of samples, all the while robustly keeping track of their mean and unscaled variance (and hence their variance).

For example:

> example = fromList [1..10] :: StatsTree
> test = let (_, b) = split ((>=4) . n) example
>            (c, _) = split ((>3)  . n) b
>        in measure c

computes the stats for the 3 elements starting at the 4th element of example.

An example application might be maintaining rolling averages and variances for a sliding window.

This was inspired an article by John D Cook somewhere around here.

No comments:

Blog Archive