## Saturday, April 07, 2012

### Generalised entropy

Introduction
The entropy of a probability distribution can be seen as a measure of its uncertainty or a measure of the diversity of samples taken from it. Over the years I've talked lots about how probability theory gives rise to a monad. This suggests the possibility that maybe the notion of entropy can be generalised to monads other than probability. So here goes...

```> {-# LANGUAGE MultiParamTypeClasses, FlexibleInstances, GeneralizedNewtypeDeriving #-}
> {-# LANGUAGE FunctionalDependencies, TypeSynonymInstances #-}

```
Shannon entropy
I've talked in the past about how there is some trickiness with defining the probability monad in Haskell because a good implementation requires use of the Eq typeclass, and hence restricted monads. Restricted monads are possible through a bunch of methods, but this time I don't want them.

It's common to represent probability distributions on finite sets as lists of pairs where each pair (p, x) means x has a probability p. But I'm going to allow lists without the restriction that each x appears once and make my code work with these generalised distributions. When I compute the entropy, say, it will only be the usual entropy in the case that each x in the list is unique.

So here's our type and some instances for it:

```> data P a = P [(a, Float)] deriving Show

> instance Functor P where
>      fmap f (P xs) = P [(f a, p) | (a, p) <- xs]

>      return x = P [(x, 1)]
>      P xss >>= f = P [(y, p*q) | (pxs, p) <- xss, let P ys = f pxs, (y, q) <- ys]

```
We can easily compute the expected value of a distribution, and its entropy, like this:

```> expectation0 (P xs) = sum [x*p | (x, p) <- xs]
> entropy0 (P xs) = -sum [if p==0 then 0 else p*log p/log 2.0 | (_, p) <- xs]

```
An important property of entropy is known as the grouping property which can be illustrated through an example tree like this:

The entropy for the probability distribution of the final leaves is the sum of two components: (1) the entropy of the branch at the root of the tree and (2) the expected entropy of the subtrees. Here's some corresponding code. First simple bernoulli trials:

```> bernoulli p a b = P [(a, p), (b, 1-p)]

```
Now the branch at the root of the tree:

```> root = bernoulli 0.3 False True

```
We can compute the entropy for the distrbution on the leaves:

```> test1 = entropy0 \$ do
>   x <- root
>   if x
>     then bernoulli 0.2 3 4
>     else bernoulli 0.4 5 6

```
Or the sum of the root entropy and the expected subtree entropy:

```> test2 = entropy0 root + (expectation0 \$ do
>   x <- root
>   if x
>     then return \$ entropy0 (bernoulli 0.2 3 4)
>     else return \$ entropy0 (bernoulli 0.4 5 6))

```
You can confirm for yourself that test1 == test2.

We can rewrite that a little. We're drawing True or False from root only to decide which distribution to use at the next stage. But we may as will pick the distribution itself at random. So define:

```> dist = bernoulli 0.3 (bernoulli 0.4 5 6) (bernoulli 0.2 3 4)

```
And now we expect the equality of test3 and test4:

```> test3 = entropy0 \$ do
>   x <- dist
>   x

> test4 = entropy0 dist + (expectation0 \$ do
>   x <- dist
>   return \$ entropy0 x)

```
There's a more elegant way of writing this. Define:

```> left0 dist = entropy0 (join dist)
> right0 dist = entropy0 dist+expectation0 (fmap entropy0 dist)

```
Now we expect left0 dist and right0 dist to always be equal. We've almost generalised to something that makes sense in the context of monads other than probability.

Here are a couple of important properties of expectation0:

1. expectation0 (return d) = d
2. expectation0 (join d) = expectation0 (fmap expectation d)

In English: the expectation of certainty is just the certain value, and the expectation of an expectation is just the expectation. But these rules are precisely the conditions that define an -algebra, where is a monad.

So let's define a type class:

```> class Algebra m a | m -> a where
>      expectation :: m a -> a

```
We'll assume that when m is a monad, any instance satisfies the two laws above. Here's the instance for probability:

```> instance Algebra P Float where
>      expectation (P xs) = sum [x*p | (x, p) <- xs]

```
In keeping with the notion that entropy measure diversity let's also define:

```> class Diverse m r | m -> r where
>      entropy :: m x -> r

```
with the instance:

```> instance Diverse P Float where
>      entropy (P xs) = -sum [if p==0 then 0 else p*log p/log 2.0 | (_, p) <- xs]

```
It's not clear what laws we need but for now we'll assume a generalised entropy satisfies left dist == right dist :

```> left dist = entropy (join dist)
> right dist = entropy dist+expectation (fmap entropy dist)

```
We'll call that the generalised grouping law.

Binary trees
It's not hard to find other structures that satisfy these laws if we cheat and use alternative structures to represent probabilities. For example We can make Tree an instance by assuming Fork represents a 50/50 chance of going one way or another:

```> data Tree a = Leaf a | Fork (Tree a) (Tree a) deriving Show

> instance Functor Tree where
>      fmap f (Leaf a) = Leaf (f a)
>      fmap f (Fork l r) = Fork (fmap f l) (fmap f r)

>      return x = Leaf x
>      Leaf a >>= f = f a
>      Fork l r >>= f = Fork (l >>= f) (r >>= f)

> instance Algebra Tree Float where
>      expectation (Leaf a) = a
>      expectation (Fork l r) = 0.5*expectation l+0.5*expectation r

> instance Diverse Tree Float where
>      entropy (Leaf a) = 0
>      entropy (Fork l r) = 1+0.5*entropy l+0.5*entropy r

```
Lists
We could make non-empty lists into an instance by assuming a uniform distribution on the list. But another way to measure the diversity is simply to count the elements. We subtract one so that [x] corresponds to diversity zero. This subtraction gives us a non-trivial instance:

```> newtype L a = L [a] deriving (Show, Monad, Functor)

> instance Algebra L Int where
>      expectation (L xs) = sum xs

> instance Diverse L Int where
>      entropy (L xs) = length xs-1

```
Tsallis entropy
There are measures of diversity for probability distributions that are distinct from Shannon entropy. An example is Tsallis entropy. At this point I'd like a family of types parametrised by reals but Haskell doesn't support dependent types. So I'll just fix a real number q and we can define:

```> q = 2.5

> data T a = T [(a, Float)] deriving Show

> instance Functor T where
>      fmap f (T xs) = T [(f a, p) | (a, p) <- xs]

>      return x = T [(x, 1)]
>      T xss >>= f = T [(y, p*q) | (pxs, p) <- xss, let T ys = f pxs, (y, q) <- ys]

> instance Algebra T Float where
>      expectation (T xs) = sum [x*p**q | (x, p) <- xs]

> instance Diverse T Float where
>      entropy (T xs) = (1-sum [p**q | (_, p) <- xs])/(q-1)

```
And again we find our generalised grouping rule for entropy holds.

This is all derived from Tom Leinster's post last year at the n-category cafe. As I talked about here there's a close relationship between monads and operads. Operads area a bit like container monads where the containers don't contain anything, but just have holes where contents could be placed. This makes operads a better place to work because you don't have the awkward issue I started with: having to disallow lists of value/probability pairs where the same value can appear more than once. Nonetheless, in (unrestricted) Haskell monads you don't have Eq available so you can't actually have definitions of return or >>= that can notice the equality of two elements. If such definitions were possible, the grouping law would no longer work as stated above.

Crossed homomorphisms
The generalised grouping law even makes sense for very different monads. For the Reader monad the law gives the definition of a crossed homomorphism. It's pretty weird seeing a notion from group cohomology emerge like this and I recommend skipping to the final section unless you care about this sort of thing. But if you do, this is related to research I did a long time ago. This is to test that the Schwarzian derivative really does give rise to a crossed homomorphism.

Firstly let me set up some automatic differentiation code:

```> data D a = D { re::a, im::a } deriving (Show, Ord, Eq)

> instance Num a => Num (D a) where
>      fromInteger n = D (fromInteger n) 0
>      D a a'+D b b' = D (a+b) (a'+b')
>      D a a'*D b b' = D (a*b) (a*b'+a'*b)
>      D a a'-D b b' = D (a-b) (a'-b')

> instance Fractional a => Fractional (D a) where
>      fromRational n = D (fromRational n) 0
>      D a a'/D b b' = let q = 1/b in D (a*q) ((-a*b'+a'*b)*q*q)

> lift x = D x 0

> d f x = im (f (D x 1))

> raised f = re . f . lift
> raised2 = raised . raised
> raised3 = raised2 . raised

```
The Cn are the n-times (automatically) differentiable functions. Unfortunately the Endo defined in Data.Monoid acts the wrong way round from what I want so I need a Dual:

```> type C1 = Dual (Endo (D Double))
> type C3 = Dual (Endo (D (D (D Double))))
> type C4 = Dual (Endo (D (D (D (D Double)))))

> instance Eq (Endo (D Double))
> instance Ord (Endo (D Double))

```
A silly Show instance that simply evaluates a function at a number I chose randomly: 1.234.

```> instance Show (Endo (D Double)) where
>         show (Endo f) = show (f 1.234)

> instance Num C1 where
>      fromInteger n = Dual (Endo (\x -> fromInteger n))
>      Dual (Endo  f)+Dual (Endo  g) = Dual (Endo  (\x -> f x + g x))
>      Dual (Endo  f)-Dual (Endo  g) = Dual (Endo  (\x -> f x - g x))
>      Dual (Endo  f)*Dual (Endo  g) = Dual (Endo  (\x -> f x * g x))

> instance Fractional C1 where
>      fromRational n = Dual (Endo (\x -> fromRational n))
>      Dual (Endo f)/Dual (Endo g) = Dual (Endo (\x -> f x / g x))

> newtype Q a = Q (Writer C4 a) deriving (Monad, Functor)

```
We can give Q a a geometrical interpretation. The underlying type is a pair (a, C4). If we think of elements of C4 as charts charts on a piece of Riemann surface then for any , an element of (a, C4) represents a local piece of a section of the th tensor power of the canonical bundle. Ie. we can think of it as representing . I'll concentrate on the case which gives quadratic differentials. We can think of an element of ((a, C4), C4) as forms where we're composing two charts. We can collapse down to an ordinary chart by using the chain rule. Here's the code:

```> instance Algebra Q C1 where
>      expectation (Q ma) = let (Dual (Endo a), Dual (Endo f)) = runWriter ma
>                           in Dual (Endo (\x -> a (raised3 f x)*(raised2 (d f) x)^2))

```
Now we can define the Schwarzian derivative:

```> schwarzian f x = let f0 = raised3 f x
>                      f1 = raised2 (d f) x
>                      f2 = raised (d \$ d f) x
>                      f3 = (d \$ d \$ d f) x
>                  in f3/f1-1.5*(f2/f1)^2

```
And somwehat bizarrely, we now have a generalised entropy:

```> instance Diverse Q C1 where
>      entropy (Q ma) = let (_, Dual (Endo f)) = runWriter ma
>                       in Dual (Endo (\x -> schwarzian f x))

```
This is the construction that gives rise to the Virasoro algebra which plays such an important role in String Theory.

Some tests
And here's a bunch of tests. I'd have used QuickCheck but it won't install for me today...

```> test :: (Algebra m t, Diverse m t, Num t, Functor m, Monad m) => m (m x) -> IO ()
> test x = do
>      print (left x, right x)

> main = do
>      test \$ L [L [1, 2, 3], L [2, 3, 4], L , L , L [2, 7::Int]]
>      test \$ P [(P [(0, 0.5), (1, 0.5)], 0.5), (P [(2, 0.5), (3::Int, 0.5)], 0.5::Float)]
>      test \$ T [(T [(0, 0.5), (1, 0.5)], 0.5), (T [(2, 0.5), (3::Int, 0.5)], 0.5::Float)]
>      test \$ Leaf (Leaf 1 `Fork` Leaf 2) `Fork` Leaf (Leaf 3 `Fork` (Leaf 4 `Fork` Leaf 5))
>      test \$ (Q (writer
>               (Q (writer (Dual (Endo (\x -> x)),
>                           Dual (Endo (\x -> x^2+1)))),
>                           Dual (Endo (\x -> (2+x)/(3+x*x))))) :: Q (Q C3))
``` Anonymous said...