### Generalised entropy

**Introduction**

The entropy of a probability distribution can be seen as a measure of its uncertainty or a measure of the diversity of samples taken from it. Over the years I've talked lots about how probability theory gives rise to a monad. This suggests the possibility that maybe the notion of entropy can be generalised to monads other than probability. So here goes...

> {-# LANGUAGE MultiParamTypeClasses, FlexibleInstances, GeneralizedNewtypeDeriving #-} > {-# LANGUAGE FunctionalDependencies, TypeSynonymInstances #-} > import Control.Monad > import Control.Monad.Writer hiding (lift)

**Shannon entropy**

I've talked in the past about how there is some trickiness with defining the probability monad in Haskell because a good implementation requires use of the

`Eq`typeclass, and hence restricted monads. Restricted monads are possible through a bunch of methods, but this time I don't want them.

It's common to represent probability distributions on finite sets as lists of pairs where each pair

`(p, x)`means

`x`has a probability

`p`. But I'm going to allow lists without the restriction that each

`x`appears once and make my code work with these generalised distributions. When I compute the entropy, say, it will only be the usual entropy in the case that each

`x`in the list is unique.

So here's our type and some instances for it:

> data P a = P [(a, Float)] deriving Show > instance Functor P where > fmap f (P xs) = P [(f a, p) | (a, p) <- xs] > instance Monad P where > return x = P [(x, 1)] > P xss >>= f = P [(y, p*q) | (pxs, p) <- xss, let P ys = f pxs, (y, q) <- ys]We can easily compute the expected value of a distribution, and its entropy, like this:

> expectation0 (P xs) = sum [x*p | (x, p) <- xs] > entropy0 (P xs) = -sum [if p==0 then 0 else p*log p/log 2.0 | (_, p) <- xs]An important property of entropy is known as the grouping property which can be illustrated through an example tree like this:

The entropy for the probability distribution of the final leaves is the sum of two components: (1) the entropy of the branch at the root of the tree and (2) the expected entropy of the subtrees. Here's some corresponding code. First simple bernoulli trials:

> bernoulli p a b = P [(a, p), (b, 1-p)]Now the branch at the root of the tree:

> root = bernoulli 0.3 False TrueWe can compute the entropy for the distrbution on the leaves:

> test1 = entropy0 $ do > x <- root > if x > then bernoulli 0.2 3 4 > else bernoulli 0.4 5 6Or the sum of the root entropy and the expected subtree entropy:

> test2 = entropy0 root + (expectation0 $ do > x <- root > if x > then return $ entropy0 (bernoulli 0.2 3 4) > else return $ entropy0 (bernoulli 0.4 5 6))You can confirm for yourself that

`test1 == test2`.

We can rewrite that a little. We're drawing

`True`or

`False`from

`root`only to decide which distribution to use at the next stage. But we may as will pick the distribution itself at random. So define:

> dist = bernoulli 0.3 (bernoulli 0.4 5 6) (bernoulli 0.2 3 4)And now we expect the equality of

`test3`and

`test4`:

> test3 = entropy0 $ do > x <- dist > x > test4 = entropy0 dist + (expectation0 $ do > x <- dist > return $ entropy0 x)There's a more elegant way of writing this. Define:

> left0 dist = entropy0 (join dist) > right0 dist = entropy0 dist+expectation0 (fmap entropy0 dist)Now we expect

`left0 dist`and

`right0 dist`to always be equal. We've almost generalised to something that makes sense in the context of monads other than probability.

**The algebra of a monad**

Here are a couple of important properties of

`expectation0`:

1.

`expectation0 (return d) = d`

2.

`expectation0 (join d) = expectation0 (fmap expectation d)`

In English: the expectation of certainty is just the certain value, and the expectation of an expectation is just the expectation. But these rules are precisely the conditions that define an -algebra, where is a monad.

So let's define a type class:

> class Algebra m a | m -> a where > expectation :: m a -> aWe'll assume that when

`m`is a monad, any instance satisfies the two laws above. Here's the instance for probability:

> instance Algebra P Float where > expectation (P xs) = sum [x*p | (x, p) <- xs]In keeping with the notion that entropy measure diversity let's also define:

> class Diverse m r | m -> r where > entropy :: m x -> rwith the instance:

> instance Diverse P Float where > entropy (P xs) = -sum [if p==0 then 0 else p*log p/log 2.0 | (_, p) <- xs]It's not clear what laws we need but for now we'll assume a generalised entropy satisfies

`left dist == right dist`:

> left dist = entropy (join dist) > right dist = entropy dist+expectation (fmap entropy dist)We'll call that the generalised grouping law.

**Binary trees**

It's not hard to find other structures that satisfy these laws if we cheat and use alternative structures to represent probabilities. For example We can make

`Tree`an instance by assuming

`Fork`represents a 50/50 chance of going one way or another:

> data Tree a = Leaf a | Fork (Tree a) (Tree a) deriving Show > instance Functor Tree where > fmap f (Leaf a) = Leaf (f a) > fmap f (Fork l r) = Fork (fmap f l) (fmap f r) > instance Monad Tree where > return x = Leaf x > Leaf a >>= f = f a > Fork l r >>= f = Fork (l >>= f) (r >>= f) > instance Algebra Tree Float where > expectation (Leaf a) = a > expectation (Fork l r) = 0.5*expectation l+0.5*expectation r > instance Diverse Tree Float where > entropy (Leaf a) = 0 > entropy (Fork l r) = 1+0.5*entropy l+0.5*entropy r

**Lists**

We could make non-empty lists into an instance by assuming a uniform distribution on the list. But another way to measure the diversity is simply to count the elements. We subtract one so that

`[x]`corresponds to diversity zero. This subtraction gives us a non-trivial instance:

> newtype L a = L [a] deriving (Show, Monad, Functor) > instance Algebra L Int where > expectation (L xs) = sum xs > instance Diverse L Int where > entropy (L xs) = length xs-1

**Tsallis entropy**

There are measures of diversity for probability distributions that are distinct from Shannon entropy. An example is Tsallis entropy. At this point I'd like a family of types parametrised by reals but Haskell doesn't support dependent types. So I'll just fix a real number

`q`and we can define:

> q = 2.5 > data T a = T [(a, Float)] deriving Show > instance Functor T where > fmap f (T xs) = T [(f a, p) | (a, p) <- xs] > instance Monad T where > return x = T [(x, 1)] > T xss >>= f = T [(y, p*q) | (pxs, p) <- xss, let T ys = f pxs, (y, q) <- ys] > instance Algebra T Float where > expectation (T xs) = sum [x*p**q | (x, p) <- xs] > instance Diverse T Float where > entropy (T xs) = (1-sum [p**q | (_, p) <- xs])/(q-1)And again we find our generalised grouping rule for entropy holds.

**Operads**

This is all derived from Tom Leinster's post last year at the n-category cafe. As I talked about here there's a close relationship between monads and operads. Operads area a bit like container monads where the containers don't contain anything, but just have holes where contents could be placed. This makes operads a better place to work because you don't have the awkward issue I started with: having to disallow lists of value/probability pairs where the same value can appear more than once. Nonetheless, in (unrestricted) Haskell monads you don't have

`Eq`available so you can't actually have definitions of

`return`or

`>>=`that can notice the equality of two elements. If such definitions were possible, the grouping law would no longer work as stated above.

**Crossed homomorphisms**

The generalised grouping law even makes sense for very different monads. For the Reader monad the law gives the definition of a crossed homomorphism. It's pretty weird seeing a notion from group cohomology emerge like this and I recommend skipping to the final section unless you care about this sort of thing. But if you do, this is related to research I did a long time ago. This is to test that the Schwarzian derivative really does give rise to a crossed homomorphism.

Firstly let me set up some automatic differentiation code:

> data D a = D { re::a, im::a } deriving (Show, Ord, Eq) > instance Num a => Num (D a) where > fromInteger n = D (fromInteger n) 0 > D a a'+D b b' = D (a+b) (a'+b') > D a a'*D b b' = D (a*b) (a*b'+a'*b) > D a a'-D b b' = D (a-b) (a'-b') > instance Fractional a => Fractional (D a) where > fromRational n = D (fromRational n) 0 > D a a'/D b b' = let q = 1/b in D (a*q) ((-a*b'+a'*b)*q*q) > lift x = D x 0 > d f x = im (f (D x 1)) > raised f = re . f . lift > raised2 = raised . raised > raised3 = raised2 . raisedThe

`Cn`are the

`n`-times (automatically) differentiable functions. Unfortunately the

`Endo`defined in

`Data.Monoid`acts the wrong way round from what I want so I need a

`Dual`:

> type C1 = Dual (Endo (D Double)) > type C3 = Dual (Endo (D (D (D Double)))) > type C4 = Dual (Endo (D (D (D (D Double))))) > instance Eq (Endo (D Double)) > instance Ord (Endo (D Double))A silly

`Show`instance that simply evaluates a function at a number I chose randomly: 1.234.

> instance Show (Endo (D Double)) where > show (Endo f) = show (f 1.234) > instance Num C1 where > fromInteger n = Dual (Endo (\x -> fromInteger n)) > Dual (Endo f)+Dual (Endo g) = Dual (Endo (\x -> f x + g x)) > Dual (Endo f)-Dual (Endo g) = Dual (Endo (\x -> f x - g x)) > Dual (Endo f)*Dual (Endo g) = Dual (Endo (\x -> f x * g x)) > instance Fractional C1 where > fromRational n = Dual (Endo (\x -> fromRational n)) > Dual (Endo f)/Dual (Endo g) = Dual (Endo (\x -> f x / g x)) > newtype Q a = Q (Writer C4 a) deriving (Monad, Functor)We can give

`Q a`a geometrical interpretation. The underlying type is a pair

`(a, C4)`. If we think of elements of

`C4`as charts charts on a piece of Riemann surface then for any , an element of

`(a, C4)`represents a local piece of a section of the th tensor power of the canonical bundle. Ie. we can think of it as representing . I'll concentrate on the case which gives quadratic differentials. We can think of an element of

`((a, C4), C4)`as forms where we're composing two charts. We can collapse down to an ordinary chart by using the chain rule. Here's the code:

> instance Algebra Q C1 where > expectation (Q ma) = let (Dual (Endo a), Dual (Endo f)) = runWriter ma > in Dual (Endo (\x -> a (raised3 f x)*(raised2 (d f) x)^2))Now we can define the Schwarzian derivative:

> schwarzian f x = let f0 = raised3 f x > f1 = raised2 (d f) x > f2 = raised (d $ d f) x > f3 = (d $ d $ d f) x > in f3/f1-1.5*(f2/f1)^2And somwehat bizarrely, we now have a generalised entropy:

> instance Diverse Q C1 where > entropy (Q ma) = let (_, Dual (Endo f)) = runWriter ma > in Dual (Endo (\x -> schwarzian f x))This is the construction that gives rise to the Virasoro algebra which plays such an important role in String Theory.

**Some tests**

And here's a bunch of tests. I'd have used

`QuickCheck`but it won't install for me today...

> test :: (Algebra m t, Diverse m t, Num t, Functor m, Monad m) => m (m x) -> IO () > test x = do > print (left x, right x) > main = do > test $ L [L [1, 2, 3], L [2, 3, 4], L [1], L [5], L [2, 7::Int]] > test $ P [(P [(0, 0.5), (1, 0.5)], 0.5), (P [(2, 0.5), (3::Int, 0.5)], 0.5::Float)] > test $ T [(T [(0, 0.5), (1, 0.5)], 0.5), (T [(2, 0.5), (3::Int, 0.5)], 0.5::Float)] > test $ Leaf (Leaf 1 `Fork` Leaf 2) `Fork` Leaf (Leaf 3 `Fork` (Leaf 4 `Fork` Leaf 5)) > test $ (Q (writer > (Q (writer (Dual (Endo (\x -> x)), > Dual (Endo (\x -> x^2+1)))), > Dual (Endo (\x -> (2+x)/(3+x*x))))) :: Q (Q C3))