A Neighborhood of Infinity: 2016

Sunday, October 16, 2016

Expectation-Maximization with Less Arbitrariness

Introduction

Google have stopped supporting the Chart API so all of the mathematics notation below is missing. There is a PDF version of this article at GitHub.

There are many introductions to the Expectation-Maximisation algorithm. Unfortunately every one I could find uses arbitrary seeming tricks that seem to be plucked out of a hat by magic. They can all be justified in retrospect, but I find it more useful to learn from reusable techniques that you can apply to further problems. Examples of tricks I've seen used are:

Using Jensen's inequality. It's easy to find inequalities that apply in any situation. But there are often many ways to apply them. Why apply it to this way of writing this expression and not that one which is equal?
Substituting $1=A/A$ in the middle of an expression. Again, you can use $1=A/A$ just about anywhere. Why choose this $A$ at this time? Similarly I found derivations that insert a $B-B$ into an expression.
Majorisation-Minimisation. This is a great technique, but involves choosing a function that majorises another. There are so many ways to do this, it's hard to imagine any general purpose method that tells you how to narrow down the choice.

My goal is to fill in the details of one key step in the derivation of the EM algorithm in a way that makes it inevitable rather than arbitrary. There's nothing original here, I'm merely expanding on a stackexchange answer.

Generalities about EM

The EM algorithm seeks to construct a maximum likelihood estimator (MLE) with a twist: there are some variables in the system that we can't observe.

First assume no hidden variables. We assume there is a vector of parameters $\theta=(\theta_i)$ that defines some model. We make some observations $x=(x_j)$. We have a probability density $P(x|\theta)$ that depends on $\theta$. The likelihood of $\theta$ given the observations $x$ is $l(\theta|x)=P(x|\theta)$. The maximum likelhood estimator for $\theta$ is the choice of $\theta$ that maximises $l(\theta|x)$ for the $x$ we have observed.

Now suppose there are also some variables $z=(z_k)$ that we didn't get to observe. We assume a density $P(x,z|\theta)$. We now have

$P(x|\theta)=\sum_z P(x,z|\theta)$

where we sum over all possible values of $z$. The MLE approach says we now need to maximise

$l(\theta|x)=\sum_z P(x,z|\theta).$

One of the things that is a challenge here is that the components of $\theta$ might be mixed up among the terms in the sum. If, instead, each term only referred to its own unique block of $\theta_i$, then the maximisation would be easier as we could maximise each term independently of the others. Here's how we might move in that direction. Consider instead the log-likelihood

$\log l(\theta|x)=\log\sum_z P(x,z|\theta).$

Now imagine that by magic we could commute the logarithm with the sum. We'd need to maximise

$\sum_z \log P(x,z|\theta).$

One reason this would be to our advantage is that $P(x,z|\theta)$ often takes the form $\exp(f(x,z,\theta))$ where $f$ is a simple function to optimise. In addition, $f$ may break up as a sum of terms, each with its own block of $\theta_i$'s. Moving the logarithm inside the sum would give us something we could easily maximise term by term. What's more, the $P(x,z|\theta)$ for each $z$ is often a standard probability distribution whose likelihood we already know how to maximise. But, of course, we can't just move that logarithm in.

Maximisation by proxy

Sometimes a function is too hard to optimise directly. But if we have a guess for an optimum, we can replace our function with a proxy function that approximates it in the neighbourhood of our guess and optimise that instead. That will give us a new guess and we can continue from there. This is the basis of gradient descent. Suppose $f$ is a differentiable function in a neighbourhood of $x_0$. Then around $x_0$ we have

$f(x) \approx f(x_0) f'(x_0)\cdot (x-x_0).$

We can try optimising $f(x_0) f'(x_0)\cdot (x-x_0)$ with respect to $x$ within a neighbourhood of $x_0$. If we pick a small circular neighbourhood then the optimal value will be in the direction of steepest descent. (Note that picking a circular neighbourhood is itself a somewhat arbitrary step, but that's another story.) For gradient descent we're choosing $f(x_0) f'(x_0)\cdot (x-x_0)$ because it matches both the value and derivatives of $f$ at $x_0$. We could go further and optimise a proxy that shares second derivatives too, and that leads to methods based on Newton-Raphson iteration.

We want our logarithm of a sum to be a sum of logarithms. But instead we'll settle for a proxy function that is a sum of logarithms. We'll make the derivatives of the proxy match those of the original function precisely so we're not making an arbitrary choice.

Write

$\log l(\theta|x) = \log\sum_z P(x,z|\theta) \approx \sum_z\beta_z\log P(x,z|\theta) \text{constant}.$

The $\beta_z$ are constants we'll determine. We want to match the derivatives on either side of the $\approx$ at $\theta=\theta_0$:

$\frac{\partial \log l(\theta_0|x)}{\partial\theta_0}$ $=\frac{1}{l(\theta_0|x)} \frac{\partial l(\theta_0|x)}{\partial\theta_0} =\sum_z\frac{1}{l(\theta_0|x)} \frac{\partial P(x,z|\theta_0)}{\partial\theta_0}.$

On the other hand we have

$\frac{\partial}{\partial\theta_0}\sum_z\beta_z\log P(x,z|\theta_0) =\sum_z\beta_z\frac{1}{P(x,z|\theta_0)}\frac{\partial P(x,z|\theta_0)}{\partial\theta_0}$

To achieve equality we want to make these expressions match. We choose

$\beta_z = \frac{P(x,z|\theta_0)}{l(\theta_0|x)} = \frac{P(x,z|\theta_0)}{P(x|\theta_0)} = P(z|x,\theta_0).$

Our desired proxy function is:

$\sum_z P(z|x,\theta_0)\log P(x,z|\theta) + \text{const.} = E_{Z|x,\theta_0}(\log P(x,Z|\theta)) + \text{const.}$

So the procedure is to take an estimated $\theta_0$ and obtain a new estimate by optimising this proxy function with respect to $\theta$. This is the standard EM algorithm.

It turns out that this proxy has some other useful properties. For example, because of the concavity of the logarithm, the proxy is always smaller than the original likelihood. This means that when we optimise it we never optimise ``too far'' and that progress optimising the proxy is always progress optimising the original likelihood. But I don't need to say anything about this as it's all part of the standard literature.

Afterword

As a side effect we have a general purpose optimisation algorithm that has nothing to do with statistics. If your goal is to compute

$\operatorname{argmax}_x\sum_i\exp(f_i(x))$

you can iterate, at each step computing

$\operatorname{argmax}_x\sum_i\exp(f_i(x_0))f_i(x)$

where $x_0$ is the previous iteration. If the $f_i$ take a convenient form then this may turn out to be much easier.

Note

This was originally written as a PDF using LaTeX. It'll be available here for a while. Some fidelity was lost when converting it to HTML.

Saturday, August 06, 2016

Dimensionful Matrices

Introduction

Programming languages and libraries for numerical work tend not to place a lot of emphasis on the types of their data. For example Matlab, R, Octave, Fortran, and Numpy (but not the now defunct Fortress) all tend to treat their data as plain numbers meaning that any time you have a temperature and a mass, say, there is nothing to prevent you adding them.

I've been wondering how much dimensions (in the sense of dimensional analysis) and units could help with numerical programming. As I pointed out on G+ recently (which is where I post shorter stuff these days), you don't have to limit dimensions to the standard ones of length, mass, time, dollars and so on. Any scale invariance in the equations you're working with can be exploited as a dimension giving you a property that can be statically checked by a compiler.

There are quite a few libraries to statically check dimensions and units now. For example Boost.Units for C++, units for Haskell and even quantities for Idris.

A matrix that breaks things

Even if a language supports dimensions, it's typical to define objects like vectors and matrices as homogeneous containers of quantities. But have a look at the Wikipedia page on the metric tensor. There is a matrix

$\eta=\begin{pmatrix}-c^2&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\end{pmatrix}$

which has the curious property that 3 entries on the diagonal seem to be dimensionless while the first entry is a squared velocity with dimension $L^2T^{-2}$ . This will break many libraries that support units. An obvious workaround is to switch to use natural units, which is much the same as abandoning the usefulness of dimensions. But there's another way, even if it may be tricky to set up with existing languages.

Heterogeneous vectors and matrices

According to a common convention in physics, a 4-vector $x=(t=x^0,x^1,x^2,x^3)$ has dimensions $[x]=(T, L, L, L)$ where I'm using the convention that we can represent the units of a vector or matrix simply as a vector or matrix of dimensions, and here $T$ is time and $L$ is length. The metric tensor is used like this: $ds^2=x^i\eta_{ij}x^j$ (where I'm using the Einstein summation convention so the $i$ 's and $j$ 's are summed over). If we think of $ds^2$ having units of length squared (it is a pseudo-Riemannian metric after all) then it makes sense to think of $\eta$ having dimensions given by

$[\eta]=\begin{pmatrix}L^2T^{-2}&LT^{-1}&LT^{-1}&LT^{-1}\\LT^{-1}&1&1&1\\LT^{-1}&1&1&1\\LT^{-1}&1&1&1\end{pmatrix}$

We can write this more succinctly as

$[\eta]=(LT^{-1},1,1,1)\otimes(LT^{-1},1,1,1)$

where $\otimes$ is the usual outer product.

I'll use the notation $a:A$ to mean $a$ is of type $A$ . So, for example, $(t,x^1,x^2,x^3):(T,L,L,L)$ . I'll also use pointwise notation for types such as $(A,B,C,D)*(E,F,G,H)=(AE,BF,CG,DH)$ and $(A,B,C)^{-1}=(A^{-1},B^{-1},C^{-1})$ .

Now I can give some general rules. If $m:M$ is a matrix, $x:X$ and $y:Y$ are vectors, and $s:S$ is a scalar, then $y=mx$ only makes sense if $M=Y\otimes X^{-1}$ . Similarly the "inner product" $x^Tmy=s$ only makes sense if $M=X^{-1}\otimes Y^{-1}S$ .

Generic vectors and matrices

Although these kinds of types might be useful if you're dealing with the kind of heterogeneous matrices that appear in relativity, there's another reason they might be useful. If you write code (in the imaginary language that supports these structures and understands dimensions and units) to be as generic as possible in the types of the vector and matrix entries, failures to type check will point out parts of the code where there are hidden assumptions, or even errors, about scaling. For example, consider a routine to find the inverse of a 3 by 3 matrix. Writing this generically as possible means we should write it to operate on a matrix of type $(A,B,C)\otimes(D,E,F)$ , say. The result should have type $(D,E,F)^{-1}\otimes(A,B,C)^{-1}$ . If this type checks when used with a suitably powerful type checker then it means that if we replace the units for type A, say, with units twice as large, it should have no effect on the result, taking into account those units. In this case, it means that if we multiply the numbers of the first row of the input by 0.5 then the numbers of the first column of the output should get multiplied by 2. In fact this is a basic property of matrix inverses. In other words, this mathematical property of matrix inverses is guaranteed by a type system that can handle units and heterogeneous matrices. It would be impossible to write a matrix inverter that type checks and fails to have this property. Unfortunately it's still possible to write a matrix inverter that type checks and is incorrect some other way. Nonetheless this kind of type system would put a very big constraint on the code and is likely to eliminate many sources of error.

An example, briefly sketched

I thought I'd look at an actual example of a matrix inverter to see what would happen if I used a type checker like the one I've described. I looked at the conjugate gradient method. At the Wikipedia page, note the line

$\alpha_k:=\frac{\mathbf{r}_k^T\mathbf{r}_{k+1}}{\mathbf{p}_k^T\mathbf{Ap}_k}$

This would immediately fail to type check because if $r$ is of generic vector type $R=(R_1,\ldots,R_n)$ then $R_1^2$ isn't the same type as $R_2^2$ so they can't be added. I won't go into any of the details but the easiest way to patch up this code to make it type check is to introduce a new matrix $P$ of type $R^{-1}\otimes R^{-1}$ and besides using it to make this inner product work (replacing the numerator by $\mathbf{r}_k^T P\mathbf{r}_{k+1}$ ) we also use $P$ anywhere in the code we need to convert a vector of type $R$ to a vector of type $R^{-1}$ . If you try to do this as sparingly as possible you'll end up with a modified algorithm. But at first this seems weird. Why should this matrix inverse routine rely on someone passing in a second matrix to make it type check? And what is this new algorithm anyway? Well scroll down the Wikipedia page and you get to the preconditioned conjugate gradient algorithm. The extra matrix we need to pass in is the preconditioner. This second algorithm would type check. Preconditioned conjugate gradient, with a suitable preconditioner, generally performs better than pure conjugate gradient. So in this case we're getting slightly more than a check on our code's correctness. The type checker for our imaginary language would give a hint on how to make the code perform better. There's a reason for this. The original conjugate gradient algorithm is implicitly making a choice of units that sets scales along the axes. These determine the course taken by the algorithm. It's not at all clear that picking these scalings randomly (which is in effect what you're doing if you throw a random problem at the algorithm) is any good. It's better to pick a preconditioner adapted to the scale of the problem and the type checker is hinting (or would be if it existed) that you need to do this. Compare with the gradient descent algorithm whose scaling problems are better known.