A Neighborhood of Infinity: Expectation-Maximization with Less Arbitrariness

Introduction

Google have stopped supporting the Chart API so all of the mathematics notation below is missing. There is a PDF version of this article at GitHub.

There are many introductions to the Expectation-Maximisation algorithm. Unfortunately every one I could find uses arbitrary-seeming tricks that appear to be plucked out of a hat by magic. They can all be justified in retrospect, but I find it more useful to learn from reusable techniques that you can apply to further problems. Examples of tricks I've seen used are:

Using Jensen's inequality. It's easy to find inequalities that apply in any situation. But there are often many ways to apply them. Why apply it to this way of writing this expression and not that one which is equal?
Substituting $1=A/A$ in the middle of an expression. Again, you can use $1=A/A$ just about anywhere. Why choose this $A$ at this time? Similarly I found derivations that insert a $B-B$ into an expression.
Majorisation-Minimisation. This is a great technique, but involves choosing a function that majorises another. There are so many ways to do this, it's hard to imagine any general purpose method that tells you how to narrow down the choice.

My goal is to fill in the details of one key step in the derivation of the EM algorithm in a way that makes it inevitable rather than arbitrary. There's nothing original here, I'm merely expanding on a stackexchange answer.

Generalities about EM

The EM algorithm seeks to construct a maximum likelihood estimator (MLE) with a twist: there are some variables in the system that we can't observe.

First assume no hidden variables. We assume there is a vector of parameters $\theta=(\theta_i)$ that defines some model. We make some observations $x=(x_j)$. We have a probability density $P(x|\theta)$ that depends on $\theta$. The likelihood of $\theta$ given the observations $x$ is $l(\theta|x)=P(x|\theta)$. The maximum likelihood estimator for $\theta$ is the choice of $\theta$ that maximises $l(\theta|x)$ for the $x$ we have observed.

Now suppose there are also some variables $z=(z_k)$ that we didn't get to observe. We assume a density $P(x,z|\theta)$. We now have

$P(x|\theta)=\sum_z P(x,z|\theta)$

where we sum over all possible values of $z$. The MLE approach says we now need to maximise

$l(\theta|x)=\sum_z P(x,z|\theta).$

One challenge here is that the components of $\theta$ might be mixed up among the terms in the sum. If, instead, each term only referred to its own unique block of $\theta_i$, then the maximisation would be easier as we could maximise each term independently of the others. Here's how we might move in that direction. Consider instead the log-likelihood

$\log l(\theta|x)=\log\sum_z P(x,z|\theta).$

Now imagine that by magic we could commute the logarithm with the sum. We'd need to maximise

$\sum_z \log P(x,z|\theta).$

One reason this would be to our advantage is that $P(x,z|\theta)$ often takes the form $\exp(f(x,z,\theta))$ where $f$ is a simple function to optimise. In addition, $f$ may break up as a sum of terms, each with its own block of $\theta_i$'s. Moving the logarithm inside the sum would give us something we could easily maximise term by term. What's more, the $P(x,z|\theta)$ for each $z$ is often a standard probability distribution whose likelihood we already know how to maximise. But, of course, we can't just move that logarithm in.
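To make the difference concrete, here is a minimal sketch under assumptions that are not in the original post: a toy model in which each observation has a hidden label $z\in\{0,1\}$ chosen with probability $1/2$, the observation is then drawn from a unit-variance normal with mean `theta[z]`, and the observations are independent (so the log-likelihood is a sum over observations). The names `joint`, `log_of_sum` and `sum_of_logs` are illustrative only.

```python
import math

def joint(x, z, theta):
    """Assumed toy model: P(x, z | theta) = 1/2 * Normal(x; mean=theta[z], sd=1)."""
    return 0.5 * math.exp(-0.5 * (x - theta[z]) ** 2) / math.sqrt(2 * math.pi)

def log_of_sum(xs, theta):
    # True log-likelihood: for each observation, the log of a sum over hidden z.
    return sum(math.log(joint(x, 0, theta) + joint(x, 1, theta)) for x in xs)

def sum_of_logs(xs, theta):
    # "Magic" version with the logarithm moved inside the sum over z.
    # Each term is a quadratic in a single theta[z], so it splits into blocks.
    return sum(math.log(joint(x, z, theta)) for x in xs for z in (0, 1))

xs = [-2.1, -1.9, 3.0, 3.2]   # made-up observations
theta = [-1.0, 1.0]           # made-up parameter guess
true_value = log_of_sum(xs, theta)
magic_value = sum_of_logs(xs, theta)
```

In this toy model `sum_of_logs` is a sum of quadratics, one block per `theta[z]`, which is trivial to maximise, whereas `log_of_sum` couples the two means. The two quantities are different numbers, which is why the swap cannot be done for free.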

Maximisation by proxy

Sometimes a function is too hard to optimise directly. But if we have a guess for an optimum, we can replace our function with a proxy function that approximates it in the neighbourhood of our guess and optimise that instead. That will give us a new guess and we can continue from there. This is the basis of gradient descent. Suppose $f$ is a differentiable function in a neighbourhood of $x_0$. Then around $x_0$ we have

$f(x) \approx f(x_0)+f'(x_0)\cdot (x-x_0).$

We can try optimising $f(x_0)+f'(x_0)\cdot (x-x_0)$ with respect to $x$ within a neighbourhood of $x_0$. If we pick a small circular neighbourhood then the optimal value will be in the direction of steepest descent. (Note that picking a circular neighbourhood is itself a somewhat arbitrary step, but that's another story.) For gradient descent we're choosing $f(x_0)+f'(x_0)\cdot (x-x_0)$ because it matches both the value and the first derivatives of $f$ at $x_0$. We could go further and optimise a proxy that shares second derivatives too, and that leads to methods based on Newton-Raphson iteration.
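The "optimise the linear proxy on a small ball" step can be sketched as follows. This is an illustration rather than a recommended optimiser: the test function, the starting point, and the shrinking-radius schedule are all assumptions made for the example.

```python
import math

def proxy_step(grad, x0, radius):
    """Minimise the linear proxy f(x0) + grad(x0).(x - x0) over |x - x0| <= radius.

    The minimiser sits on the boundary of the ball, in the direction
    opposite to the gradient (steepest descent).
    """
    g = grad(x0)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(x0)  # the proxy is flat, so x0 is a stationary point
    return [xi - radius * gi / norm for xi, gi in zip(x0, g)]

def grad_f(x):
    # Gradient of the assumed test function f(x, y) = (x - 1)^2 + 2 (y + 3)^2.
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)]

x = [0.0, 0.0]
radius = 0.5
for _ in range(200):
    x = proxy_step(grad_f, x, radius)
    radius *= 0.95  # assumed schedule: trust the proxy on ever smaller balls
```

The radius has to shrink because the linear proxy is only trustworthy near $x_0$. With a fixed radius the iterates would keep hopping around the minimum at a distance of roughly one radius.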

We want our logarithm of a sum to be a sum of logarithms. But instead we'll settle for a proxy function that is a sum of logarithms. We'll make the derivatives of the proxy match those of the original function precisely so we're not making an arbitrary choice.

Write

$\log l(\theta|x) = \log\sum_z P(x,z|\theta) \approx \sum_z\beta_z\log P(x,z|\theta)+\mbox{constant}.$

The $\beta_z$ are constants we'll determine. We want to match the derivatives on either side of the $\approx$ at $\theta=\theta_0$:

$\frac{\partial \log l(\theta_0|x)}{\partial\theta_0} =\frac{1}{l(\theta_0|x)} \frac{\partial l(\theta_0|x)}{\partial\theta_0} =\sum_z\frac{1}{l(\theta_0|x)} \frac{\partial P(x,z|\theta_0)}{\partial\theta_0}.$

On the other hand we have

$\frac{\partial}{\partial\theta_0}\sum_z\beta_z\log P(x,z|\theta_0) =\sum_z\beta_z\frac{1}{P(x,z|\theta_0)}\frac{\partial P(x,z|\theta_0)}{\partial\theta_0}.$

To make these two expressions equal, it suffices to match them term by term in the sum over $z$, that is, to have $\beta_z/P(x,z|\theta_0)=1/l(\theta_0|x)$. We choose

$\beta_z = \frac{P(x,z|\theta_0)}{l(\theta_0|x)} = \frac{P(x,z|\theta_0)}{P(x|\theta_0)} = P(z|x,\theta_0).$
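The derivative matching can be checked numerically. The sketch below assumes a toy model that is not in the original post (one observation, hidden label $z\in\{0,1\}$ with probability $1/2$ each, unit-variance normal with mean `theta[z]`) and compares central finite-difference gradients of the log-likelihood and of the proxy at $\theta_0$. All function names are illustrative.

```python
import math

def joint(x, z, theta):
    """Assumed toy model: P(x, z | theta) = 1/2 * Normal(x; mean=theta[z], sd=1)."""
    return 0.5 * math.exp(-0.5 * (x - theta[z]) ** 2) / math.sqrt(2 * math.pi)

def log_lik(x, theta):
    return math.log(joint(x, 0, theta) + joint(x, 1, theta))

def posterior(x, theta0):
    # beta_z = P(x, z | theta0) / l(theta0 | x) = P(z | x, theta0)
    p = [joint(x, z, theta0) for z in (0, 1)]
    total = p[0] + p[1]
    return [p[0] / total, p[1] / total]

def proxy(x, theta, beta):
    # sum_z beta_z * log P(x, z | theta), with beta held fixed
    return sum(beta[z] * math.log(joint(x, z, theta)) for z in (0, 1))

def num_grad(f, theta, h=1e-6):
    # Central finite differences, one coordinate at a time.
    grads = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += h
        dn = list(theta)
        dn[i] -= h
        grads.append((f(up) - f(dn)) / (2 * h))
    return grads

x, theta0 = 0.7, [-1.0, 2.0]   # made-up observation and guess
beta = posterior(x, theta0)
g_true = num_grad(lambda t: log_lik(x, t), theta0)
g_proxy = num_grad(lambda t: proxy(x, t, beta), theta0)
```

Up to finite-difference error, `g_true` and `g_proxy` agree at `theta0`. Away from `theta0` the two gradients generally differ, because `beta` stays frozen at the old guess.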

Our desired proxy function is:

$\sum_z P(z|x,\theta_0)\log P(x,z|\theta)+\mbox{const.} = E_{Z|x,\theta_0}(\log P(x,Z|\theta))+\mbox{const.}$

So the procedure is to take an estimated $\theta_0$ and obtain a new estimate by optimising this proxy function with respect to $\theta$. This is the standard EM algorithm.
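As a minimal sketch of the whole loop, assume a toy two-component mixture that is not in the original post: independent observations, hidden labels $z\in\{0,1\}$ with probability $1/2$ each, and unit-variance normals with unknown means `theta[0]` and `theta[1]`. For this model the proxy is a sum of weighted quadratics, so its maximiser is a weighted mean. The data and starting guess are made up.

```python
import math

def joint(x, z, theta):
    """Assumed toy model: P(x, z | theta) = 1/2 * Normal(x; mean=theta[z], sd=1)."""
    return 0.5 * math.exp(-0.5 * (x - theta[z]) ** 2) / math.sqrt(2 * math.pi)

def log_likelihood(xs, theta):
    return sum(math.log(joint(x, 0, theta) + joint(x, 1, theta)) for x in xs)

def em_step(xs, theta0):
    # E-step: beta[j][z] = P(z | x_j, theta0), computed at the old guess.
    beta = []
    for x in xs:
        p = [joint(x, z, theta0) for z in (0, 1)]
        total = p[0] + p[1]
        beta.append([p[0] / total, p[1] / total])
    # M-step: maximise sum_j sum_z beta[j][z] * log P(x_j, z | theta).
    # Each theta[z] appears only in its own terms, and the maximiser is
    # the beta-weighted mean of the observations.
    return [
        sum(b[z] * x for b, x in zip(beta, xs)) / sum(b[z] for b in beta)
        for z in (0, 1)
    ]

xs = [-2.1, -1.9, -2.0, 3.0, 3.2, 2.8]   # made-up data with two clumps
theta = [-1.0, 1.0]                      # made-up starting guess
history = [log_likelihood(xs, theta)]
for _ in range(30):
    theta = em_step(xs, theta)
    history.append(log_likelihood(xs, theta))
```

On this data the two means move to the centres of the two clumps, and the values recorded in `history` never decrease from one iteration to the next.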

It turns out that this proxy has some other useful properties. For example, because of the concavity of the logarithm, the proxy (with the constant chosen so that the two agree at $\theta_0$) is never larger than the original log-likelihood. This means that when we optimise it we never optimise "too far", and that progress optimising the proxy is always progress optimising the original likelihood. But I don't need to say anything about this as it's all part of the standard literature.
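For readers who want the bound spelled out, here is a sketch of the standard argument. It is not part of the original derivation, and it assumes $P(x,z|\theta_0)>0$ for every $z$. With $\beta_z=P(x,z|\theta_0)/l(\theta_0|x)$, the ratio of likelihoods is a $\beta$-weighted average, and the concavity of the logarithm does the rest:

```latex
\begin{align*}
\log\frac{l(\theta|x)}{l(\theta_0|x)}
  &= \log\sum_z \frac{P(x,z|\theta_0)}{l(\theta_0|x)}\cdot\frac{P(x,z|\theta)}{P(x,z|\theta_0)}
   = \log\sum_z \beta_z\,\frac{P(x,z|\theta)}{P(x,z|\theta_0)} \\
  &\ge \sum_z \beta_z\log\frac{P(x,z|\theta)}{P(x,z|\theta_0)}
   = \sum_z \beta_z\log P(x,z|\theta) - \sum_z \beta_z\log P(x,z|\theta_0).
\end{align*}
```

So the constant that makes the proxy touch the log-likelihood at $\theta_0$ is $\log l(\theta_0|x)-\sum_z\beta_z\log P(x,z|\theta_0)$, and with that constant the proxy lies on or below $\log l(\theta|x)$ everywhere, with equality at $\theta=\theta_0$.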

Afterword

As a side effect we have a general purpose optimisation algorithm that has nothing to do with statistics. If your goal is to compute

$\mbox{argmax}_x\sum_i\exp(f_i(x))$

you can iterate, at each step computing

$\mbox{argmax}_x\sum_i\exp(f_i(x_0))f_i(x)$

where $x_0$ is the previous iteration. If the $f_i$ take a convenient form then this may turn out to be much easier.
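Here is a small sketch of that iteration for an assumed family $f_i(x)=c_i-(x-a_i)^2/2$, chosen because the inner argmax then has a closed form (a weighted mean of the $a_i$). The values of `a`, `c` and the starting point are made up for illustration.

```python
import math

# Assumed example: f_i(x) = c_i - (x - a_i)^2 / 2, not taken from the post.
a = [0.0, 4.0]
c = [0.0, 1.0]

def f(i, x):
    return c[i] - 0.5 * (x - a[i]) ** 2

def objective(x):
    # The function we actually want to maximise: sum_i exp(f_i(x)).
    return sum(math.exp(f(i, x)) for i in range(len(a)))

def step(x0):
    # Weights exp(f_i(x0)) are frozen at the previous iterate.
    w = [math.exp(f(i, x0)) for i in range(len(a))]
    # argmax_x sum_i w_i * f_i(x): setting the derivative sum_i w_i * (a_i - x)
    # to zero gives the weighted mean of the a_i.
    return sum(wi * ai for wi, ai in zip(w, a)) / sum(w)

x = 1.5
values = [objective(x)]
for _ in range(50):
    x = step(x)
    values.append(objective(x))
```

With this starting point the iteration settles on the smaller peak near $x=0$ rather than the taller one near $x=4$, so, like EM, it only promises a local maximum. The objective values in `values` never decrease along the way.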

Note

This was originally written as a PDF using LaTeX. It'll be available here for a while. Some fidelity was lost when converting it to HTML.

3 Comments:

sigfpe said...: Although I use the example of steepest ascent (or descent) to motivate EM, there's an interesting difference pointed out to me by a work colleague.

When using steepest ascent you're using the fact that the linear proxy function matches the original function in a small region. So when you maximise the proxy you need to perform a maximisation in a small region. This is essentially why we typically take small step sizes in the steepest ascent algorithm. This means that steepest ascent can get stuck in local maxima.

In the case of EM we similarly ensure that the proxy matches the true objective locally in a small region. However, the concavity of the log function means that the proxy is always less than or equal to the original function. As a result, we don't have to be conservative. Globally maximising the proxy is guaranteed to be safe. Because EM isn't restricted to small steps it can sometimes make big jumps from one local maximum to another. That doesn't mean it'll always find the global maximum of your likelihood. But it is a qualitative difference from steepest ascent.

(Pure Newton-Raphson can also make big jumps. But, unmodified, it's not always a good algorithm because there are no guarantees that the quadratic proxy is always less than the true objective function.); Monday, 17 October, 2016
Ingo Blechschmidt said...: Thank you for sharing this insightful observation!

There is a small typo in the formula for the linear approximation. The derivative has to be multiplied by $(x-x_0)$, not by $x$.; Tuesday, 18 October, 2016
Rosa Z. said...: Hi Dan... I recently came across an old article by you (from 2005) and am responding here instead of there (thinking you might not see it if I add this to an old blog post.) Anyway I came across your article b/c I'm searching for some very specific things, which I didn't find but thought you might be able to point me to...

I am not well-versed in logic, but I am wanting to find out if any of the formal alternative logics include a phenomenon where both A is greater than B (or, A includes B) AND, B is greater than A (or B includes A). This is something that Sir Geoffrey Vickers, in Value Systems & Social Process, describes as "chinese boxes"... and his example is that, from one perspective, science about human beings, is just one aspect of a much larger field of science.... whereas from another perspective, 'doing science' is just one part of what human beings do.

Anyway, I've come across that kind of relationship before, and am curious whether there is any kind of formal logic that explores that...

My second question: from the field of group facilitation... groups often get stuck in a polarity of "either A OR B", which we might depict as a line that includes A at one end and B at the other.... One way of expanding the conversation has been described as "exploring the emergent axis" which could be described as another line, that includes BOTH A and B as well as NEITHER A nor B...

and of course when there are two lines, in Euclidean geometry that defines a whole larger plane, so it greatly expands the field of possibilities under consideration...

just curious where I might look, for any work along these lines... thanks so much!; Thursday, 22 December, 2016

Sunday, October 16, 2016
