Expectation-Maximization with Less Arbitrariness
There are many introductions to the Expectation-Maximisation algorithm. Unfortunately every one I could find uses arbitrary-seeming tricks that appear to be plucked out of a hat by magic. They can all be justified in retrospect, but I find it more useful to learn reusable techniques that you can apply to further problems. Examples of tricks I've seen used are:
- Using Jensen's inequality (stated for reference just after this list). It's easy to find inequalities that apply in any situation, but there are often many ways to apply them. Why apply it to this way of writing the expression and not to another, equal way of writing it?
- Substituting $1 = \frac{q(z)}{q(z)}$ in the middle of an expression. Again, you can use $\frac{q(z)}{q(z)}$ just about anywhere. Why choose this $q$ at this time? Similarly I found derivations that insert a $\log \exp$ into an expression.
- Majorisation-Minimisation. This is a great technique, but involves choosing a function that majorises another. There are so many ways to do this, it's hard to imagine any general purpose method that tells you how to narrow down the choice.
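For reference, here is the form of Jensen's inequality that typically appears in these derivations (standard material, included here for convenience): if $q(z) \ge 0$ with $\sum_z q(z) = 1$, then by the concavity of the logarithm

$$\log \sum_z q(z)\, a_z \;\ge\; \sum_z q(z) \log a_z \qquad \text{for any } a_z > 0.$$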
Generalities about EM
The EM algorithm seeks to construct a maximum likelihood estimator (MLE) with a twist: there are some variables in the system that we can't observe.
First assume no hidden variables. We assume there is a vector of parameters $\theta$ that defines some model. We make some observations $x$. We have a probability density $p(x|\theta)$ that depends on $\theta$. The likelihood of $\theta$ given the observations $x$ is $L(\theta) = p(x|\theta)$. The maximum likelihood estimator for $\theta$ is the choice of $\theta$ that maximises $L(\theta)$ for the $x$ we have observed.
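As a quick worked example of this (my illustration, not part of the original text): suppose $x = (x_1, \dots, x_n)$ consists of $n$ independent samples from a normal distribution with unknown mean $\mu$ and unit variance. Then

$$L(\mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-(x_i - \mu)^2/2}, \qquad \frac{d}{d\mu} \log L(\mu) = \sum_{i=1}^{n} (x_i - \mu),$$

and setting the derivative to zero gives the familiar estimator $\hat\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$.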
Now suppose there are also some variables $z$ that we didn't get to observe. We assume a density $p(x, z|\theta)$. We now have

$$L(\theta) = p(x|\theta) = \sum_z p(x, z|\theta)$$

where we sum over all possible values of $z$. One of the things that is a challenge here is that the components of $\theta$ get tangled together in the sum: we have to maximise the logarithm of a sum, and that logarithm doesn't break the density into manageable pieces.

Now imagine that by magic we could commute the logarithm with the sum. We'd need to maximise

$$\sum_z \log p(x, z|\theta).$$

One reason this would be to our advantage is that $p(x, z|\theta)$ is often a product of simple factors, so each $\log p(x, z|\theta)$ splits into a sum of terms that can be optimised separately.
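To make that concrete, here is an illustrative example of my own (not from the original): take a two-component mixture with hidden label $z \in \{0, 1\}$ and parameters $\theta = (\pi_0, \pi_1, \mu_0, \mu_1)$, where

$$p(x, z|\theta) = \pi_z\, \frac{1}{\sqrt{2\pi}}\, e^{-(x - \mu_z)^2/2}.$$

Then $\log p(x, z|\theta) = \log \pi_z - (x - \mu_z)^2/2 - \log \sqrt{2\pi}$ separates neatly into terms involving $\pi_z$ and $\mu_z$ alone, whereas $\log \sum_z p(x, z|\theta)$ keeps all the parameters trapped together inside one logarithm.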
Maximisation by proxy
Sometimes a function is too hard to optimise directly.
But if we have a guess for an optimum, we can replace our function with a proxy function that approximates it in the neighbourhood of our guess and optimise that instead.
That will give us a new guess and we can continue from there.
This is the basis of gradient descent.
Suppose $f$ is a differentiable function in a neighbourhood of $x_0$. Then around $x_0$ we have

$$f(x) \approx f(x_0) + (x - x_0) \cdot \nabla f(x_0).$$

We can try optimising this linear proxy instead of $f$. The proxy has no optimum of its own, so we only trust it in a small neighbourhood of $x_0$: moving to the best point of the proxy within that neighbourhood means taking a small step along the gradient, and iterating this is gradient descent (or ascent, when maximising).
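Here is a minimal sketch of this in code (my own illustration; the target function, step size, and iteration count are arbitrary choices): each step optimises the linear proxy within a small neighbourhood, which amounts to a short step along the gradient.

```python
def grad_ascent(grad_f, x0, step=0.1, iters=100):
    """Maximise f by repeatedly optimising its linear proxy near the current guess."""
    x = x0
    for _ in range(iters):
        # Proxy around x: f(x) + (y - x) * grad_f(x). Its best point within a
        # small neighbourhood of x lies a short step along the gradient.
        x = x + step * grad_f(x)
    return x

# Example: maximise f(x) = -(x - 3)^2, whose maximum is at x = 3.
grad_f = lambda x: -2.0 * (x - 3.0)
print(grad_ascent(grad_f, x0=0.0))  # ~3.0
```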
We want our logarithm of a sum to be a sum of logarithms.
But instead we'll settle for a proxy function that is a sum of logarithms.
We'll make the derivatives of the proxy match those of the original function
precisely so we're not making an arbitrary choice.
Write

$$l(\theta) = \log L(\theta) = \log \sum_z p(x, z|\theta).$$

The derivative of this with respect to $\theta$ is

$$\frac{\partial l}{\partial \theta} = \frac{1}{p(x|\theta)} \sum_z \frac{\partial p(x, z|\theta)}{\partial \theta}.$$

On the other hand we have, for a candidate proxy $\sum_z q(z) \log p(x, z|\theta)$ built from some as-yet-unknown weights $q(z)$,

$$\frac{\partial}{\partial \theta} \sum_z q(z) \log p(x, z|\theta) = \sum_z \frac{q(z)}{p(x, z|\theta)} \frac{\partial p(x, z|\theta)}{\partial \theta}.$$

To achieve equality at our current estimate $\theta_0$ we want to make these expressions match term by term. We choose

$$q(z) = \frac{p(x, z|\theta_0)}{p(x|\theta_0)} = p(z|x, \theta_0).$$

Our desired proxy function is:

$$Q(\theta, \theta_0) = \sum_z p(z|x, \theta_0) \log p(x, z|\theta).$$
So the procedure is to take an estimated $\theta_0$ and obtain a new estimate $\theta_1$ by optimising this proxy function $Q(\theta, \theta_0)$ with respect to $\theta$. This is the standard EM algorithm: computing the weights $p(z|x, \theta_0)$ is the E-step and optimising the proxy is the M-step.
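To make the procedure concrete, here is a minimal sketch in code (my own illustration, not from the original: the model is a two-component Gaussian mixture with unit variances, fit to several independent observations, and all names and initial values are arbitrary choices). Computing the weights $p(z|x, \theta_0)$ is the E-step; maximising the proxy, which for this model has a closed form, is the M-step.

```python
import numpy as np

def em_mixture(x, iters=50):
    """EM for a two-component Gaussian mixture with unit variances.
    theta = (pi, mu[0], mu[1]): the weight of component 0 and the two means."""
    pi, mu = 0.5, np.array([x.min(), x.max()])  # crude initial guess theta_0
    for _ in range(iters):
        # E-step: q(z) = p(z | x, theta_0), the posterior over the hidden z.
        w0 = pi * np.exp(-0.5 * (x - mu[0]) ** 2)
        w1 = (1.0 - pi) * np.exp(-0.5 * (x - mu[1]) ** 2)
        q0 = w0 / (w0 + w1)  # responsibility of component 0 for each point
        # M-step: maximise Q(theta, theta_0) = sum_z q(z) log p(x, z | theta).
        # For this model the maximiser is a set of weighted averages.
        pi = q0.mean()
        mu = np.array([(q0 * x).sum() / q0.sum(),
                       ((1.0 - q0) * x).sum() / (1.0 - q0).sum()])
    return pi, mu

# Synthetic data: 30% of points near -2, 70% near +2, unit variance.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 700)])
print(em_mixture(x))  # roughly (0.3, [-2., 2.])
```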
It turns out that this proxy has some other useful properties. For example, because of the concavity of the logarithm, the proxy is always smaller than the original log-likelihood. This means that when we optimise it we never optimise "too far", and progress optimising the proxy is always progress optimising the original likelihood. But I don't need to say anything more about this as it's all part of the standard literature.
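For the curious, that bound is two short steps (my compression of the standard argument, using the $q$ defined above): since $0 \le q(z) \le 1$ we have $\log q(z) \le 0$, and Jensen's inequality handles the rest:

$$\sum_z q(z) \log p(x, z|\theta) \;\le\; \sum_z q(z) \log \frac{p(x, z|\theta)}{q(z)} \;\le\; \log \sum_z p(x, z|\theta) = \log L(\theta).$$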
Afterword
As a side effect we have a general purpose optimisation algorithm that has nothing to do with statistics. If your goal is to compute

$$\operatorname{argmax}_\theta \sum_z f(z, \theta)$$

for some positive function $f$, you can iterate, at each step computing

$$\theta_{n+1} = \operatorname{argmax}_\theta \sum_z \frac{f(z, \theta_n)}{\sum_{z'} f(z', \theta_n)} \log f(z, \theta)$$

where $\theta_n$ is the estimate from the previous step.
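As a sketch of that recipe in code (my own illustration; the function $f(z, \theta) = a_z e^{-(\theta - c_z)^2}$ and all the constants are arbitrary choices): for this $f$ the proxy $\sum_z w_z \log f(z, \theta)$ is quadratic in $\theta$, so each step's argmax is a weighted average, available in closed form.

```python
import numpy as np

# Illustrative positive function: f(z, theta) = a[z] * exp(-(theta - c[z])**2).
a = np.array([1.0, 3.0])
c = np.array([-1.0, 2.0])

def f(theta):
    return a * np.exp(-(theta - c) ** 2)  # one value per z

theta = 0.0
for _ in range(100):
    w = f(theta)
    w = w / w.sum()        # weights f(z, theta_n) / sum_z' f(z', theta_n)
    # argmax_theta of sum_z w[z] * log f(z, theta): log f is quadratic in
    # theta, so the maximiser is the w-weighted mean of the centres c.
    theta = (w * c).sum()
print(theta)  # ~ -0.999, a local maximum of sum_z f(z, theta)
```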
Note
This was originally written as a PDF using LaTeX. It'll be available here for a while. Some fidelity was lost when converting it to HTML.