## Thursday, January 07, 2010

### Mixture Models: Classification vs clustering

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)
We're all family.

k-means is arguably the most popular clustering algorithm (and one of the top 10 data mining algorithms to boot). It has a close relative though (some might say a generalization) in the expectation-maximization algorithm (or EM).

The EM algorithm represents yet another paradigm shift in how we think about clustering, to the point where it's arguable whether we are clustering anything at all. What the EM algorithm really does is density estimation: it assumes the data being clustered is generated by picking one of k distributions according to some probability distribution (the mixture) and then sampling from this distribution.

In other words, the definition of a cluster is not in terms of relationships betweens pairs of data points any more: it's in terms of how a collection of points behave together as a group, as representative samples from a distribution.

It's not particularly hard to see that the k-means algorithm is really a special case of EM on Gaussians(if I might be permitted an indulgence, a "tropical" limiting case of EM). In general though, the EM algorithm is an alternating optimization that finds the maximum likelihood parameters for distributions that capture the data. In the "E" step, it fixes the current parameters (the current hypothesis for the distributions) and computes the expected (log) likelihood of the data by first estimating the "cluster membership probabilities" of assigning a point to a "cluster" or distribution, and then using this to estimate the overall likelihood function. The "M" step then finds the set of parameters that maximizes the likelihood function. In k-means language, the first step figures out which clusters to assign points to, and the second step assigns new centers to each cluster.

The EM algorithm is very powerful. It's generally known that if the distributions under consideration are from an exponential family, then this method is easy to implement, because the E and M steps reduce to simple operations that have closed forms. Via the usual Bregman connection between divergence functions and distributions, it's also possible to give a purely geometric interpretation to the EM method akin to k-means, but with a different distance structure.

What's even more fascinating is the information-geometric perspective. There's a wonderful Riemannian world in which distributions are points on a manifold with coordinate systems given by their (natural) parameters, and where various information distances can be related to affine connections on said manifold. Too much to go into here, but the key insight is this: the E and M steps of the EM algorithm can be seen as projection operations in primal and dual manifolds, so the EM algorithm really looks like a primal-dual procedure. Way cool !

So is this not clustering ? For a detailed argument along these lines, you should read Hal Daume's analysis (with a nifty example). My take on this is that while density estimation is indeed different to the problem of clustering, I think of mixture-model-based clustering as just a much more "volume-based" approach to doing clustering, you expect not that point associate with each other per se, but that the ensemble of points itself satisfies some global properties.

I'll develop this idea further when I talk about information-theoretic clustering: to me, density-based clustering suggests a new abstraction for thinking about what a clustering ought to mean in the absence of even metric structure, and allows us to make pleasant detours into discrepancy, quasi-randomness and kolmogorov complexity. Stay tuned....

p.s apologies for the long hiatus.