Friday, July 03, 2009

Clustering: k-means...

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

k-means (also known as Lloyd's algorithm) is probably the most well known clustering algorithm, and as such, deserves some discussion of its own. We've already seen that it's part of the "I don't like you" family of clustering formulations. How does the method work ?

We start with a set of points in R^d. The algorithm is really simple: Fix a collection of k centers. Now perform the following alternating optimization till you reach a steady state: (1) assign each point to its nearest center (2) for each group of points assigned to the same center, compute a new center by taking the centroid of the points.

More importantly, what problem does this algorithm solve ? The quick answer is: none ! Yes, that's right: k-means gives no answer with any provable guarantee for a generic set of points for any cost measure. Pretty dismal, isn't it ? And yet, it's a really popular algorithm, a state of affairs that generally drives theoreticians to tears. So what's the deal ? The answer, as always, lies in the details, and is a good story of how theoretical work has managed to make some sense of an algorithm that was notoriously hard to analyze properly.

The underlying problem that k-means attempts to solve is the one I mentioned in the last post: let a cluster cost be the sum of squared distances to the cluster center, and minimize the sum of cluster costs over all k-clusterings. Now here are some of the bad things that we CAN say about k-means:
• It might take exponential time to converge, even in the plane
• It can get stuck in a local minimum that has an arbitrarily bad cost.

Running Time:
Let's first consider the problem of guaranteeing running time. Although k-means can take exponential time to converge, a smoothed complexity result says that w.h.p, a perturbed input will converge in polynomial time (for those not familiar with the concept, the smoothed complexity of a problem is its (expected) running time after the inputs have been perturbed randomly by a small amount). The smoothed complexity results suggest one reason why k-means converges quickly "in practice". The kinds of careful constructions (high dimensional, and low dimensional) needed to force super polynomial convergence times are very artificial.

A crucial initialization:

So what about the quality of the results produced ? One unspecified aspect of the k-means algorithm is how the initial set of k centers is created. Two recent works have indicated that the key to extracting good performance from k-means is by being very careful with this initial choice.The idea is very neat, and really should be getting more play in the experimental community than I think it does.

Recall the Gonzalez algorithm for the k-center problem: pick a point, and then its furthest neighbor, and then the point whose closest neighbor in this set is as far as possible, and so on. Consider the following randomized variant:
Pick the next candidate center with probability proportional to its minimum distance squared from the current set of clusters.
Papers by Arthur and Vassilvitski, and by Ostrovksy, Rabani, Swamy and Schulman both show that this simple seeding strategy allows for provable guarantees on the quality of the solution returned by k-means. The actual results are slightly different, and give two different views of the behaviour of the algorithm:
1. The Ostrovsky et al paper starts with the assumption that the point set is epsilon-separated: the intuition behind this definition is that the cost of a k-clustering is significantly better than the cost of a (k-1)-clustering (controlled by a parameter epsilon). Under these conditions, they show that the initial seeding gives a PTAS for the clustering problem.

I'm eliding a number of details about their algorithm. One interesting feature of their algorithm is what they do after the initial seeding: rather than run Lloyd's iterative step, they fix a radius around each center (different for each center) and compute the centroid of the portion of the set within this ball. They also use a more complicated procedure to convert this into a PTAS, and in my mind that makes their approach a lot less practical.

2. The Arthur/Vassilvitski paper starts with the same initialization procedure, but they assume no epsilon-separation result. Their algorithm is simply: run the initialization, and then run regular k-means. The guarantee they get is weaker (a log k approximation ratio), but they also provide empirical results showing the superiority of their approach compared to standard k-means. I think it'd be interesting to compare their approach with the 'ball k-means' approach by Ostrovsky et al to see which work better on benchmark data sets.

Another pleasing property of the k-means algorithm is its relationship to mixture modelling and the EM algorithm. I'll have more to say about EM later on, but the gist of the matter is this: The twin steps of finding new centroids and assigning points to nearest neighbours have direct analogs in the world of maximum likelihood concepts and distribution-generated data. There is a general procedure that takes a given exponential family (Gaussian, Poisson and the like), and generates a distance function d such that the density described by the distribution can be expressed as exp(-d(x, mu)), where mu is the mean of the distribution, and x is the point whose probability we're evaluating. ML folks will recognize the Bregman distances here, and the punchline is that for Gaussian distributions, the corresponding distance function is squared Euclidean distance (the exact function used in the k-means evaluation).

What does this all mean ? It means that the k-means process has a semantic meaning in the context of clustering data drawn from a mixture of Gaussians: what's even neater is that a k-means-like algorithm works for data drawn from other distributions, as long as you replace squared Euclidean distance by the appropriate (Bregman) distance function.

Apart from the simplicity of the algorithm, there are some aspects that make it attractive, regardless of lack of guaranteed bounds. From a heuristic-design standpoint, k-means is a particularly atractive example of a technique that's often called alternating optimization.

Notes:
• I was asked if there were standard data sets for benchmarking clustering algorithms. I don't often see papers doing this in a systematic manner, but a good source of data sets for experimentation is the UCI Machine Learning Repository: there aren't too many clustering data sets, but they can be found here (look for clustering under the default task).
Next up: hierarchical methods...