Saturday, March 13, 2010

Choosing the number of clusters II: Diminishing Returns and the ROC curve

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

In the last post, we looked at the elbow method for determining the "right" number of clusters in data. Today, we'll look at generalizations of the elbow method, all still based on the idea of examining the quality-compression tradeoff curve.

For the purpose of discussion, let's imagine that we have a plot in which the quality of the clustering (measured anyway you please, as long as 0 means all items in one cluster, and 1 means all items in separate clusters) is measured along the y-axis, and the representation complexity (i.e the space needed) is measured on the x axis, where again 0 corresponds to all items in a single cluster (least representation cost), and 1 corresponds to all items in separate clusters.

The ideal cluster is located at (0,1): cheapest representation and perfect quality. In general, as we use more and more clusters to organize the data, we can trace out a curve that starts at (0,0) and ends at (1,1). For most sane functions, this cuve is concave, and lives above the diagonal x=y.

This curve contains lots of useful information about the behavior of the data as the clustering evolves. It's often called the ROC curve, in reference to the curve used to capture the tradeoff between false positive and false negatives in classification. But what can we glean from it ?

We can ask what the curve would look like for "unclusterable" data, or data that has no definite sense of the "right number" of clusters. Such data would look self-similar: you could keep zooming in and not find any clear groupings that stood out. It's not too hard to see that data that looked like this would have a ROC curve that hugs the diagonal x=y, because there's a relatively smooth tradeoff between the quality and compressibility (so no elbow).

Conversely, data that does appear to have a definite set of clusters would try to get closer to the (0,1) point before veering off towards (1,1). This suggests a number of criteria, each trying to quantify this deviation from unclusterability.

  • you could measure the area between the curve and the diagonal. This is (essentially) the AUC (area-under-curve) measure. 
  • You could measure the closest distance between (0,1) and the curve. This also gives you a specific point on the curve, which you could then associate with the "best" clustering. 
  • You could also find the point of diminishing returns - the point of slope 1, where the clustering starts costing more to write down, but yields less quality. I've used this as the start point of a more involved procedure for finding a good clustering - more on that later. 
The ROC curve gives a slightly more general perspective on the elbow method. It's still fairly heuristicy, but at least you're trying to quantify the transition from bad clustering to good in somewhat more general terms. 

Ultimately, what both the ROC curve and the elbow method itself are trying to find is some kind of crucial transition (a phase transition almost) where the data "collapses" into its right state. If this sounds like simulated annealing and bifurcation theory to you, you're on the right track. In the next part, I'll talk about annealing and phase-transition-based ideas for finding the right clustering - all fairly ingenious in their own right. 

(ed. note: comments on the last post have convinced me to attempt a post on the nonparametric approaches to clustering, which all appear to involve eating ethnic food. Stay tuned...)
Post a Comment

Disqus for The Geomblog