Thursday, June 25, 2009

Clustering: An Occasional Series

Clustering is one of those areas that epitomizes the 'ooze' principle in data mining: it's 3 miles wide and an inch deep. Every year, copious ink get spilt on some clustering method or another, and yet it's not really clear what key ideas are driving the field forward.

I ran a seminar this spring on clustering. I wanted to see if there was a way of organizing material on clustering in a way that was broad without necessarily being exhaustive, but that hit the "eigenvectors" of the field, so to speak. I was surprised that for a topic as important and broad-based as clustering, there was little out there in the way of textbook-level source material to work from. Most texts that I did find were too specialized for my taste, and didn't have the kind of broad perspective of clustering that I felt was critical to really understanding the area.

So I organized my material, and was surprised to find that there actually was a way of collecting the base material on the topic in a way that could be covered in a semester format without leaving crucial pieces out. What I did not try to do:
  • Cover the details of the best/greatest/fastest algorithm for each problem. In other words, I made a conscious decision to shy away from theoretical overkill.
  • Try to cover all the variants of different clustering methods. I felt that as long as one knew what k-means was, and what spectral clustering was, there was no real need to study papers that tried to put them together in strange ways.
What I did try to do was:
  • Where possible, try to evaluate the practical significance of the methods (so a linear time clustering algorithm that actually had a running time doubly exponential in epsilon would not necessarily make the cut, unless it had some other redeeming features)
  • Try to view the methods from the perspective of the user: given some data, with some structure, what are your options in terms of clustering algorithms ?
The results of this organization can be found here. I was relatively happy with the outcome, and I think the students were too. In fact, many of the discussions in class were interesting enough that I put them together in a collection of notes.

What I want to do in this occasional series is sketch out the "top-level" view of clustering as I see it, and as inspired by our seminar. My vague idea was that these notes might eventually be converted into some kind of monograph, or at the very least a web document that others might find useful. So I'm going to "test-drive" the ideas here, in the hope that my very well-informed readers will add to the discussion, and improve it (as usually happens).

First up: the basic trinity - k-(center/median/means).
Post a Comment

Disqus for The Geomblog