I ran a seminar this spring on clustering. I wanted to see if there was a way of organizing material on clustering in a way that was broad without necessarily being exhaustive, but that hit the "eigenvectors" of the field, so to speak. I was surprised that for a topic as important and broad-based as clustering, there was little out there in the way of textbook-level source material to work from. Most texts that I did find were too specialized for my taste, and didn't have the kind of broad perspective of clustering that I felt was critical to really understanding the area.
So I organized my material, and was surprised to find that there actually was a way of collecting the base material on the topic in a way that could be covered in a semester format without leaving crucial pieces out. What I did not try to do:
- Cover the details of the best/greatest/fastest algorithm for each problem. In other words, I made a conscious decision to shy away from theoretical overkill.
- Try to cover all the variants of different clustering methods. I felt that as long as one knew what k-means was, and what spectral clustering was, there was no real need to study papers that tried to put them together in strange ways.
What I did try to do was:
- Where possible, try to evaluate the practical significance of the methods (so a linear time clustering algorithm that actually had a running time doubly exponential in epsilon would not necessarily make the cut, unless it had some other redeeming features)
- Try to view the methods from the perspective of the user: given some data, with some structure, what are your options in terms of clustering algorithms ?
The results of this organization can be found here. I was relatively happy with the outcome, and I think the students were too. In fact, many of the discussions in class were interesting enough that I put them together in a collection of notes.
What I want to do in this occasional series is sketch out the "top-level" view of clustering as I see it, and as inspired by our seminar. My vague idea was that these notes might eventually be converted into some kind of monograph, or at the very least a web document that others might find useful. So I'm going to "test-drive" the ideas here, in the hope that my very well-informed readers will add to the discussion, and improve it (as usually happens).
First up: the basic trinity - k-(center/median/means).