## Saturday, June 11, 2011

### Clustering as compression

(this is part of an occasional series of essays on clustering)
> Numquam ponenda est pluralitas sine necessitate. ("Plurality must never be posited without necessity.")
> -- William of Ockham

Clustering as compression generalizes both mixture-model clustering and distance-based methods. The central idea in this view of clustering is Occam's razor: can you express the data as concisely as possible, allowing for some loss ?

As usual, the devil is in the details: here, the definition of "concisely". There are two common notions of bounded space, but they are actually not that different.

#### The information-theoretic approach

The information-theoretic approach dates back to Shannon. Consider sending information across a pipe. How big, on average, does the pipe need to be in order to transmit the information ? One of Shannon's celebrated results is that this can be quantified by the mutual information of the channel: the amount of information shared between the input and the output.

Thinking of mutual information as a measure of space is quite easy once you get used to it. It's helpful to consider the boundary cases. If the channel flips each input bit at random (so the output is independent of the input), then the output carries no information about the input and the mutual information is zero, corresponding to the idea that you don't need to transmit much information through the channel.

On the other hand, if the output is a deterministic function of the input, then the mutual information of the channel equals the entropy of the source, corresponding to the idea that the channel needs to send over a complete description of the input to compute the output correctly.
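These two boundary cases can be checked directly. The sketch below computes $I(X;Y)$ from a source distribution and a channel matrix $p(y|x)$; the uniform binary source and the two toy channels are my own illustrative choices, not from the original post.

```python
import math

def mutual_information(p_x, p_y_given_x):
    """I(X;Y) in bits, given the source p(x) and channel p(y|x)."""
    # Marginal p(y) = sum_x p(x) p(y|x)
    p_y = [sum(p_x[i] * p_y_given_x[i][j] for i in range(len(p_x)))
           for j in range(len(p_y_given_x[0]))]
    mi = 0.0
    for i, px in enumerate(p_x):
        for j, py in enumerate(p_y):
            pxy = px * p_y_given_x[i][j]  # joint mass p(x, y)
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px * py))
    return mi

p_x = [0.5, 0.5]  # uniform binary source

# Boundary case 1: output is independent noise -- a fair coin flip.
noise = [[0.5, 0.5], [0.5, 0.5]]
print(mutual_information(p_x, noise))      # 0.0 bits

# Boundary case 2: output is a deterministic (identity) function of input.
identity = [[1.0, 0.0], [0.0, 1.0]]
print(mutual_information(p_x, identity))   # 1.0 bit = H(X)
```

The deterministic channel attains the source entropy (1 bit for a fair binary source), while the noisy channel attains zero, matching the two extremes described above.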

What does this have to do with clustering ? Think of the channel as the process by which the input points are mapped to clusters. In particular, the channel can be encoded by the conditional probability $p(t | x)$ of assigning a point $x$ to cluster $t$. Once we have that, the mutual information $I(T;X)$ between the set of clusters $T$ and the set of points $X$ measures how concisely $T$ represents $X$.

There's an intuitive explanation for this as well. Suppose that each point were assigned to exactly one cluster. Then $H(T|X) = 0$, so $I(T;X)$ equals the entropy $H(T)$, which is just the average number of bits needed to write down a point's cluster label.

But what about the quality of the embedding ? Given some distance function $d$ on the input and output domains, you can compute the expected distortion as $\sum_{x,t} p(x)\, p(t|x)\, d(x,t)$. The study of how this varies with $I(T;X)$ is what's known in information theory as rate-distortion theory, and for an appropriate choice of the distance measure $d$ it leads to the well-known Information Bottleneck method.
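The expected-distortion sum is easy to compute concretely. This sketch uses a toy setup of my own (two points on a line, two hypothetical cluster centers, squared-error distortion, and an arbitrary soft assignment $p(t|x)$):

```python
# Toy rate-distortion-style computation (illustrative values, not from the post).
xs = [0.0, 1.0]          # two input points on a line
centers = [0.2, 0.8]     # two hypothetical cluster "centers"
p_x = [0.5, 0.5]         # uniform source

def d(x, t):
    """Squared-error distortion between point x and center t."""
    return (x - centers[t]) ** 2

# A soft assignment p(t|x): each point mostly, but not entirely, in one cluster.
p_t_given_x = [[0.9, 0.1], [0.1, 0.9]]

# Expected distortion:  sum over x, t of p(x) p(t|x) d(x, t)
expected_d = sum(p_x[i] * p_t_given_x[i][t] * d(xs[i], t)
                 for i in range(2) for t in range(2))
print(expected_d)        # approximately 0.1
```

Making the assignment softer lowers $I(T;X)$ (less space) but raises this sum (more distortion); rate-distortion theory traces out exactly that tradeoff curve.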

#### Kolmogorov complexity

The second approach takes the Kolmogorov view: that compactness of description can be expressed as the size of the smallest program producing the data. In many ways, the Kolmogorov complexity of a string serves as a point-wise proxy for entropy, and the equivalent of mutual information can be defined as well (for more on this, see the note by Grunwald and Vitanyi).

While the rate-distortion framework can be extended to the Kolmogorov setting, it's a lot more technical, and is hard to explain in a short note such as this. An easier application of the Kolmogorov method is due to Cilibrasi and Vitanyi, who showed how to cluster by defining a distance metric based on compression (roughly, as the complement of a normalized mutual information), and then running hierarchical clustering.
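Since Kolmogorov complexity is uncomputable, a real compressor stands in for it in practice. This sketch uses `zlib` as that stand-in and the Cilibrasi-Vitanyi normalized compression distance; the sample strings are my own, chosen so that the two English-like texts share vocabulary while the digit string shares none.

```python
import zlib

def C(s: bytes) -> int:
    """Compressed length: a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(s, 9))

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance of Cilibrasi and Vitanyi."""
    ca, cb, cab = C(a), C(b), C(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

english  = b"the quick brown fox jumps over the lazy dog " * 20
english2 = b"the lazy dog sleeps while the quick brown fox runs " * 20
digits   = b"3141592653589793238462643383279502884197169399375105" * 20

# Texts that compress well together are "close"; the two English strings
# share vocabulary, so their NCD should be smaller than English vs digits.
print(ncd(english, english2), ncd(english, digits))
```

Feeding the resulting pairwise distance matrix to any off-the-shelf hierarchical clustering routine recovers their clustering method.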

In summary, the main value of the compression-based viewpoint is the idea of a space-quality tradeoff, as opposed to the more artificial $k$-vs-quality tradeoff. What's nice about the information-theoretic approach is that the units for space and quality are the same, so you don't need ugly balancing constants to compare them.

P.S. There are those who might argue that the information-theoretic view of clustering is merely a kind of mixture density estimation. In a sense, this is true: specifically, the information bottleneck method can be viewed as estimating a mixture of multinomials if we reinterpret the Kullback-Leibler divergence as the associated Bregman divergence for the multinomial family. This in turn connects to methods like LDA and probabilistic PCA. However, I'd argue that the viewpoint of compression is quite different, even though the end result looks similar, and it's the viewpoint that's valuable here.