## Tuesday, May 04, 2010

### Metrics on distributions defined over metric spaces

Semester is over, which means I hope to get back into blogging form again, picking up my clustering series where it left off, and also starting some new rants on problems in shape matching (which more and more to me look like problems in clustering).

But for today, just a little something to ponder. The following situation often occurs in data analysis. You have some data that inhabits a metric space - usually Euclidean space, but that doesn't really matter. You also have distributions over the data, by which I mean some kind of weight vector with one "component" for each data point, and components summing to one. The goal now is to compare these weight vectors in a way that takes into account the structure of the space.

The standard construction that one uses here is the earthmover distance, also known as the Wasserstein distance, or the Monge-Kantorovich distance, or the Mallows distance, or the transportation metric (you pick your favorite one). It's very intuitive (which is probably why it's been invented so many times) and works like this. Imagine piles of earth at each data point, with each pile having mass equalling the weight at that point. We have a "starting configuration" consisting of the first weight vector, and an "ending configuration" consisting of the second weight vector. The goal is to figure out how to "transport" the earth with minimum effort (effort = weight X distance moved) so that the first configuration becomes the second. Formally, this amounts to a generalized matching problem that can be solved via the Hungarian algorithm. The earthmover distance is very popular in computer vision (see the Wikipedia article for details)

Another metric over distributions of the kind above is the Levy-Prokhorov metric, which for two distributions u and v is defined as the smallest e such that on any neighborhood A, the measure of u is within e of the measure of v on A inflated by e (i.e by constructing a ball of size e around A), and vice versa. I haven't seen this metric used much in practice, and it seems hard to compute.

Another approach that I realized recently uses a method that thus far has been used only in shape analysis. I've been working with a shape matching measure called the current distance of late (more on this in another post - it's a fascinating measure). Roughly speaking, it works like this. It starts with a "shape" defined anyway you like (clouds of points, a curve, a surface, whatever), and a similarity function (a positive definite kernel actually) defined on this space. It then allows you to compare these shapes by using the kernel to create a "global signature" that lifts each shape to a point in a Hilbert space, where the induced distance captures the shape distance.

It also works with weighted point sets, which is the relevant point here. Suppose I give you a space with distance defined indirectly via a kernel similarity function (rather than via a distance function). The current distance then gives me a way of comparing distributions over this shape just like the above measures, and the kicker is that this approach is WAY more efficient then any of the above methods, taking near-linear time instead of needing the rather expensive Hungarian algorithm. Moreover, the current distance has a built-in isometric embedding into a Hilbert space, something the earthmover distance cannot have.

If you're curious for more details, wait for my post on the current distance - in the mean time, there are two papers we've written (one online, the other you should email me for) that explore the theory and practice behind the current distance. I'm curious now as to whether the current distance can be used an efficient replacement for the earthmover distance in applications that rely on the EMD, but don't have a natural relationship to shape analysis.

p.s Shape matching in particular, and data analysis in general, is a rich source of ever more exotic metric spaces. I've been working with a number of these, especially in non-Euclidean spaces, and there are lots of interesting algorithms questions here.