Friday, September 09, 2005

SODA 2006

As you all probably know by now, SODA results are out. Lots of interesting stuff; a rough survey indicates around 16 geometry papers (depending on how you count them). Lots of embedding results as well.

I hope to track down and review some of the papers once they start appearing online. For now, I will entertain you with one of my own (assuming this doesn't break some academic blogosphere code of etiquette ;)).

The Hunting of the Bump: On Maximizing Statistical Discrepancy. Deepak Agarwal, Jeff Phillips, Suresh Venkatasubramanian.

The title (or at least the first part) is a tip of the hat towards an earlier paper by Friedman and Fisher titled 'Bump Hunting in High Dimensional Data'. The basic setup is this: you're given points labelled red or blue, and a class of shapes (usually hypercubes, but you could have spheres, pyramids, etc.). Given a shape, you count the number of red and blue points in it, and compute a discrepancy function that could be as simple as |#red - #blue|. The goal is to find the shape that maximizes this function (i.e., finds the biggest bump).
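
To make the setup concrete, here is a minimal brute-force sketch (names are made up, and this is emphatically not the algorithm from the paper): it scans all axis-aligned rectangles spanned by the point coordinates and reports the largest combinatorial discrepancy |#red - #blue|.

```python
# Illustrative brute-force sketch of the "biggest bump" setup for
# axis-aligned rectangles in the plane; not the paper's algorithm.
from itertools import combinations_with_replacement as pairs

def max_combinatorial_discrepancy(red, blue):
    """red, blue: lists of (x, y) points. Returns the best |#red - #blue| found."""
    xs = sorted({p[0] for p in red + blue})
    ys = sorted({p[1] for p in red + blue})
    best = 0
    # Candidate rectangles are spanned by pairs of point coordinates.
    for x1, x2 in pairs(xs, 2):
        for y1, y2 in pairs(ys, 2):
            inside = lambda p: x1 <= p[0] <= x2 and y1 <= p[1] <= y2
            r = sum(1 for p in red if inside(p))
            b = sum(1 for p in blue if inside(p))
            best = max(best, abs(r - b))
    return best
```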

Where do the red and blue come from ? One of the more popular applications of this problem is in biosurveillance, where the red points represent incidences of disease, and the blue points represent a baseline population (the points may be tagged with weights). A bump is an unusually high or low disease rate with respect to the population.

What discrepancy function do we use ? The "simple" combinatorial discrepancy function described above has actually been looked at before, in the context of learning. Here, the red points are "good" elements of a classifier and the blue points are "bad", and the shape is a hypothesis that classifies the most points correctly while misclassifying the fewest. Dobkin, Gunopulos and Maass presented exact and approximate algorithms for this problem in arbitrary dimensions.

But in the biosurveillance domain, statistical measures are more commonly used. The idea is that we want the discrepancy function to model a likelihood ratio with respect to some underlying population distribution: how likely was it that I saw as many red and blue points as I did ? The resulting function often resembles a Kullback-Leibler distance, or a chi-squared distance, and one of the most popular measures is the Kulldorff scan statistic, developed by Martin Kulldorff in 1997.
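
For the curious, one common form of the Kulldorff statistic, written in terms of the fraction r of red (case) points and the fraction b of blue (baseline) points falling inside a candidate region, looks roughly like the sketch below; the Python is purely illustrative and not code from the paper.

```python
# Rough sketch of the Kulldorff scan statistic as a discrepancy function
# d(r, b): a log-likelihood ratio that rewards regions over-represented
# in cases (r > b). Clamping by eps avoids log(0) at the boundary.
from math import log

def kulldorff(r, b, eps=1e-9):
    r = min(max(r, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    if r <= b:
        return 0.0
    return r * log(r / b) + (1 - r) * log((1 - r) / (1 - b))
```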

Unfortunately, such measures are painful to work with from an algorithmic point of view. They are not as "nice" as the (linear) combinatorial discrepancy, and so the best known algorithms have been either heuristics or (theoretically) brute-force methods with good empirical behaviour.

What we show is that as long as the discrepancy function is convex (which likelihood-ratio-based distances tend to be), you can approximately rewrite it as the upper envelope of a set of linear functions, each of which you can then whack with the DGM algorithm. The question of course is, how many linear functions do you need ? This is the tricky part, and involves an asymptotic bound on the eigenvalues of the Hessian of the discrepancy function.
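
Here is a rough sketch of the idea (the names and helpers are hypothetical; in particular, maximize_linear stands in for the DGM subroutine, and the paper's actual construction and error analysis are considerably more careful): since a convex d(r, b) lies above each of its tangent planes, you can sample a few points, take the tangent plane at each one, maximize each resulting linear function over shapes, and return the best value found.

```python
# Hedged sketch of the convex-to-linear reduction: approximate a convex
# discrepancy d(r, b) by the upper envelope of its tangent planes at a few
# sample points, then maximize each linear piece separately.

def tangent_planes(d, grad_d, samples):
    """For each sample (r0, b0), return (alpha, beta, gamma) so that
    alpha*r + beta*b + gamma is the tangent plane of d at (r0, b0)."""
    planes = []
    for r0, b0 in samples:
        alpha, beta = grad_d(r0, b0)
        gamma = d(r0, b0) - alpha * r0 - beta * b0
        planes.append((alpha, beta, gamma))
    return planes

def approx_max_discrepancy(planes, maximize_linear):
    """maximize_linear(alpha, beta) should return the maximum over shapes S
    of alpha*r(S) + beta*b(S) -- the DGM step in the paper, or the
    brute-force rectangle search above as a stand-in."""
    return max(maximize_linear(a, b) + g for a, b, g in planes)
```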

This is a fairly general crank that allows us to solve efficiently (and approximately) a number of (till now) different statistical discrepancy problems (including maximizing with the Kulldorff scan statistic) with the same basic method. There are other results in the paper, which I hope to have online soon.