One of the things that has haunted me over the years (yes, haunted!) is the geometry of Bregman divergences. A Bregman divergence is a very interesting creature. It's defined as the difference between a convex function and its linear approximation starting at another point:
$$ d_\phi(p,q) = \phi(p) - \phi(q) - \langle \nabla \phi(q), p - q\rangle $$
Because the Bregman divergences are parametrized by the function $\phi$, they yield a family of divergences. The most familiar one is $\ell_2^2$, but the most important one is the Kullback-Leibler divergence. There are a ton of reasons why one should study Bregman divergences, and in a shameless plug I'll refer to you my slides (one, two, three) from a geometry summer school last year. Suffice it to say that it's possible to unify a number of different algorithms in machine learning under a single framework by realizing that they're all particular instances of doing something with a Bregman divergence: two notable examples of this are Adaboost and information-theoretic clustering.
So what's the geometric perspective on Bregman divergences ? They generalize duality !
There's a standard way to think about traditional point-hyperplane duality via a lifting map: take a point, lift it to the paraboloid one dimension up, and find the tangent plane. But suppose we replace the paraboloid by a generic convex function ? what we get is a general convex duality (technically a Fenchel-Legendre duality) defined by
$$\phi^*(u) = \sup_v \langle u,v \rangle - \phi(v)$$
The reason this doesn't come up in Euclidean space is that the mapping is self-dual if we set $\phi(x) = (1/2) \|x\|^2$, which explains the symmetry in $\ell_2^2$.
It turns out that this observation is enough to import much of standard combinatorial geometry over to general Bregman spaces. We can compute convex hulls, Voronoi Diagrams, delaunay triangulations etc, as long as we're careful to keep primal and dual straight. for example, the locus of points equidistant from two points under a Bregman divergence is either a primal straight line or a dual straight line (depending on how you measure things), and so on. This was all elegantly worked out by Nielsen, Nock and Boissonnat a while ago.
Alas, this beautiful connection doesn't help us with approximation problems.
One of the most interesting questions regarding Bregman divergences is solving the near-neighbor problem. This is interesting because the Kullback-Leibler distance is often used (for example) to compare images, and so there's been a lot of empirical work on this problem.
But here's the problem. Let's consider the low dimensional ANN problem. In Euclidean space (or even in spaces of bounded doubling dimension), here's how you solve the problem. You build some kind of quad-tree-like data structure, using the triangle inequality to reason about which cells you need to explore, and using packing bounds to bound the number of cells explored. You also need a crude bound on the near neighbor to start with, and to do this, you use some variant of a ring-tree.
The key ingredients: triangle inequality (twice over) and packing bounds.
Bregman divergences don't even satisfy a directed triangle inequality in general. And to date, we didn't even know how to define packing bounds properly for these directed distances.
In an upcoming paper at SoCG 2012 with my students Amirali Abdullah and John Moeller, we finally figured out how to get some "approximation" of the ANN machinery to work with Bregman divergences, and get logarithmic query times with small space.
If I say so myself, there are some nice ideas in the paper. Firstly, in a somewhat surprising twist, a Bregman divergence satisfies a "reverse triangle inequality" on the line:
\[ d_\phi(a,b) + d_\phi(b,c) \le d_\phi(a, c), a, b, c, \in R\]
This is neat, because it gives packing bounds ! Intuitively, if the sum of lengths of subintervals along an interval is at most the length of the interval, then you can't have too many subintervals.
The next trick is an observation that the square root of a Bregman divergence satisfies a kind of relaxed triangle inequality that we call $\mu$-defectiveness:
\[ |d(x, y) - d(x,z)| \le \mu d(y, z)\]
This allows us to import some of the ring-tree machinery to get a coarse approximation.
And I should mention that the $\mu$ values involved here are quite small. If you're comparing distributions with the KL-divergence, then the value of $\mu$ is less than 2 even quite close to the boundary of the simplex.
Even with all of this in place, the quad-tree argument breaks. This is because of $\mu$-defectiveness: we can't assume that cell sizes are "reasonably large" at some number of levels below the root of the tree, and so our packing bounds look terrible. It takes a lot more work to fix this problem: essentially we exploit second-order structure of the Bregman divergences to bound the cell sizes and get the desired packing bound.
- Most of the complexity of the algorithm is in the analysis: the actual algorithm looks mostly like a Euclidean ANN procedure. While we haven't implemented it, I'm hopeful that when we do, we'll be able enjoy the empirical behavior of our Euclidean cousins.
- We're looking at high dimensional Bregman near neighbors, as well as some other approximate Bregman geometry questions. While our low-D ANN result comes from the "I will break your limbs one by one till you submit" school of algorithms, the hope is that we can start to exploit more of the dual geometry as we learn more.