Sunday, April 10, 2011

POTD: A unified approach to Approximate Proximity searching

A Unified approach to Approximate Proximity Searching
Sunil Arya, Guilherme D. da Fonseca, and David M. Mount

Abstract:
The inability to answer proximity queries efficiently for spaces of dimension d > 2 has led to the study of approximation to proximity problems. Several techniques have been proposed to address different approximate proximity problems. In this paper, we present a new and unified approach to proximity searching, which provides efficient solutions for several problems: spherical range queries, idempotent spherical range queries, spherical emptiness queries, and nearest neighbor queries. In contrast to previous data structures, our approach is simple and easy to analyze, providing a clear picture of how to exploit the particular characteristics of each of these problems. As applications of our approach, we provide simple and practical data structures that match the best previous results up to logarithmic factors, as well as advanced data structures that improve over the best previous results for all aforementioned proximity problems.
Notes:
When a problem becomes interesting, papers get written quickly, and techniques start pouring out of the firehose. Not all tricks are needed, and not all machinery is effective, but cleanup only comes later, once the frenzy has settled. Approximate nearest neighbor research is like this: there are many tricks, and lots of machinery, but there are also some core ideas that keep showing up, and are sufficient for many variants of the problem.

Near-neighbor searching in low dimension is "easy" if you're given data that's uniformly sampled. Simple divide-and-conquer will give balanced search trees, and thus low query times. The problem comes when you have regions of sparsity. Informally, they prevent you from making progress as you divide the space up, and so the root-to-leaf length increases, increasing query time.

While the ANN problem in low dimensions is "solved" in that there's a fairly good theoretical understanding of the bounds and tradeoffs needed for query time vs storage, the methods themselves are quite complex. I learnt this first-hand when reading the Har-Peled paper that uses approximate Voronoi diagrams for near-neighbor search in an attempt to reverse-engineer the method for a different setting.

The beauty of the POTD is that it starts with a very simple data structure - the compressed quad tree. It shows that this structure can be used to isolate "sparse" and "dense" regions of space, and uses a hybrid strategy for processing these regions, with core sets to reduce size in dense regions, and optimized data structures for sparse regions (that necessarily only have few points). While the paper itself has no experimental results, I'd imagine that this approach would lend itself far more easily to experimentation.
Post a Comment

Disqus for The Geomblog