Sunday, August 01, 2004

Sorting vs Searching:

In a previous post, I had mentioned an upcoming paper in FOCS 2004 by Franceschini and Grossi titled 'No Sorting? Better Searching!'.

The paper is not yet online, but Roberto Grossi posted a long comment in that post detailing the results in the paper. I reproduce the comment below in full; a short summary is:

They show that the natural way to do searching in a set of ordered elements (i.e via sorting and then doing binary search), makes sense when the cost of comparisons is O(1), but does not make sense when elements are larger (formally, when each element of the list actually consists of k characters, where k is super-constant). They do this by demonstrating a new ordering technique that beats known lower bounds on searching a sorted list; what's nice is that their result is tight as well.

His abstract follows (bold-face is mine):

Sorting is commonly meant as the task of arranging keys in increasing or decreasing order (or small variations of this order). Given n keys underlying a total order, the best organization in an array is maintaining them in sorted order. Searching requires Θ(log n) comparisons in the worst case, which is optimal. We demonstrate that this basic fact in data structures does not hold for the general case of multi-dimensional keys, whose comparison cost is proportional to their length. In previous work [STOC94, STOC95, SICOMP00], Andersson, Hagerup, Håstad and Petersson study the complexity of searching a sorted array of n keys, each of length k, arranged in lexicographic (or alphabetic) order for an arbitrary, possibly unbounded, ordered alphabet. They give sophisticated arguments for proving a tight bound in the worst case for this basic data organization, up to a constant factor, obtaining

Θ[ (k log log n)/(log log (4 + (k log log n)/(log n)) + k + log n ]

character comparisons (or probes). Note that the bound is Θ(log n) when k=1, which is the case that is well known in algorithmics.

We describe a novel permutation of the n keys that is different from the sorted order, and sorting is just the starting point for describing our preprocessing. When keys are stored according to this ``unsorted'' order in the array, the complexity of searching drops to Θ( k + log n) character comparisons (or probes) in the worst case, which is optimal among all possible permutations of the n keys in the array, up to a constant factor. Again, the bound is Θ(log n) when k=1. Jointly with the aforementioned result of Andersson et al., our finding provably shows that keeping k-dimensional keys sorted in an array is not the best data organization for searching. This fact was not observable before by just considering k=O(1) as sorting is an optimal organization in this case.


When the paper is available I will link to it here; one interesting question is: how hard is this "other" order to compute ?

No comments:

Post a Comment

Disqus for The Geomblog