Tuesday, January 09, 2007

SODA 2007: Day 2

David has been posting detailed impressions of specific papers. Do check them out.

He mentions the constant-degree expander result of Alon, Schwartz and Shapira. I was at that talk this morning, and it's a very neat result. Expander graphs have a long and fascinating history (see the Linial-Wigderson course notes for details), and I won't try to summarize it here. Basically, the goal for a long time has been to construct reasonably sparse, constant-degree expanders that also have "simple" proofs of expansion. Many of the constructions to date were either non-constructive (like the random expanders) or had very sophisticated proofs of expansion. Luca Trevisan had two very detailed posts that give an idea of how complicated some of these arguments can be.

The main contribution of the Alon/Schwartz/Shapira paper is the explicit construction of an expander that is both small (constant degree) and has a "simple" proof of expansion. I should mention that the celebrated zig-zag product of Reingold-Vadhan-Wigderson already does this. However, their proof (and basically all proofs of expansion) relies on the spectral analysis of the graphs, using the relation between expansion and the gap between the first and second eigenvalues of the adjacency matrix of a graph.

This paper uses a graph product called a replacement product, and presents an elementary (i.e., combinatorial) proof that the replacement product of two expanders is also an expander. With that in hand, they construct three graphs, and with two replacement products, get a constant-degree expander.
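To make the product itself concrete (this says nothing about the expansion proof): the replacement product replaces each vertex of a D-regular graph G by a "cloud" of D vertices wired up like a small graph H, and adds one "matching" edge per cloud vertex according to G's edges. A minimal sketch, using a rotation-map representation of G and a made-up example (K4 and a triangle); the function names are mine, not the paper's:

```python
def rotation_map(adj):
    """adj: dict vertex -> sorted neighbour list of a D-regular graph.
    Rot(v, i) = (u, j): the i-th edge out of v is the j-th edge out of u."""
    return {(v, i): (u, adj[u].index(v))
            for v, nbrs in adj.items() for i, u in enumerate(nbrs)}

def replacement_product(adj_G, edges_H):
    """Replace each vertex of D-regular G by a copy of H on D vertices:
    'cloud' edges come from H, plus one 'matching' edge per cloud vertex
    induced by the edges of G."""
    rot = rotation_map(adj_G)
    edges = set()
    for v in adj_G:
        for (i, j) in edges_H:               # copy of H inside each cloud
            edges.add(frozenset({(v, i), (v, j)}))
    for (v, i), (u, j) in rot.items():       # matching edges from G
        edges.add(frozenset({(v, i), (u, j)}))
    return edges

# Example: G = K4 (3-regular), H = a triangle on 3 vertices (2-regular).
K4 = {v: sorted(set(range(4)) - {v}) for v in range(4)}
C3 = [(0, 1), (1, 2), (0, 2)]
E = replacement_product(K4, C3)

deg = {}
for e in E:
    for x in e:
        deg[x] = deg.get(x, 0) + 1
print(len(deg), set(deg.values()))   # 12 vertices, all of degree 3
```

The point of the product is visible in the degrees: if H is d-regular, every vertex of the product has degree d + 1, no matter how large the degree of G was — which is exactly how iterating the product keeps the degree constant.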

The invited talk today was by Monika Henzinger, of Google and EPFL. This talk was the "applied" talk (SODA usually has one such); Monika talked about algorithmic success stories at Google, discussing PageRank (and HITS), detecting document duplicates, and load balancing queries on servers. Each of these topics deserves a post of its own (the work on detecting duplicates has some particularly elegant ideas), so I don't want to go into detail here.

There's a point worth making here. The success of PageRank and other methods for search is really a success story for algorithmic modelling, rather than algorithms per se.

What do I mean by this ? I mean that the key success of PageRank, to take an example, was the idea that pages could be viewed as nodes, and edges as transitions in a Markov chain, and that relevance (or PageRank) could be modelled as a kind of "return probability". Of course, once you do this, all your theoretical machinery kicks in, and you can prove bounds on convergence, design fast algorithms, etc. But the main step was the modelling step, where you took the raw data (web pages) and viewed them in a certain abstract framework. For those of you who don't remember this, the existing paradigm of search at the time was text-based IR, and Altavista was the main exemplar of this. What Google was proposing was a very different way of viewing documents and the problem of relevance.
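Once you buy the model, the algorithmic part really is routine machinery. A minimal sketch of that machinery — power iteration on a tiny, made-up link graph, with the damping factor and iteration count chosen arbitrarily for illustration:

```python
# Pages are nodes, links are Markov-chain transitions, and a page's
# score is its stationary probability under the random-surfer walk.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = sorted(links)
d = 0.85                                   # damping factor
rank = {p: 1 / len(pages) for p in pages}  # start from the uniform distribution

for _ in range(100):                       # power iteration to (near) convergence
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += d * rank[p] / len(outs)
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
```

Every page here has outlinks, so the scores stay a probability distribution; page C, which everyone links to, ends up ranked highest. None of this is deep — the insight was entirely in deciding that this was the right object to compute.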

This is a common, and yet widely misunderstood, aspect of doing "applied" algorithms. You can define all kinds of problems, and write all kinds of papers proving results about these problems. But the mathematical tools developed for solving these problems will always take a backseat to the essentially scientific question of whether the problems and models fit the data correctly or not.

There are many domains where this is not true; cryptography is one domain where provable guarantees are not just nice, but are the crucial element of a secure system. But the success of algorithms in Web search comes not from knowing how to simulate a Markov chain efficiently, but from realizing that web documents are essentially nodes in a gigantic graph, and that the problem of ranking pages can be translated to a mathematical abstraction on graphs. As an aside, one of the things that the Kleinberg/Tardos textbook appears to do well is walk students through the process of problem abstraction, via the extended real-world exercises.

Understanding this aspect of data modelling changes the questions somewhat. The issue now is not, "Is this the most efficient algorithm for the problem", but rather, "Is this the right problem for the data" ? The first question becomes relevant only once the second is answered satisfactorily, and answering the second is more akin to a scientific discovery of truth than a mathematical one.

Outtakes:
  • (Thanks to Vijay Kumar) You could, at some point, buy a watch on Amazon.com for the heavily discounted (50% off) price of $499,999. The comments on the product page are hilarious.
  • What's the title of Britney Spears' only SODA paper ? "Stable Marriage is Hard"