I recently submitted (with Arvind Agarwal and Jeff Phillips) a paper on a unified view of multidimensional scaling. It's primarily empirical, in that the main technique is a heuristic that has many nice properties (including providing a single technique for optimizing a whole host of cost measures for MDS). For anyone interested though, there's also a nice JL-style theorem for dimensionality reduction from (hi-D) sphere to (lo-D) sphere, which gets the "right" bound for the number of dimensions.

But this post isn't really about the paper. It's more about the challenges of doing good empirical work when you're trained to think formally about problems. This post is influenced by Michael Mitzenmacher's exhortations (one, two) on the importance of implementations, and Mikkel Thorup's guest lament on the lack of appreciation for simple algorithmic results that have major consequences in practice.

So you're looking at practical ramifications of some nice theory result, or you're trying to apply formal algorithmic tools to some application problem. If you're lucky, the problem doesn't have an existing base of heuristics to compare against, and so even a straight-up implementation of your ideas is a reasonable contribution.

Of course, you're rarely this lucky, and there's usually some mish-mash of heuristics to compare against. Some of them might be head-scratchingly weird, in the "why on earth should this work" category, and some are possibly more principled. At any rate, you go and implement your idea, and you suddenly realize to your horror that worst-case bounds don't mean s*** when your principled method is ten times slower than the crazy heuristics. So now what ?

The central point that I want to make here is that while actually paying attention to implementation issues is of immense value if we want people to actually care about theoretical work, I don't think we get the right kind of training to do it well.

First off, we lack training in various heuristic design strategies. Now I don't actually mean the kinds of heuristics that one might come across in the Kleinberg-Tardos book (local search, annealing, and the like). I mean the much larger body of principle-driven heuristics that the optimization community is quite familiar with. Without even thinking too hard, it's easy to list heuristics like Newton's method, conjugate gradients, alternating optimizations, matching pursuit, majorization, the frank-wolfe method, iteratively reweighted least-squares, (and I could keep going on...)

Of course you might point out that I seem to know all about these heuristics. Well, not quite. The second problem is that even if one knows about these methods, that's not the same thing as having a good intuitive feel for when and why they work well. Case in point: one of the thorns in our side in this paper was a heuristic for MDS called SMACOF. It's a nice technique based on majorization, and although it's a heuristic, it works pretty darn well most of the time and takes a lot of effort to beat, even though there's no clear reason (at least to me) why it should work so well. The only real way to get a sense for how different heuristics work is to implement them all really, or at least have the right set of MATLAB/C/C++ tools lying around. I notice that ML folks tend to do this a lot.

The third problem that often comes up is by far the most irritating one: the actual cost function you're optimizing doesn't even matter that much at all. Returning again to our paper, there are many ways to define the "error" when embedding a metric into another metric. The traditional theoryCS way looks at dilation/contraction: the worst-case ratio between distances in the original and target space. Most variations on MDS actually look at an average difference (take the difference between the distances, and average some function of this). As anyone who mucks around with metric spaces will tell you, the actual error function used can make a huge difference to the complexity of the problem, the ability to approximate, and so on and so forth.

But here's the thing we discovered: it's actually possible to run heuristics designed explicitly for one kind of error function that do just great for another kind of error function, and it takes a lot of work to construct examples that that demonstrate the difference.

These points tie together in a more general theme. I was reading a post by Jake Abernathy at Inherent Uncertainty, and he makes a valuable point about the difference between algorithms/theory culture and ML culture (although ML could be replaced by other applied areas like db/data mining as well). His point is in theoryCS, we are problem-centric: the goal is to prove results about problems, and taxonomize them well. Improve asymptotic run-time - great ! get a better approximation ratio - excellent ! reimplement the same algorithm with the same running time to get a better behaviour in practice - hmmm. This is in contrast (as he puts it) to a lot of ML research, where the algorithmic technique comes first, and it's later on that some results are generated to go along with it.

This of course drives us nuts: NOT EVERY CLUSTERING PROBLEM SHOULD BE SOLVED WITH k-MEANS ! (phew - I feel better now). But if you reexamine this situation from the setting of applied situations, you realize its utility. I want to solve a problem - I pick up some off-the-shelf algorithm. Maybe it's doesn't solve my problem exactly; maybe it solves a related problem. But it's either this, or some very complicated theoretical method that has no extant implementation, and is optimizing for a worst-case that I might never encounter. What do I do then ?

This is not a rant about worst-case analysis. Far from it. It's not a rant about O() notation either. What I'm merely trying to say is that a focus on worst-case analysis, asymptotic improvements, and provable guarantees, while essential to the enterprise of doing theory, leaves us little room for the kind of experience needed to do effective practical implementation of our ideas.