## Thursday, April 17, 2014

### STOC 2014 announcement.

Howard Karloff writes in to remind everyone that the STOC 2014 early registration deadline is coming up soon (Apr 30 !). Please make sure to register early and often (ok maybe not the last part). There will be tutorials ! workshops ! posters ! papers ! and an off-off-Broadway production of Let It Go, a tragicomic musical about Dick Lipton's doomed effort to stop working on proving P = NP.

At least a constant fraction of the above statements are true.

And if you are still unconvinced, here's a picture of Columbia University, where the workshops and tutorials will take place:

## Monday, April 07, 2014

### Directed isoperimetry and Bregman divergences.

A new cell probe lower bound for Bregman near neighbor search via directed hypercontractivity.

Amirali Abdullah and I have been pondering Bregman divergences for a while now. You can read all about them in my previous post on the topic. While they generalize Euclidean geometry in some nice ways, they are also quite nasty. The most important nastiness that they exhibit is asymmetry:
$D_\phi(x, y) \ne D_\phi(y, x)$
What's worse is that this asymmetry can grow without bound. In particular, we can quantify the degree of asymmetry by the parameter $\mu$:
$\mu = \max_{x,y \in \Delta} \frac{D_\phi(x,y)}{D_\phi(y,x)}$
where $\Delta$ is the domain of interest.

There's been a ton of work on clustering Bregman divergences, and our work on low-dimensional approximate nearest neighbor search. In almost all these results, $\mu$ shows up in the various resources (space and/or time), and it seems hard to get rid of it without making other assumptions.

After our low-D ANN result, we started trying to work on the high-dimensional version, and our thoughts turned to locality-sensitive hashing. After struggling for a while to get something that might be useful, we started pondering lower bounds. In particular, it seemed clear that the worse $\mu$ was, the harder it became to design an algorithm. So could we actually come up with a lower bound that depended on $\mu$ ?

Our first result was via a reduction from the Hamming cube (and $\ell_1$ near neighbors). While these gave us the desired lower bound, it wasn't $\mu$-sensitive. So we went looking for something stronger.

And that's where isoperimetry made an entrance. There's been a long line of results that prove lower bounds for LSH-like data structures for near neighbor search. Roughly speaking, they work in the following manner:

1. Construct a gap distribution: a distribution over inputs and queries such that a query point is very close to its nearest neighbor and very far from any other point in the space. In the $\ell_1$ case, what you want is something like $\epsilon d$ distance to the nearest neighbor, and $\Omega(d)$ distance to the second nearest neighbor.
2. Imagine your data structure to be a function over the Hamming cube (assigning blocks of points to cells) and look at what happens when you perturb it (formally, by defining the function value to be its average over all nearby points)
3. Use isoperimetry (or hypercontractivity) to argue that the function "expands" a lot. What this implies is that points get scattered everywhere, and so any particular cell you probe doesn't have enough information to determine where the true nearest neighbor actually is. This last step can be proved in different ways, either using Fourier methods, or via the use of Poincaré-type inequalities.

Our initial hope was to use this framework to prove our bound, while incorporating terms relating to the asymmetry of Bregman divergences more directly.

This turned out to be hard. The asymmetry destroys many of the nice algebraic properties associated with the noise operator and related inner products. There's even a deeper problem: if you think of an asymmetric noise operator as pushing things "forward" on a cube with one probability and backwards with another, then it's not hard to imagine a scenario where applying it actually ends up concentrating the mass of the function in a smaller region (which would break hypercontractivity).

We needed two things: a way to get around this not-so-small issue that hypercontractivity isn't true in a directed setting, and a way to analyze the asymmetric noise operator. It turned out that we could circumvent the first problem: we could restrict the class of hash functions we considered in a way that didn't significantly change their entropy (only a constant). Alternatively, we could view hypercontractivity as being true "on average".

The second problem was a bit harder. The solution we came up with was to try and link the directed noise operator to a symmetric operator on a biased space. The intuition here was that the directed operator was creating a nonuniform distribution, so maybe starting off with one might give us what we need.

This actually worked ! There are versions of the Bonami-Beckner inequality in biased spaces that look almost just like their counterparts in uniform space (you merely set the probability of a bit being 1 to $p \ne 1/2$, and the Fourier coefficients are defined in terms of $p$).

There was lots more to address even after the isoperimetry result, but you'll have to read the paper for that :).

I should note that in many ways, this was a "Simons baby": Amir visited during one of the workshops for the program on real analysis, and we had fruitful conversations with Elchanan Mossel and Nathan Keller in early stages of this work.

## Thursday, March 20, 2014

### Data Mining, machine learning and statistics.

How does one tell data mining, machine learning and statistics apart ?

If you spend enough time wandering the increasingly crowded landscape of Big Data-istan, you'll come across the warring tribes of Datamine, MachLearn and Stat, whose constant bickering will make you think fondly of the People's front of Judea:

Cosma Shalizi has what I think is a useful delineation of the three tribes that isn't prejudicial to any of them ("Stats is just inefficient learning !", "MachLearn is just the reinvention of statistics!" "DataMine is a series of hacks!"). It goes something like this:

• Data mining is the art of finding patterns in data.
• Statistics is the mathematical science associated with drawing reliable inferences from noisy data
• Machine learning is [the branch of computer science] that develops technology for automated inference (his original characterization was as a branch of engineering).
I like this characterization because it emphasizes the different focus: data mining is driven by applications, machine learning by algorithms, and statistics by mathematical foundations.

This is not to say that the foci don't overlap: there's a lot of algorithm design in data mining and plenty of mathematics in ML. And of course applied stats work is an art as much as a science.

But the primary driving force is captured well.

## Wednesday, March 19, 2014

### Reproducibility in data mining/learning

Is reproducibility in data mining/learning different from reproducibility in systems research ?

+John Regehr and +Shriram Krishnamurthi have discussions (John, Shriram) arising from a new paper on  reproducibility in computer systems research from people at the U. of Arizona. You should read their posts/G+ discussions for the specifics.

But it got me thinking: is there anything intrinsically different about reproducibility in my neck of the woods ? Many concerns are shared:

• The need for a standard computing environment to build and test code: MATLAB and scikit-learn make that a lot easier for us: we don't need to worry too much about details of the computing environment/hardware just to get the code to run.
• Discrepancies between what is stated in the paper and what the code actually does: that's a problem with any experimental work: John in fact mentions one such issue in one of his recent papers.
• Code rot: otherwise known as 'graduation' :). I think if more people used public repos like github from the beginning, some of these problem might go away. I would be remiss if I didn't also plug +Robert Ricci's new system AptLab for helping distribute research.
But here's one that may not be shared: data preparation.

It's no secret that data preparation takes most of the time in a data mining pipeline. This includes
• data cleaning (removing errors, reformatting data)
• feature engineering (deciding how to truncate, modify or retain certain features)
• training (cross-validation levels, what data is used for model building and so on)
As with any of the above, a proper specification can ensure true reproducibility, but there are lots of things we might do to data without necessarily even realizing it (if there are bugs in the transformation pipeline) or without thinking we need to mention it (truncating ranges, defining null values away).

Feature engineering can also be a problem, especially with models that use many features (for example, deep learning systems have lots of "knobs" to tune, and they can be quite sensitive to the knobs).

So one thing I often look for when reviewing such papers is sensitivity: how well can the authors demonstrate robustness with respect to the parameter/algorithm choices. If they can, then I feel much more confident that the result is real and is not just an artifact of a random collection of knob settings combined with twirling around and around holding one's nose and scratching one's ear.

## Tuesday, March 11, 2014

### A short not-review of the first episode of Cosmos

If you've been living on a rogue planet wandering through the far reaches of the Local group, you may not have heard of the Cosmos reboot on FOX with Neil deGrasse Tyson.

We finally got around to seeing the first episode yesterday. I was nervous because I loved the original so much and hoped that my kids would like it as well (as all parents know, you can't show fear or the sharks will smell it :)).

Things I liked:

• The homage to Carl Sagan was done beautifully. If you're going to reboot a franchise, you don't have to pretend that nothing existed before, you know.
• The mix of CG + real person and animation was a good way to tell the different stories. I was a little worried about the denouement of the Giordano Bruno tale because my six-year old was watching, but it was done without being too graphic.
• The graphics were non-cheesy: but then I wasn't ever worried about the quality of graphics in the original. I'm pretty sure that 20 years from now these graphics will look cheesy as well. The year-long calendar of the universe was a fantastic demonstration of the immensity of time.
Things I could live without:

Granted, it's the first episode, so there's no point in being too harsh. But
• If I had any heartstrings ever, they've been pulled out of shape into long strands of spaghetti. The constant sweeping orchestral background made me feel like I was watching NBC's coverage of the Winter Olympics. I almost expected to see a personal profile of how the universe overcame the adversity of the Big Bang to become the star performer it is today.
• As with all modern American shows purporting to be about science, I found the balance between sweeping rhetoric and actual facts to be disturbingly skewed towards the soaring fluffy. Watch some David Attenborough, people ! or even, I don't know, COSMOS itself. But see below...
Overall:
I liked it. No question that I'm watching it again. And the eight-year old loved it ! Thought it was full of information. So maybe my assessment of graphics-to-information ratio is completely off.

The funniest thing: when it was over, this happened:
Kids: "Let's see the next episode now!"
Me: We'll have to wait a week for it.
Kids: "What do you mean, we have to wait a week ?"
Netflix should be proud. Binge watching on demand is now the default mode.

## Tuesday, March 04, 2014

### Two big data talks at Haverford College

I'm giving two talks at Haverford College just before SDM 2014 in Philadelphia. If you're in the area. stop by and say hello !

## Thursday, February 06, 2014

### On "reverse engineering" algorithms, contextual readings, and the invisible choices that add value to research.

tl;dr: Understanding the value in a piece of work (a paper, an algorithm, a system) is not just a matter of understanding the various parts and how they fit. It is also a matter of understanding why the problem is decomposed that way. This observation has implications for how we read, write and evaluate research.

Nick Seaver is an cultural anthropologist at UC Irvine.  He just wrote an article at Medium on reverse engineering algorithms. In it he distinguished between the factual reconstruction of an algorithm ("how does it work") from a more probing examination of the way it was designed ("why it is broken down in a particular way").

He comes to a conclusion that is quite sound and not at all controversial:
I want to suggest here that, while reverse engineering might be a useful strategy for figuring out how an existing technology works, it is less useful for telling us how it came to work that way. Because reverse engineering starts from a finished technical object, it misses the accidents that happened along the way — the abandoned paths, the unusual stories behind features that made it to release, moments of interpretation, arbitrary choice, and failure. Decisions that seemed rather uncertain and subjective as they were being made come to appear necessary in retrospect. Engineering looks a lot different in reverse.
But along the way, he makes an insightful observation about the very idea of structuralism as applied to algorithm design: namely, the idea that by breaking down the parts of the algorithm and understanding the pieces and how they fit together we can "understand" the algorithm at some higher level.
When you break an object down into its parts and put it back together again, you have not simply copied it — you’ve made something new. A movie’s set of microtags, no matter how fine-grained, is not the same thing as the movie. It is, as Barthes writes, a “directed, interested simulacrum” of the movie, a re-creation made with particular goals in mind. If you had different goals — different ideas about what the significant parts of movies were, different imagined use-cases — you might decompose differently. There is more than one way to tear apart content. (emphasis added)
In other words, the value of a reductionist analysis of a piece of work is not just in understanding the parts and how they fit, but in understanding the choices that led to that particular decomposition.

I think there are important lessons here for anyone involved in the creation and evaluation of new work. In particular:

Reading: While this is mostly advice for students, it applies whenever you're reading unfamiliar material. The particular form of the paper -- how the proofs are structured, how the system is built, or how the algorithm components work -- is a choice made by the authors and should not be left unexamined or unquestioned. All too often a student will read a paper as a factual news report, rather than reading it as a specific interpretation of a problem/algorithm/theorem that could lend itself to multiple interpretations (just a few days ago I was discussing some JL results with a colleague and realized that we had totally distinct and valid interpretations of some recent work in the area).

Reviewing: It's very easy (and common) to read a paper, understand the proofs, and then be completely underwhelmed by it: the dreaded "too simple" feeling that leads papers to get rejected from unnamed theory conferences. This is especially true when you're not familiar with an area, and don't realize that it was the choices the author(s) made that made the paper look simple. And so your assessment has to factor that choice in, rather than taking it for granted.

Writing/Presenting: Of course all of this impacts how you as a creator choose to present your work. There is not one way to tell a story (in writing or in a talk). But you do make choices about how to tell it. And so it's important to make those choices visible (either by explaining why they are needed, or why different choices don't work, or how they get around certain roadblocks) so that your work can receive a fair evaluation.

This can be excruciating, especially when (like many good papers) your work admits multiple interpretations, and you have to gamble on the right one to please the fickle and time-stretched reviewer. But be mindful of these choices in your work.

p.s On a separate note, it's intriguing to me how so many tools from the study of literature (narrative, contextual analysis, critical reading) show up in other forms of writing far removed from the humanities. This is yet another reason why I'm saddened (even though I'm on the "right" team) by the relentless focus on STEM in opposition to the liberal arts. It's a particularly virulent form of divide-and-conquer (the British colonial technique, not the algorithmic paradigm).

p.p.s Has it really been three months since I wrote anything ? I must be working !!