Tuesday, April 22, 2014

The Shape of Information

A brief synopsis of my general-audience talk at Haverford College. 

I'm currently visiting Haverford College at the invitation of +Sorelle Friedler as part of Haverford's big data lecture series. Today's talk was a general audience talk about data mining, titled 'The Shape Of Information': (GDrive link)
The Shape Of Information 
What makes data mining so powerful, and so ubiquitous? How can the same set of techniques identify patients at risk for a rare genetic disorder, consumers most likely to like Beyonce's latest album, or even a new star from an sky survey ?  
The answer starts with an idea Descartes had nearly 500 years ago. He suggested expressing geometry in terms of numbers (coordinates). This turned out to be a powerful technique that led (among other things) to the development of the calculus. Data mining returns the favor. It starts with sets of numbers that describe a collection of objects. To find patterns in these objects, we create a geometry in which the numbers are coordinates. And just like that, objects become shapes, and the search for information becomes a quest for common structure in these shapes. 
In this search, we are not limited by the geometry of our world: we can dream up ever more intricate geometries that capture the shape of the information that we seek to find in our data. In this sense, data mining is the best kind of science fiction come to life: we craft a world out of our imagination, and let the laws of this world lead us to fascinating discoveries about the data that inhabits it.
I had a great time visiting with Sorelle's students in their data mining class. Haverford College continues to impress me with the quality of their undergrads (and their faculty !)

Friday, April 18, 2014

danah boyd, Randall Munro, and netizens.

danah boyd, author of 'It's Complicated' just gave a tech talk at Google. Her book has been in the news a lot lately, so I'll skip the details (although Facebook ought to be at least slightly worried).

But what I enjoyed the most about her talk was the feeling that I was listening to a true netizen: someone who lives and breathes on the internet, understands (and has helped build) modern technology extremely well (she is a computer scientist as well as an ethnographer), and is able to deliver a subtle and nuanced perspective on the role and use of technology amidst all the technobabble (I'm looking at you, BIG data) that inundates us.

And she delivers a message that's original and "nontrivial". Both about how teens use and interact with social media, and about how we as a society process technological trends and put them in context of our lives. Her discussion of context collapse was enlightening: apart from explaining why weddings are such fraught experiences (better with alcohol!) it helped me understand incidences of cognitive frisson in my own interactions.

What she shares with Randall Munro in my mind is the ability to speak unselfconsciously and natively in a way that rings true for those of us who inhabit the world of tech, and yet articulate things that we might have felt, but are unable to put into words ourselves. Of course they're wildly different in so many other ways, but in this respect they are like ambassadors of the new world we live in.

Thursday, April 17, 2014

STOC 2014 announcement.

Howard Karloff writes in to remind everyone that the STOC 2014 early registration deadline is coming up soon (Apr 30 !). Please make sure to register early and often (ok maybe not the last part). There will be tutorials ! workshops ! posters ! papers ! and an off-off-Broadway production of Let It Go, a tragicomic musical about Dick Lipton's doomed effort to stop working on proving P = NP.

At least a constant fraction of the above statements are true.

And if you are still unconvinced, here's a picture of Columbia University, where the workshops and tutorials will take place:

Monday, April 07, 2014

Directed isoperimetry and Bregman divergences.

A new cell probe lower bound for Bregman near neighbor search via directed hypercontractivity. 

Amirali Abdullah and I have been pondering Bregman divergences for a while now. You can read all about them in my previous post on the topic. While they generalize Euclidean geometry in some nice ways, they are also quite nasty. The most important nastiness that they exhibit is asymmetry:
\[ D_\phi(x, y) \ne D_\phi(y, x)\]
What's worse is that this asymmetry can grow without bound. In particular, we can quantify the degree of asymmetry by the parameter $\mu$:
\[ \mu = \max_{x,y \in \Delta} \frac{D_\phi(x,y)}{D_\phi(y,x)} \]
where $\Delta$ is the domain of interest.

There's been a ton of work on clustering Bregman divergences, and our work on low-dimensional approximate nearest neighbor search. In almost all these results, $\mu$ shows up in the various resources (space and/or time), and it seems hard to get rid of it without making other assumptions.

After our low-D ANN result, we started trying to work on the high-dimensional version, and our thoughts turned to locality-sensitive hashing. After struggling for a while to get something that might be useful, we started pondering lower bounds. In particular, it seemed clear that the worse $\mu$ was, the harder it became to design an algorithm. So could we actually come up with a lower bound that depended on $\mu$ ?

Our first result was via a reduction from the Hamming cube (and $\ell_1$ near neighbors). While these gave us the desired lower bound, it wasn't $\mu$-sensitive. So we went looking for something stronger.

And that's where isoperimetry made an entrance. There's been a long line of results that prove lower bounds for LSH-like data structures for near neighbor search. Roughly speaking, they work in the following manner:

  1. Construct a gap distribution: a distribution over inputs and queries such that a query point is very close to its nearest neighbor and very far from any other point in the space. In the $\ell_1$ case, what you want is something like $\epsilon d$ distance to the nearest neighbor, and $\Omega(d)$ distance to the second nearest neighbor. 
  2. Imagine your data structure to be a function over the Hamming cube (assigning blocks of points to cells) and look at what happens when you perturb it (formally, by defining the function value to be its average over all nearby points)
  3. Use isoperimetry (or hypercontractivity) to argue that the function "expands" a lot. What this implies is that points get scattered everywhere, and so any particular cell you probe doesn't have enough information to determine where the true nearest neighbor actually is. This last step can be proved in different ways, either using Fourier methods, or via the use of Poincaré-type inequalities.

Our initial hope was to use this framework to prove our bound, while incorporating terms relating to the asymmetry of Bregman divergences more directly.

This turned out to be hard. The asymmetry destroys many of the nice algebraic properties associated with the noise operator and related inner products. There's even a deeper problem: if you think of an asymmetric noise operator as pushing things "forward" on a cube with one probability and backwards with another, then it's not hard to imagine a scenario where applying it actually ends up concentrating the mass of the function in a smaller region (which would break hypercontractivity).

We needed two things: a way to get around this not-so-small issue that hypercontractivity isn't true in a directed setting, and a way to analyze the asymmetric noise operator. It turned out that we could circumvent the first problem: we could restrict the class of hash functions we considered in a way that didn't significantly change their entropy (only a constant). Alternatively, we could view hypercontractivity as being true "on average".

The second problem was a bit harder. The solution we came up with was to try and link the directed noise operator to a symmetric operator on a biased space. The intuition here was that the directed operator was creating a nonuniform distribution, so maybe starting off with one might give us what we need.

This actually worked ! There are versions of the Bonami-Beckner inequality in biased spaces that look almost just like their counterparts in uniform space (you merely set the probability of a bit being 1 to $p \ne 1/2$, and the Fourier coefficients are defined in terms of $p$).

There was lots more to address even after the isoperimetry result, but you'll have to read the paper for that :).

I should note that in many ways, this was a "Simons baby": Amir visited during one of the workshops for the program on real analysis, and we had fruitful conversations with Elchanan Mossel and Nathan Keller in early stages of this work.                                                                                                               

Thursday, March 20, 2014

Data Mining, machine learning and statistics.

How does one tell data mining, machine learning and statistics apart ?

If you spend enough time wandering the increasingly crowded landscape of Big Data-istan, you'll come across the warring tribes of Datamine, MachLearn and Stat, whose constant bickering will make you think fondly of the People's front of Judea:

Cosma Shalizi has what I think is a useful delineation of the three tribes that isn't prejudicial to any of them ("Stats is just inefficient learning !", "MachLearn is just the reinvention of statistics!" "DataMine is a series of hacks!"). It goes something like this:

  • Data mining is the art of finding patterns in data.
  • Statistics is the mathematical science associated with drawing reliable inferences from noisy data
  • Machine learning is [the branch of computer science] that develops technology for automated inference (his original characterization was as a branch of engineering).
I like this characterization because it emphasizes the different focus: data mining is driven by applications, machine learning by algorithms, and statistics by mathematical foundations. 

This is not to say that the foci don't overlap: there's a lot of algorithm design in data mining and plenty of mathematics in ML. And of course applied stats work is an art as much as a science. 

But the primary driving force is captured well. 

Wednesday, March 19, 2014

Reproducibility in data mining/learning

Is reproducibility in data mining/learning different from reproducibility in systems research ?

+John Regehr and +Shriram Krishnamurthi have discussions (John, Shriram) arising from a new paper on  reproducibility in computer systems research from people at the U. of Arizona. You should read their posts/G+ discussions for the specifics.

But it got me thinking: is there anything intrinsically different about reproducibility in my neck of the woods ? Many concerns are shared:

  • The need for a standard computing environment to build and test code: MATLAB and scikit-learn make that a lot easier for us: we don't need to worry too much about details of the computing environment/hardware just to get the code to run. 
  • Discrepancies between what is stated in the paper and what the code actually does: that's a problem with any experimental work: John in fact mentions one such issue in one of his recent papers.
  • Code rot: otherwise known as 'graduation' :). I think if more people used public repos like github from the beginning, some of these problem might go away. I would be remiss if I didn't also plug +Robert Ricci's new system AptLab for helping distribute research.
But here's one that may not be shared: data preparation.

It's no secret that data preparation takes most of the time in a data mining pipeline. This includes
  • data cleaning (removing errors, reformatting data)
  • feature engineering (deciding how to truncate, modify or retain certain features)
  • training (cross-validation levels, what data is used for model building and so on)
As with any of the above, a proper specification can ensure true reproducibility, but there are lots of things we might do to data without necessarily even realizing it (if there are bugs in the transformation pipeline) or without thinking we need to mention it (truncating ranges, defining null values away). 

Feature engineering can also be a problem, especially with models that use many features (for example, deep learning systems have lots of "knobs" to tune, and they can be quite sensitive to the knobs).

So one thing I often look for when reviewing such papers is sensitivity: how well can the authors demonstrate robustness with respect to the parameter/algorithm choices. If they can, then I feel much more confident that the result is real and is not just an artifact of a random collection of knob settings combined with twirling around and around holding one's nose and scratching one's ear. 

Tuesday, March 11, 2014

A short not-review of the first episode of Cosmos

If you've been living on a rogue planet wandering through the far reaches of the Local group, you may not have heard of the Cosmos reboot on FOX with Neil deGrasse Tyson.

We finally got around to seeing the first episode yesterday. I was nervous because I loved the original so much and hoped that my kids would like it as well (as all parents know, you can't show fear or the sharks will smell it :)).

Things I liked:

  • The homage to Carl Sagan was done beautifully. If you're going to reboot a franchise, you don't have to pretend that nothing existed before, you know. 
  • The mix of CG + real person and animation was a good way to tell the different stories. I was a little worried about the denouement of the Giordano Bruno tale because my six-year old was watching, but it was done without being too graphic. 
  • The graphics were non-cheesy: but then I wasn't ever worried about the quality of graphics in the original. I'm pretty sure that 20 years from now these graphics will look cheesy as well. The year-long calendar of the universe was a fantastic demonstration of the immensity of time. 
Things I could live without:

Granted, it's the first episode, so there's no point in being too harsh. But 
  • If I had any heartstrings ever, they've been pulled out of shape into long strands of spaghetti. The constant sweeping orchestral background made me feel like I was watching NBC's coverage of the Winter Olympics. I almost expected to see a personal profile of how the universe overcame the adversity of the Big Bang to become the star performer it is today. 
  • As with all modern American shows purporting to be about science, I found the balance between sweeping rhetoric and actual facts to be disturbingly skewed towards the soaring fluffy. Watch some David Attenborough, people ! or even, I don't know, COSMOS itself. But see below...
I liked it. No question that I'm watching it again. And the eight-year old loved it ! Thought it was full of information. So maybe my assessment of graphics-to-information ratio is completely off. 

The funniest thing: when it was over, this happened:
Kids: "Let's see the next episode now!"
Me: We'll have to wait a week for it.
Kids: "What do you mean, we have to wait a week ?"
Netflix should be proud. Binge watching on demand is now the default mode. 

Disqus for The Geomblog