Saturday, December 15, 2012

NIPS II: Deep Learning and the evolution of data models

(tl;dr: some rambles and musings on deep learning and data, as I attempt to sort out in my head what this all means)

Over the years, as we've engaged with "big data" more and more, the way we construct mental models of data has changed. And as I've argued before, understanding how we think about data, and what shape we give it, is key to the whole enterprise of finding patterns in data.

The model that one always starts with is Euclidean space. Data = points, features = dimensions, and so on. And as a first approximation of a data model, it isn't terrible.

There are many ways to modify this space. You can replace the $\ell_2$ norm by $\ell_1$. You can normalize the points (again with $\ell_2$ or $\ell_1$, sending you to the sphere or the simplex). You can weight the dimensions, or even do a wholesale scale-rotation.

But that's not all. Kernels take this to another level. You can encode weak nonlinearity in the data by assuming that it's flat once you lift it. In a sense, this is still an $\ell_2$ space, but a larger class of spaces that you can work with. The entire SVM enterprise was founded on this principle.

But that's not all either. The curse of dimensionality means that it's difficult to find patterns in such high dimensional data. Arguably, "real data" is in fact NOT high dimensional, or is not generated by a process with many parameters, and so sparsity-focused methods like compressed sensing start playing a role.

But it gets even more interesting. Maybe the data is low-dimensional, but doesn't actually lie in a subspace. This gets you into manifold learning and variants: the data lies on a low-dimensional curved sheet of some kind, and you need to learn on that space.

While the challenge for geometry (and algorithms) is to keep up with the new data models, the challenge for data analysts is to design data models that are realistic and workable.

So what does this have to do with deep learning ?

Deep learning networks "work" in that they appear to be able to identify interesting semantic structures in data that can be quite noisy. But to me it's not entirely clear why that is. So I've been mulling the question of what kind of structure in data might be "visible" to such a network. In the absence of formal results ("if your data can be separated in some RKHS, then an SVM will find it"), what follows is some speculation, based on talks I attended and conversations I had with NIPs attendees.

Observation I: Both StephanÃ© Mallat's talk and a nice paper by Coates, Karpathy and Ng talk about the idea of "first-level" features that identify (low-dimensional) invariants and eliminate noise. In the Coates et al paper, they start with a very fine $k$-means clustering ($k$ very large), and attempt to "glue" the centers together into low dimensional subspace pieces. These are then propagated up a network of higher-order feature processors.

Observation 2: Yoshua Bengio's survey of deep learning (a very readable account) points to work by Geoff Hinton as part of the reinvigoration of the field. A central idea of this work is that deep belief networks can be trained "layer by layer", where each layer uses features identified from the previous layer.

If you stare at these things long enough, you begin to see a picture not of sparse data, or low-rank data, or even manifold data. What you see is a certain hierarchical collection of subspaces, where low-dimensional spaces interact in a low dimensional way to form higher level spaces, and so on. So you might have a low-level "lip" feature described by a collection of 2-3 dimensional noisy subspaces in an image space. These "lip" features in turn combine with "eye" features and so on.

This might seem rather far fetched, and a strange way to think about data. But I can't claim originality even here. Back in June, Mauro Maggioni gave a talk at CGWeek in Chris Bishop's workshop on the connection between analysis and geometry. In this talk, he described a multi-resolution view on data that admits structure at different scales, and admits different structures at these scales.

The upshot of all of this: it is possible that deep learning is trying to capture hierarchies of low dimensional subspaces that interact in a low dimensional way. This would explain how one is able to avoid the curse of dimensionality, and might also explain why it sometimes can find structure that other methods (kernels, manifold methods, etc) might miss.

Problem is: I'm not sure how one tests this hypothesis. Almost any data set could be viewed this way if you allow enough features and enough "depth" in the hierarchy.