(tl;dr a bucket load of trends/papers that I found interesting at the conference)
I just returned from NIPS in Lake Tahoe. For those not in the know, NIPS is one of the premier machine learning conferences, and has been growing rapidly over the last few years. This year's incarnation (the first of at least 7 in Lake Tahoe) has over 1600 attendees, a gazillion workshops, 4 tutorials, and more papers each DAY than all of STOC.
The format of the conference is very interesting. There are keynotes each morning and afternoon, a few selected 20 minute talks, and a few more more 5 minute "spotlight" talks, all in single session mode. All accepted papers get a poster slot one of of four days, and the poster session is a mammoth event from 7pm to midnight each night (yes, you read that correctly).
I'll say more about the format in a later post. For the next few posts, I'll highlight some of the papers and trends that were interesting to me. For more conference blogging, check out Anand Sarwate's sequence (I, II), and Hal Daumé's recap.
Perhaps the most obvious trend at the conference was on deep learning. While it's been in the news recently because of the Google untrained search for youtube cats, the methods of deep learning (basically neural nets without lots of back propagation) have been growing in popularity over a long while, and I was awash in autoencoders, pooling, and other jargon from the area. It was amusing to see speakers apologize for NOT talking about deep learning.
Stéphane Mallat's keynote on this topic was a thing of beauty. While I understood very little of the technical details, the central idea was that by using certain convolution-based encodings, and looking for "invariant substructures", you could essentially filter out irrelevant noise in the feature space, generating new features for the next level of the learning network. This simple idea is quite powerful: there was a paper on learning faces from videos from Coates, Karpathy and Ng that used a simple version of this idea by using k-means with lots of clusters to identify these low-D subspaces and factor them out.
Another trend that's been around for a while, but was striking to me, was the detailed study of optimization methods. Optimization is a major theme in machine learning. This is primarily because so many machine learning problems can be formulated as convex/nonconvex optimizations. Solving these problems takes a lot of ingenuity and engineering, and after you've been swimming in a pond of regularization, sparsity, mixed norms, stochastic gradient descent, and alternating optimization for a while, things can get crazy. There are at least two different workshops on optimization in machine learning (DISC and OPT), and numerous papers that very carefully examined the structure of optimizations to squeeze out empirical improvements.
If I had to mount a critique of this large body of work, it would only be that a lot of this appears to be black magic (a criticism that seems even more true for deep learning). There are number of tricks in the optimization toolbox, and you can always try any number of them, but it's not always clear which methods work best, and why.
Quick hits on papers that interested me:
Bregman divergences: A paper by Iyer and Bilmes show how to extend Bregman divergences to the case when the generating function $\phi$ is not convex, but sub modular. This is a little tricky, because in such cases there's no unique gradient, and so technically the Bregman divergence is a function of both the function and its subdifferential. One interesting variant is the Lovasz Bregman divergence, which comes from using the Lovasz extension of a submodular function as the convex generator for a Bregman divergence. An application of these constructions comes in ranking, and it's interesting that things like the Hamming distance can be constructed.
On a separate note, Satoru Fujishige gave a wonderful invited lecture at DISC on submodular functions. I particularly liked the geometric interpretation of submodularity via its polymatroid and how a matroid can be viewed as the object you get when the submodular function is monotone. Of course this has been known for over 40 years, but I "got it" for the first time. If his book is anything like his talk, I really want to get it.
Kernel distances: Ever since I discovered the kernel distance (as in, found it, not invented it) I've been fascinated by how it behaves more or less like the earth mover distance, but is so much easier to compute. Scott Aaronson (at his NIPS invited talk) made this joke about how nature loves $\ell_2$. The kernel distance is "essentially" the $\ell_2$ variant of EMD (which makes so many things easier).
There's been a series of papers by Sriperumbudur et al. on this topic, and in a series of works they have shown that (a) the kernel distance captures the notion of "distance covariance" that has become popular in statistics as a way of testing independence of distributions. (b) as an estimator of distance between distributions, the kernel distance has more efficient estimators than (say) the EMD because its estimator can be computed in closed form instead of needing an algorithm that solves a transportation problem and (c ) the kernel that optimizes the efficient of the two-sample estimator can also be determined (the NIPS paper).
Metrics for PSD: In my continuing quest to find strange geometries to design algorithms for, I had this paper some time ago on doing geometry on the Riemannian manifold of positive definite matrices. It's a tricky business, and life gets sad quickly when you can't describe the convex hull of three points finitely.
Suvrit Sra proposed a new metric for the space of $n \times n$ positive definite matrices. It's what you get if you apply the Jensen-Shannon construction to the Burg entropy on matrices. It has the property of being more easily computable than the standard Riemannian metric, and also shares with that metric a nice closed form for the midpoint of two matrices. What remains open is the kind of geometry this metric induces: a weird property is that the associated kernel exp(-\gamma d^2) is p.d iff $\gamma$ is half integral below $n-1$, and any real above it.
Distance metric learning and MKL.
Our presentation at OPT was about multiple kernel learning, which is closely related to the general problem of distance metric learning (especially when the distance metric is defined by a kernel). There were a good number of distance metric learning at NIPS - different approaches that either did local learning of "multi-metrics" or learned a metric based on k-NN points, or even Hamming distance metric learning.
Distributed learning (in a communication-restricted environment) seems to be a growing concern, which of course makes me happy. There were a few papers on different kinds of communication-limited learning, including one on doing this for probabilistic PCA (which didn't have formal bounds on the amount of communication though) and one on distributed "non-stochastic" experts.