Sunday, May 01, 2011

SDM 2011: Other thoughts

My student Parasaran Raman did a two part review of SDM 2011 (here, and here), and I thought I'd add my own reflections.

There was an excellent talk on the second day by Haesun Park from Georgia Tech on non-negative matrix factorization (NMF) methods. This problem has become all the rage in ML circles, and is defined as follows.
Given an $m \times n$ matrix $A$ and parameter $k$, find an $m \times k$ matrix $U$ and a $k \times n$ matrix $V$ such that $\|A - UV\|$ is minimized, and $U$ and $V$ contain only nonnegative entries.
The main difference between this problem and the SVD is the non-negativity requirement, and not surprisingly, it changes the complexity quite a bit - you can no longer expect to compute an optimal solution efficiently, for example (the exact problem is NP-hard in general).

There appears to be relatively little theory-algorithms work on this problem (there's some mathematical work, including the famous Perron-Frobenius theorem), and her talk presented a good overview of the various heuristic strategies people have used to try to solve the problem.
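
To give a flavor of what such a heuristic looks like, here is a minimal sketch of one well-known approach, the Lee-Seung multiplicative update rule for the Frobenius norm. I'm not claiming this is one of the methods covered in the talk; the function name `nmf_multiplicative`, the toy matrix, and the parameters `n_iter`, `eps`, and `seed` are all made up for illustration. The updates keep $U$ and $V$ nonnegative because they only ever multiply by nonnegative ratios, but they come with no guarantee of reaching a global optimum.

```python
import numpy as np


def nmf_multiplicative(A, k, n_iter=500, eps=1e-9, seed=0):
    """Heuristically approximate a nonnegative m x n matrix A as U @ V,
    with U (m x k) and V (k x n) both entrywise nonnegative.

    Illustrative sketch of Lee-Seung multiplicative updates; it is a
    local-search heuristic and does not guarantee a global optimum.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Random positive starting point (an arbitrary choice for this sketch).
    U = rng.random((m, k)) + eps
    V = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Each factor is rescaled entrywise by a nonnegative ratio,
        # so nonnegativity is preserved automatically.
        V *= (U.T @ A) / (U.T @ U @ V + eps)
        U *= (A @ V.T) / (U @ V @ V.T + eps)
    return U, V


# Hypothetical toy data: a nonnegative matrix that has an exact
# nonnegative rank-2 factorization by construction.
data_rng = np.random.default_rng(1)
A = data_rng.random((6, 2)) @ data_rng.random((2, 5))

U, V = nmf_multiplicative(A, k=2)
rel_err = np.linalg.norm(A - U @ V) / np.linalg.norm(A)
print("relative Frobenius error:", rel_err)
print("smallest entry of U and V:", U.min(), V.min())
```

Different random seeds can land on different factorizations with different errors, which is a small-scale reminder that the method is a heuristic rather than an exact solver.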

One of the reasons this formulation is interesting is that for many problems, you'd like an "explanation" of the data that doesn't have negative coefficients (which are the bane of PCA-like methods). She also says that for reasons unknown, the matrices produced by this formulation tend to be quite useful "in practice" at factoring out the different interactions present in data. In fact, one of her big open questions is whether there's a more rigorous way of explaining this.
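
The point about negative coefficients is easy to see on a small example. The sketch below (toy data and variable names are my own, purely for illustration) takes a strictly positive matrix and looks at its rank-2 truncated SVD. The top singular vector of a positive matrix has entries of a single sign, but every later singular vector has to be orthogonal to it, and so must mix signs.

```python
import numpy as np

# Hypothetical strictly positive data matrix (entries drawn from (0, 1)).
rng = np.random.default_rng(0)
A = rng.random((6, 5))

# Thin SVD: A = U_svd @ diag(s) @ Vt.
U_svd, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
U_k = U_svd[:, :k] * s[:k]  # left factors scaled by the singular values
V_k = Vt[:k, :]             # right factors

# The second singular vector is orthogonal to a single-signed first one,
# so at least one of the rank-k factors contains a negative entry.
has_negative = bool((U_k < 0).any() or (V_k < 0).any())
print("rank-2 SVD factors contain negative entries:", has_negative)
print("second left singular vector:", U_svd[:, 1])
```

So even when every entry of the data is positive, the best rank-2 "explanation" in the SVD sense needs cancellation between components, which is exactly what the non-negativity constraint rules out.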

The talk was very enjoyable, and I sat there thinking that it would be perfect as an ALENEX (or even SODA) invited talk.

There was also a panel discussion on the future of data mining. The panel comprised two industry folks, two academics, and one researcher from a national lab, with questions being thrown at them by Chandrika Kamath from LLNL. My Twitter stream gave a blow-by-blow, so I won't rehash it here. I was intrigued (and a little disappointed) when I realized that almost all the answers centered around the needs of business.

In one respect this is not surprising: the primary reason data mining is so hot right now is the vast opportunity it offers to sell ads, sell products, or even just model consumer behavior. But it makes the field a little more shallow as a consequence: inevitably, industry needs solutions, not deep problems, and a relentless focus on solutions doesn't facilitate deeper analysis of the problems. Maybe that's the nature of data mining, and I'm expecting too much in asking it to be more profound. But I wouldn't mind a more ML-like sense of rigor in the formulation of computational questions.

Overall, I quite enjoyed the conference. I've been to KDD, and have been on PCs for both KDD and ICDM (the third major data mining conference). To me, SDM has the feel of a SODA/SoCG-like conference - a smaller, more academic crowd, more mathematics/statistics in the talks, and less of a big industrial presence. I can definitely see myself publishing there and going back again.
