My student Parasaran Raman and I are at the SIAM Conference on Data Mining (SDM) 2011, where he's presenting this paper on "spatially aware" consensus. In between preparation for his very first conference talk (yay!) he had enough time to jot down his thoughts on day 1. As usual, I've done light editing for formatting, and take responsibility for all errors in transmission. Note that SIAM hasn't yet placed papers online, and so when available I'll add links.
Collecting real data has always been a real problem for me. All the experiments with synthetically generated data and shouting "bingo!" after getting good results on them for the meta-clustering problems that I work on, was getting too boring. SDM11 had a tutorial on "Data Mining for Healthcare management" and I was interested to see what kind of healthcare-related data I can get to play with. Especially because almost all of the papers that uses consensus clustering techniques on real data have been motivated by analysis of either microarray gene expression or protien sequences. Prasanna Desikan of Allina Hospitals and Clinics talked about how terabytes of data generated by multiple sources drives the need for efficient knowledge based systems. It was heartening to see many publicly available datasets that require mining. The key challenges from a strictly CS perspective are the usual (1) Dealing with missing data/attributes and (2) Outliers and noisy data which are even more pronounced in the healthcare sector.
I also sat through a few talks in the classification session. Byron Wallace talked about letting multiple experts involve in active learning and "who should label what?". The talk focussed on how a single infallible oracle is impractical and stressed the need for multiple experts. The key idea was to save the hard questions that are usually not many for the experienced labelers who are expensive and let the inexperienced labelers do the load of work when the questions are rather easy. An interesting question that came up was about the feasibility of partitioning the question set into easy and hard.
Sanjay Chawla talked about supervised learning for skewed data where the classes of different sizes causes the hyperplane separating the classes to be pushed towards the class with lesser points (ed: the imbalanced classification problem). The solution provided was to use a quadratic mean loss instead of just the arithmetic mean. The loss function is convex and can be solved by usual convex solvers.
Liang Sun spoke about a boosting framework for to learn the optimal combination of individual base kernel classifiers. The algorithm goes through multiple boosting trials. During each trial, one classifier for each kernel is learned. The 'multi-kernel' classifier for each stage in the boosting process is a function of the input classifiers. Their choice is to pick the best one. The boosing process itself is otherwise similar to AdaBoost.
There was also a lightning session and true to its name, there were 25 talks in 75 minutes! I could relate to a meta-clustering work by Yi Zhang and Tao Li where the goal was to find a k-consensus solution from a bunch of input partitions of the same data. The method works in four steps: (1) generate multiple input partitions, (2) use Mallows distance (Zhou et. al.) to compute distance between partitions, (3) build a hierarchy over the partitions and make a cut to get a 'clustering' of partitions and (4) generate a consensus partition on each 'cluster' from the cut. I have found it difficult to find data that admits multiple partitions that are good but different from each other at the same time, and that make me ask the question "How do you generate the different partitions?". The answer sadly was "We run kmeans 38 times!". I am still looking for a motivation for multiple alternate partitions.
The reception had just cheese and a few veggies, but I have been having some good food here in Phoenix and I am not complaining.