## Thursday, July 23, 2009

### Clustering: Hierarchical methods

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

k-center, k-median, k-means, k-medoids, .... The list is endless, but that pernicious k comes up everywhere. What can we do about it ? There are really two ways to go:
1. Figure out the "right" k for a problem. This is a complicated matter, and will be the topic of a later post
2. Don't choose: give the user a universal representation from which they can decide what k they want.

This latter formulation takes us into the world of hierarchical clustering, the topic of this post.

There are two different ways to think about hierarchical clustering: a representational view and an algorithmic view. The algorithmic view is quite simple: let's try to design a clustering via greedy operations. In the top-down view, we find a good split of the data into two parts, and then recurse on each side. In the bottom-up view, we select two clusters for merging (in the beginning, all items are in separate clusters), and merge our way up.

Perhaps the most well known of the bottom-up methods is the 'single link' clustering algorithm, in which at each step, you merge the two clusters that are closest to each other. If this sounds familiar, it should: this is merely Kruskal's MST algorithm run partially to completion.

The "single-link" method is optimistic: it creates clusters via connected components, assuming that transitivity is enough to construct a cluster. The "complete-link" approach is more pessimistic: rather than defining the distance between two clusters as their nearest pair distance, it defines it as the furthest pair distance, ensuring a clique-like structure for clusters. Merging happens as before.

Other variants that are more robust to noise average the pairwise distances to obtain the clustering distance. With the right choice of distance, these amount to comparing the centroids of clusters.

You'll notice that I didn't actually define a problem that these algorithms solve, keeping in with the grand tradition of clustering :). There is a way of defining a general optimization function that single-link clustering solves optimally. For the others though, although we can define a cost function, we can't show that they're optimal.

So much for the algorithmic view of HC. The representational view goes somewhat deeper.

I've had at least one person (you know who you are) tell me that one of the only two clustering algorithms that worked for them is hierarchical clustering. And indeed, the idea is very seductive. Rather than present the user with a single k-clustering - a snapshot, if you will, of the data - we give them a tree of merges, in which there is one leaf for each object. The central idea here, and one that bears emphasizing, beacuse it's so different to how we've thought about clustering thus far, is this:

we understand the structure of data from its transitions, rather than from its states.

What I mean is this: rather than defining a clustering as a static partitioning of the data, we watch the data evolve'' as we start merging points together, and use the merge history to figure out what the real structure is. There are a couple of ways of doing this: the most common way is to imagine drawing the tree from top to bottom, and then cutting it by a y-monotone curve (one that intersects any vertical line in exactly one point). This can be done to ensure we have k clusters, or it can be done at points where the merge appears to be stable'' (the subtree doesn't merge with any other subtrees for a long time).

A particularly effective visual metaphor for this is the dendrogram, which is very popular in evolutionary tree analysis.

The dendogram is visually appealing because it does two things: firstly, it depicts the tree of merges, permuting the nodes so that there aren't crossings. Secondly, and more importantly, it uses the lengths of edges as a visual marker to indicate the 'time of merge' for clusters. If we think of starting at time 0 and merging clusters, then a cluster that sticks around for a long time will have a tall edge connecting it to its parent, in comparison with a cluster that gets merged quickly. (as an aside, visual metaphors are possibly underrated when it comes to thinking about clustering algorithms. My firm belief is that one of the reasons soft clusterings aren't used as much as hard clusterings is because it's hard to visualize them)

Sidebar:
And this is the key operational idea behind clustering from the transitions'': clusters that stay unmerged longer are likely to be more interesting than clusters that get merged quickly. Till recently, this was merely a heuristic way of thinking about hierarchical clusterings. However, we now have the idea of persistence that comes from the realm of computational topology. In short, persistence is a way of quantifying topological features of a shape that persist across different scales. Persistence has been studied extensively as a rigorous way of quantifying the triviality (or non-triviality) of features in a shape, and there's a recent paper that applies persistence-like concepts to clustering as well. It's still early days for this direction, but I think it's a promising one.

Returning to hierarchical clustering, one major problem with this approach is that it's local: make the wrong choice of merge early on, and you'll never get to the optimal solution for a k-clustering problem. And since there's no easy way to change your mind (i.e split a clustering), it's hard to reverse bad decisions. If you're so inclined (and I am nowadays), this is the problem of finding monotone paths in the merge-split lattice on partitions of [1..n], but I digress...

Ultimately, the value of hierarchical clustering comes in its ability to represent an entire history of clusterings, rather than just one, and the relative ease with which a bottom-up algorithm can be written. That, coupled with its uses where you really do want to find a tree (evolutionary or otherwise) is what makes it a popular implement in the clustering toolbag.

## Wednesday, July 15, 2009

### Consistent BibTeX formatting

I try not to write BibTeX by hand any more: too easy to introduce errors. So I usually use either DBLP or the ACM digital library to get BibTeX for papers. Sometimes the journal has BibTeX, or some format that can be converted. As an aside, IEEE is extremely lame: you have to login to their digital library even to get a citation !

For the most part, I don't need to go beyond ACM or DBLP, which is great. But here's the problem: their formats are different ! I needed the BibTeX for a recent paper of mine, and found it on both sites. Here's what ACM gave me:
@inproceedings{1516372,
author = {Ahmadi, Babak and Hadjieleftheriou, Marios and Seidl, Thomas and Srivastava, Divesh and Venkatasubramanian, Suresh},
title = {Type-based categorization of relational attributes},
booktitle = {EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology},
year = {2009},
isbn = {978-1-60558-422-5},
pages = {84--95},
location = {Saint Petersburg, Russia},
doi = {http://doi.acm.org/10.1145/1516360.1516372},
publisher = {ACM},
address = {New York, NY, USA},
}

and here's what DBLP gave me:
Thomas Seidl and
Divesh Srivastava and
Suresh Venkatasubramanian},
title = {Type-based categorization of relational attributes},
booktitle = {EDBT},
year = {2009},
pages = {84-95},
ee = {http://doi.acm.org/10.1145/1516360.1516372},
crossref = {DBLP:conf/edbt/2009},
bibsource = {DBLP, http://dblp.uni-trier.de}
}

@proceedings{DBLP:conf/edbt/2009,
editor = {Martin L. Kersten and
Boris Novikov and
Jens Teubner and
Stefan Manegold},
title = {EDBT 2009, 12th International Conference on Extending Database
Technology, Saint Petersburg, Russia, March 24-26, 2009,
Proceedings},
booktitle = {EDBT},
publisher = {ACM},
series = {ACM International Conference Proceeding Series},
volume = {360},
year = {2009},
isbn = {978-1-60558-422-5},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
So as you can see, we have a problem. The formats are not consistent, which means that if I need to get some references from DBLP, and others from the ACM, my references file is going to look very irregular.

Other critiques:
• I have never understood why DBLP splits up the conference and the paper: with BibTeX, if you cite three or more papers that use the same crossref, the crossref is included itself as a reference, which is just strange.
• Unless you use double curly braces, capitalizations inside a string get removed, which is mucho annoying: It's "Riemannian", not "riemannian".
• The DBLP name for the conference is too cryptic: who'd even know what EDBT is outside the database community. On the other hand, the ACM citation is clunky, and is a page-length disaster waiting to happen.
Thoughts ?

## Thursday, July 09, 2009

### Quick note

(am posting this from 37000 ft. isn't technology cool ?)

The clustering series will be on hiatus for a few days while I wrap up some more pressing deadlines. It will restart two weeks from now.

Congratulations to Adam Smith and Sean Hallgren on getting a PECASE award. See what creating a new blog can do !

### NSF EDA Workshop: Wrap up

Today morning was wrap up day. The theory group was "tasked" with coming up with a few slides suggested avenues for further cooperation. Dick talked about creating (or re-creating) an environment in academia where researchers are building chips and talking to other experts down the corridor (like theoreticians) rather than the 'planted' model where a theoretician is planted in a corporate design environment for 6 months or so.

I talked briefly about what I called "new" theory: the twin ideas of randomization and approximation that have been so powerful in algorithm design (even for intractable prolems), and how these play into the problems of dealing with massive high dimensional data.

Vijaya gave a synopsis of cache-obliviousness, multicore research, and her recent work in this area. There's a lot of interest in EDA and multicore work, and she made the argument that theoreticians are learning how to design efficient multicore algorithms and can be of great help in adapting EDA methods.

There were also four sub-panels that addressed various aspects of EDA. What to me was interesting that many of the panels brought up "EDA needs in theory" that matched some of the things we talked about. For example, there's a pressing need for methods that run linear (and even sublinear) time, tools that are incremental, in that they can progressively generate better and better solutions given more time, and tools that can parallelize well.

They also talked about providing benchmarks: for example, a 10,000 variable SAT instance and code that solves it. I thought (and said) that this would be an excellent way to get people dabbling in this area.

Randomization was a trickier matter. Although the EDA folks recognize the power of randomization, they are concerned about reproducibility. The DA pipelne is long and involved, and once you've fixed the output of a piece of the pipeline, you'd like to keep it in place and not change from iteration to iteration.

Of course this is something that can be fixed with seed control. For more portability, it's conceivable that you can cache the random coin tosses once you've settled on a reasonable solution to any piece in the pipeline.

Overall, it was quite interesting, although exhausting. The general idea with such workshops is that the findings make their way into the next call for proposals (sometime in October/November), so if you have EDA people you'd like to collaborate it, this might be a good opportunity.

It's time to go home.

### NSF EDA Workshop: A comment

Paul Beame had a brilliant comment on my post about the EDA Workshop, and I liked it so much I'm reproducing it in full here:

One fault is that we don't even teach the proper tools in our algorithms courses! There are relatively few algorithms texts that even support the teaching of backtracking as an algorithmic paradigm. Moreover, the sort of very nice work with provable small exponential upper bounds that some theory researchers (e.g., David Eppstein, Richard Beigel, Martin Furer) have done in backtracking algorithms for combinatorial problems is only a very small part of the landscape for algorithmic solutions to hard problems.

Too often, we in theory leave the important hard problems to those in AI and formal methods. One of the standard paradigms in practical verification is to attack problems that in their full generality are PSPACE-hard (or worse), find a useful NP subclass, apply a reduction from these problems to SAT, and use SAT solvers that involve worst-case exponential backtracking (or to a much lesser extent local search) algorithms that are fast in many cases. The key techniques used in these solvers are powerful heuristics that are usually studied in AI but rarely in algorithms research.

The potential for practical impact of improved versions of these algorithms (even with only polynomial speed-up) could be huge. Traditional algorithmic research has occasionally had something useful to say about these topics (such as Moser's recent beautiful analysis of variants of the random walk algorithm on the special classes of CNF formulas satisfying the Lovasz Local Lemma conditions) but it is somehow not viewed as part of the mainstream.

Applications in EDA are only part of the utility of some of the exponential algorithms used in formal methods. There has been a large and surprisingly successful body of work on software model checking that uses quite clever theoretical ideas. One of my favorite examples is the following: Though techniques for checking properties of communicating finite state machines (one of those nasty PSPACE-complete problems) had been in use for EDA since the early 1990's, model checking software presented an additional problem because of the need to handle the call stack. Suppose that one needs to check a safety property (i.e., an invariant). The key observation (which goes back to the early 1960's) is that the language consisting of the possible stack contents of a PDA is always a regular language for which an NFA can be easily constructed from the PDA! One can then apply a small extension the previous algorithm to check safety properties (which must hold for all possible stack contents).

## Wednesday, July 08, 2009

### NSF Workshop: Electronic Design Automation

I'm currently at an NSF Worshop on Electronic Design Automation (the thing that used to be called VLSI design). I'm here as part of the 'theory audience' along with Vijaya Ramachandran and Dick Lipton (blogger power!).

Thankfully, Dick has posted an extensive summary of the day of talks, so I don't have to. Our mandate here is to listen in on the discussions and come up with a response that suggests avenues where theory folk might have something useful to contribute.

From a purely biased algorithms perspective, one thing that strikes me about the EDA community (at least in the formal methods/verification realm) is their unwillingness to give up in the face of beyond-NP-completeness. What I mean is this: most of my training in algorithms (and this is likely true for you as well) is in the polynomial-time regime: all the algorithic paradigms we learn are effective at reducing the complexity of an algorithm from one polynomial to another.

When we engage with NP-hardness, we switch modes to approximations, and focus on the issue of quality: even the approximation algorithms themselves run in poly time. There are very few people (David Eppstein comes to mind) who work on algorithms in the exponential/subexponential realm and will worry about (say) reducing the base of the exponent for SAT or graph coloring or other hard problems.

The verification folks don't necessarily solve their (very hard) problems exactly, but they do design all kinds of tricks (and heuristics) to deal with these problems, because they actually need to solve them ! In my view, it wouldn't be a bad idea for students learning algorithms to learn at least a few tricks for designing algorithms that might run in exponential time, but are efficient. Remember that exponential might be better than n^100 for many values of n.

One thing that came to mind as I listened to talks. With the exception of a talk by Rupak Mazumdar on faulty computations, and a talk by Ed Clarke (yes, that Ed Clarke) on statistical model checking (based on the Neyman-Pearson hypothesis testing framework), there was little talk of the role that randomization might have to play in the problems of EDA.

A second thought was how the lessons of massive data analysis might be useful in the realm of DA. One speakr described one critical problem as being the degree of complexity associated with current DA tools: there are over 4000 "knobs" to turn in one such tool ! It's believed that these knobs are not independent, and might even be contradictory. If we think of each "run" of the DA tool, outputing some kind of chip layout, as a point in this 4000+ dimensional space, I wonder whether techniques for dimensionality reduction and manifold analysis might be useful to find a set of "core knobs" that control the process.

I have to say that it's nice to attend a workshop with a community that throws out terms like NP-Complete, \Sigma_2, and PSPACE so freely :).

## Friday, July 03, 2009

### FOCS 2009 results.

The FOCS list is out, (with abstracts). 73 papers accepted. Some papers that popped up on my radar screen (post more in the comments since I'll probably miss many good ones):
• Matt Gibson and Kasturi Varadarajan. Decomposing Coverings and the Planar Sensor Cover Problem.

This paper continues a line of work that I like to think we started in a SODA paper a few years ago. The problem is this: you're given a collection of sensors that monitor a region. But they have limited battery life, and they're also redundant (many different subsets of sensors have the range to cover the region). Can you schedule the sensors to go on and off so that at all times, the entire region is covered, and the region is covered for as long as possible ?

Early work on this problem formulated it in a combinatorial setting, which immediately led to things like log n-approximability lower bounds via set cover variants. Our belief was that the geometry of the regions would make things a bit easier. We made some small progress, getting a sublogarithmic approximation ratio for intervals on the line, and a constant factor for ranges with special structure. Subsequent work made steady improvements, and this work has brought things down to a constant for general classes of regions in the plane
• Saugata Basu and Thierry Zell. Polynomial hierarchy, Betti numbers and a real analogue of Toda's Theorem

Betti number computation has been a hot topic in computational geometry of late, but the idea of using Betti numbers to bound the (algebraic complexity) of basic geometric operations dates back to work by Dobkin and Lipton that relates the complexity of certain computations to the number of connected components of "invariant paths" in a computation. Ben-Or generalized these results to more complicated algebraic computation trees, and noting that "number of connected components" is the zero-th Betti number of the set of computations, asked whether higher-order Betti numbers might be useful for proving lower bounds. A series of results by Bjorner, Lovasz and Yao, Bjorner and Lovasz, and finally Yao showed that indeed the higher-order Betti numbers can be used to lower bound the complexity of algebraic computations.

So Betti numbers are important as a measure of "counting" in algebraic complexity. This paper reinforces this intuition, by relating #P over the reals to the complexity of computing Betti numbers over semi-algebraic sets. Interestingly, this paper mentions the difficult nature of Toda's original proof, and Lance just recently tweeted a new very simple proof of Toda's theorem that he just published in ToC.

• David Arthur, Bodo Manthey and Heiko Roeglin. k-Means has Polynomial Smoothed Complexity

Read my post on k-means for more on the significance of such results. This paper finally resolves the smoothed complexity issue, giving a polynomial bound (expected).

• Peyman Afshani, Jeremy Barbay and Timothy M. Chan. Instance-Optimal Geometric Algorithms

An interesting result that seems to be able to take away order-dependence from some geometric problems.

• T.S. Jayram and David Woodruff. The Data Stream Space Complexity of Cascaded Norms

Cascaded norms (where you compute one aggregate on each of a collection of streams, and compute another aggregate on these aggregates) are tricky for streaming algorithms, and this paper provides some interesting results. Again, I'm waiting for the document to show up, and I'd be interesting in seeing the kinds of techniques they use.

### Clustering: k-means...

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

k-means (also known as Lloyd's algorithm) is probably the most well known clustering algorithm, and as such, deserves some discussion of its own. We've already seen that it's part of the "I don't like you" family of clustering formulations. How does the method work ?

We start with a set of points in R^d. The algorithm is really simple: Fix a collection of k centers. Now perform the following alternating optimization till you reach a steady state: (1) assign each point to its nearest center (2) for each group of points assigned to the same center, compute a new center by taking the centroid of the points.

More importantly, what problem does this algorithm solve ? The quick answer is: none ! Yes, that's right: k-means gives no answer with any provable guarantee for a generic set of points for any cost measure. Pretty dismal, isn't it ? And yet, it's a really popular algorithm, a state of affairs that generally drives theoreticians to tears. So what's the deal ? The answer, as always, lies in the details, and is a good story of how theoretical work has managed to make some sense of an algorithm that was notoriously hard to analyze properly.

The underlying problem that k-means attempts to solve is the one I mentioned in the last post: let a cluster cost be the sum of squared distances to the cluster center, and minimize the sum of cluster costs over all k-clusterings. Now here are some of the bad things that we CAN say about k-means:
• It might take exponential time to converge, even in the plane
• It can get stuck in a local minimum that has an arbitrarily bad cost.

Running Time:
Let's first consider the problem of guaranteeing running time. Although k-means can take exponential time to converge, a smoothed complexity result says that w.h.p, a perturbed input will converge in polynomial time (for those not familiar with the concept, the smoothed complexity of a problem is its (expected) running time after the inputs have been perturbed randomly by a small amount). The smoothed complexity results suggest one reason why k-means converges quickly "in practice". The kinds of careful constructions (high dimensional, and low dimensional) needed to force super polynomial convergence times are very artificial.

A crucial initialization:

So what about the quality of the results produced ? One unspecified aspect of the k-means algorithm is how the initial set of k centers is created. Two recent works have indicated that the key to extracting good performance from k-means is by being very careful with this initial choice.The idea is very neat, and really should be getting more play in the experimental community than I think it does.

Recall the Gonzalez algorithm for the k-center problem: pick a point, and then its furthest neighbor, and then the point whose closest neighbor in this set is as far as possible, and so on. Consider the following randomized variant:
Pick the next candidate center with probability proportional to its minimum distance squared from the current set of clusters.
Papers by Arthur and Vassilvitski, and by Ostrovksy, Rabani, Swamy and Schulman both show that this simple seeding strategy allows for provable guarantees on the quality of the solution returned by k-means. The actual results are slightly different, and give two different views of the behaviour of the algorithm:
1. The Ostrovsky et al paper starts with the assumption that the point set is epsilon-separated: the intuition behind this definition is that the cost of a k-clustering is significantly better than the cost of a (k-1)-clustering (controlled by a parameter epsilon). Under these conditions, they show that the initial seeding gives a PTAS for the clustering problem.

I'm eliding a number of details about their algorithm. One interesting feature of their algorithm is what they do after the initial seeding: rather than run Lloyd's iterative step, they fix a radius around each center (different for each center) and compute the centroid of the portion of the set within this ball. They also use a more complicated procedure to convert this into a PTAS, and in my mind that makes their approach a lot less practical.

2. The Arthur/Vassilvitski paper starts with the same initialization procedure, but they assume no epsilon-separation result. Their algorithm is simply: run the initialization, and then run regular k-means. The guarantee they get is weaker (a log k approximation ratio), but they also provide empirical results showing the superiority of their approach compared to standard k-means. I think it'd be interesting to compare their approach with the 'ball k-means' approach by Ostrovsky et al to see which work better on benchmark data sets.

Another pleasing property of the k-means algorithm is its relationship to mixture modelling and the EM algorithm. I'll have more to say about EM later on, but the gist of the matter is this: The twin steps of finding new centroids and assigning points to nearest neighbours have direct analogs in the world of maximum likelihood concepts and distribution-generated data. There is a general procedure that takes a given exponential family (Gaussian, Poisson and the like), and generates a distance function d such that the density described by the distribution can be expressed as exp(-d(x, mu)), where mu is the mean of the distribution, and x is the point whose probability we're evaluating. ML folks will recognize the Bregman distances here, and the punchline is that for Gaussian distributions, the corresponding distance function is squared Euclidean distance (the exact function used in the k-means evaluation).

What does this all mean ? It means that the k-means process has a semantic meaning in the context of clustering data drawn from a mixture of Gaussians: what's even neater is that a k-means-like algorithm works for data drawn from other distributions, as long as you replace squared Euclidean distance by the appropriate (Bregman) distance function.

Apart from the simplicity of the algorithm, there are some aspects that make it attractive, regardless of lack of guaranteed bounds. From a heuristic-design standpoint, k-means is a particularly atractive example of a technique that's often called alternating optimization.

Notes:
• I was asked if there were standard data sets for benchmarking clustering algorithms. I don't often see papers doing this in a systematic manner, but a good source of data sets for experimentation is the UCI Machine Learning Repository: there aren't too many clustering data sets, but they can be found here (look for clustering under the default task).
Next up: hierarchical methods...