Thursday, July 09, 2009

Quick note

(am posting this from 37000 ft. isn't technology cool ?)

The clustering series will be on hiatus for a few days while I wrap up some more pressing deadlines. It will restart two weeks from now.

Congratulations to Adam Smith and Sean Hallgren on getting a PECASE award. See what creating a new blog can do !

NSF EDA Workshop: Wrap up

Today morning was wrap up day. The theory group was "tasked" with coming up with a few slides suggested avenues for further cooperation. Dick talked about creating (or re-creating) an environment in academia where researchers are building chips and talking to other experts down the corridor (like theoreticians) rather than the 'planted' model where a theoretician is planted in a corporate design environment for 6 months or so.

I talked briefly about what I called "new" theory: the twin ideas of randomization and approximation that have been so powerful in algorithm design (even for intractable prolems), and how these play into the problems of dealing with massive high dimensional data.

Vijaya gave a synopsis of cache-obliviousness, multicore research, and her recent work in this area. There's a lot of interest in EDA and multicore work, and she made the argument that theoreticians are learning how to design efficient multicore algorithms and can be of great help in adapting EDA methods.

There were also four sub-panels that addressed various aspects of EDA. What to me was interesting that many of the panels brought up "EDA needs in theory" that matched some of the things we talked about. For example, there's a pressing need for methods that run linear (and even sublinear) time, tools that are incremental, in that they can progressively generate better and better solutions given more time, and tools that can parallelize well.

They also talked about providing benchmarks: for example, a 10,000 variable SAT instance and code that solves it. I thought (and said) that this would be an excellent way to get people dabbling in this area.

Randomization was a trickier matter. Although the EDA folks recognize the power of randomization, they are concerned about reproducibility. The DA pipelne is long and involved, and once you've fixed the output of a piece of the pipeline, you'd like to keep it in place and not change from iteration to iteration.

Of course this is something that can be fixed with seed control. For more portability, it's conceivable that you can cache the random coin tosses once you've settled on a reasonable solution to any piece in the pipeline.

Overall, it was quite interesting, although exhausting. The general idea with such workshops is that the findings make their way into the next call for proposals (sometime in October/November), so if you have EDA people you'd like to collaborate it, this might be a good opportunity.

It's time to go home.

NSF EDA Workshop: A comment

Paul Beame had a brilliant comment on my post about the EDA Workshop, and I liked it so much I'm reproducing it in full here:

One fault is that we don't even teach the proper tools in our algorithms courses! There are relatively few algorithms texts that even support the teaching of backtracking as an algorithmic paradigm. Moreover, the sort of very nice work with provable small exponential upper bounds that some theory researchers (e.g., David Eppstein, Richard Beigel, Martin Furer) have done in backtracking algorithms for combinatorial problems is only a very small part of the landscape for algorithmic solutions to hard problems.

Too often, we in theory leave the important hard problems to those in AI and formal methods. One of the standard paradigms in practical verification is to attack problems that in their full generality are PSPACE-hard (or worse), find a useful NP subclass, apply a reduction from these problems to SAT, and use SAT solvers that involve worst-case exponential backtracking (or to a much lesser extent local search) algorithms that are fast in many cases. The key techniques used in these solvers are powerful heuristics that are usually studied in AI but rarely in algorithms research.

The potential for practical impact of improved versions of these algorithms (even with only polynomial speed-up) could be huge. Traditional algorithmic research has occasionally had something useful to say about these topics (such as Moser's recent beautiful analysis of variants of the random walk algorithm on the special classes of CNF formulas satisfying the Lovasz Local Lemma conditions) but it is somehow not viewed as part of the mainstream.

Applications in EDA are only part of the utility of some of the exponential algorithms used in formal methods. There has been a large and surprisingly successful body of work on software model checking that uses quite clever theoretical ideas. One of my favorite examples is the following: Though techniques for checking properties of communicating finite state machines (one of those nasty PSPACE-complete problems) had been in use for EDA since the early 1990's, model checking software presented an additional problem because of the need to handle the call stack. Suppose that one needs to check a safety property (i.e., an invariant). The key observation (which goes back to the early 1960's) is that the language consisting of the possible stack contents of a PDA is always a regular language for which an NFA can be easily constructed from the PDA! One can then apply a small extension the previous algorithm to check safety properties (which must hold for all possible stack contents).

Wednesday, July 08, 2009

NSF Workshop: Electronic Design Automation

I'm currently at an NSF Worshop on Electronic Design Automation (the thing that used to be called VLSI design). I'm here as part of the 'theory audience' along with Vijaya Ramachandran and Dick Lipton (blogger power!).

Thankfully, Dick has posted an extensive summary of the day of talks, so I don't have to. Our mandate here is to listen in on the discussions and come up with a response that suggests avenues where theory folk might have something useful to contribute.

From a purely biased algorithms perspective, one thing that strikes me about the EDA community (at least in the formal methods/verification realm) is their unwillingness to give up in the face of beyond-NP-completeness. What I mean is this: most of my training in algorithms (and this is likely true for you as well) is in the polynomial-time regime: all the algorithic paradigms we learn are effective at reducing the complexity of an algorithm from one polynomial to another.

When we engage with NP-hardness, we switch modes to approximations, and focus on the issue of quality: even the approximation algorithms themselves run in poly time. There are very few people (David Eppstein comes to mind) who work on algorithms in the exponential/subexponential realm and will worry about (say) reducing the base of the exponent for SAT or graph coloring or other hard problems.

The verification folks don't necessarily solve their (very hard) problems exactly, but they do design all kinds of tricks (and heuristics) to deal with these problems, because they actually need to solve them ! In my view, it wouldn't be a bad idea for students learning algorithms to learn at least a few tricks for designing algorithms that might run in exponential time, but are efficient. Remember that exponential might be better than n^100 for many values of n.

One thing that came to mind as I listened to talks. With the exception of a talk by Rupak Mazumdar on faulty computations, and a talk by Ed Clarke (yes, that Ed Clarke) on statistical model checking (based on the Neyman-Pearson hypothesis testing framework), there was little talk of the role that randomization might have to play in the problems of EDA.

A second thought was how the lessons of massive data analysis might be useful in the realm of DA. One speakr described one critical problem as being the degree of complexity associated with current DA tools: there are over 4000 "knobs" to turn in one such tool ! It's believed that these knobs are not independent, and might even be contradictory. If we think of each "run" of the DA tool, outputing some kind of chip layout, as a point in this 4000+ dimensional space, I wonder whether techniques for dimensionality reduction and manifold analysis might be useful to find a set of "core knobs" that control the process.

I have to say that it's nice to attend a workshop with a community that throws out terms like NP-Complete, \Sigma_2, and PSPACE so freely :).

Friday, July 03, 2009

FOCS 2009 results.

The FOCS list is out, (with abstracts). 73 papers accepted. Some papers that popped up on my radar screen (post more in the comments since I'll probably miss many good ones):
  • Matt Gibson and Kasturi Varadarajan. Decomposing Coverings and the Planar Sensor Cover Problem.

    This paper continues a line of work that I like to think we started in a SODA paper a few years ago. The problem is this: you're given a collection of sensors that monitor a region. But they have limited battery life, and they're also redundant (many different subsets of sensors have the range to cover the region). Can you schedule the sensors to go on and off so that at all times, the entire region is covered, and the region is covered for as long as possible ?

    Early work on this problem formulated it in a combinatorial setting, which immediately led to things like log n-approximability lower bounds via set cover variants. Our belief was that the geometry of the regions would make things a bit easier. We made some small progress, getting a sublogarithmic approximation ratio for intervals on the line, and a constant factor for ranges with special structure. Subsequent work made steady improvements, and this work has brought things down to a constant for general classes of regions in the plane
  • Saugata Basu and Thierry Zell. Polynomial hierarchy, Betti numbers and a real analogue of Toda's Theorem

    Betti number computation has been a hot topic in computational geometry of late, but the idea of using Betti numbers to bound the (algebraic complexity) of basic geometric operations dates back to work by Dobkin and Lipton that relates the complexity of certain computations to the number of connected components of "invariant paths" in a computation. Ben-Or generalized these results to more complicated algebraic computation trees, and noting that "number of connected components" is the zero-th Betti number of the set of computations, asked whether higher-order Betti numbers might be useful for proving lower bounds. A series of results by Bjorner, Lovasz and Yao, Bjorner and Lovasz, and finally Yao showed that indeed the higher-order Betti numbers can be used to lower bound the complexity of algebraic computations.

    So Betti numbers are important as a measure of "counting" in algebraic complexity. This paper reinforces this intuition, by relating #P over the reals to the complexity of computing Betti numbers over semi-algebraic sets. Interestingly, this paper mentions the difficult nature of Toda's original proof, and Lance just recently tweeted a new very simple proof of Toda's theorem that he just published in ToC.

  • David Arthur, Bodo Manthey and Heiko Roeglin. k-Means has Polynomial Smoothed Complexity

    Read my post on k-means for more on the significance of such results. This paper finally resolves the smoothed complexity issue, giving a polynomial bound (expected).

  • Peyman Afshani, Jeremy Barbay and Timothy M. Chan. Instance-Optimal Geometric Algorithms

    An interesting result that seems to be able to take away order-dependence from some geometric problems.

  • T.S. Jayram and David Woodruff. The Data Stream Space Complexity of Cascaded Norms

    Cascaded norms (where you compute one aggregate on each of a collection of streams, and compute another aggregate on these aggregates) are tricky for streaming algorithms, and this paper provides some interesting results. Again, I'm waiting for the document to show up, and I'd be interesting in seeing the kinds of techniques they use.

Papers I'd like to learn more about:


Comments ?

Clustering: k-means...

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

k-means (also known as Lloyd's algorithm) is probably the most well known clustering algorithm, and as such, deserves some discussion of its own. We've already seen that it's part of the "I don't like you" family of clustering formulations. How does the method work ?

We start with a set of points in R^d. The algorithm is really simple: Fix a collection of k centers. Now perform the following alternating optimization till you reach a steady state: (1) assign each point to its nearest center (2) for each group of points assigned to the same center, compute a new center by taking the centroid of the points.

More importantly, what problem does this algorithm solve ? The quick answer is: none ! Yes, that's right: k-means gives no answer with any provable guarantee for a generic set of points for any cost measure. Pretty dismal, isn't it ? And yet, it's a really popular algorithm, a state of affairs that generally drives theoreticians to tears. So what's the deal ? The answer, as always, lies in the details, and is a good story of how theoretical work has managed to make some sense of an algorithm that was notoriously hard to analyze properly.

The underlying problem that k-means attempts to solve is the one I mentioned in the last post: let a cluster cost be the sum of squared distances to the cluster center, and minimize the sum of cluster costs over all k-clusterings. Now here are some of the bad things that we CAN say about k-means:
  • It might take exponential time to converge, even in the plane
  • It can get stuck in a local minimum that has an arbitrarily bad cost.

Running Time:
Let's first consider the problem of guaranteeing running time. Although k-means can take exponential time to converge, a smoothed complexity result says that w.h.p, a perturbed input will converge in polynomial time (for those not familiar with the concept, the smoothed complexity of a problem is its (expected) running time after the inputs have been perturbed randomly by a small amount). The smoothed complexity results suggest one reason why k-means converges quickly "in practice". The kinds of careful constructions (high dimensional, and low dimensional) needed to force super polynomial convergence times are very artificial.


A crucial initialization:

So what about the quality of the results produced ? One unspecified aspect of the k-means algorithm is how the initial set of k centers is created. Two recent works have indicated that the key to extracting good performance from k-means is by being very careful with this initial choice.The idea is very neat, and really should be getting more play in the experimental community than I think it does.

Recall the Gonzalez algorithm for the k-center problem: pick a point, and then its furthest neighbor, and then the point whose closest neighbor in this set is as far as possible, and so on. Consider the following randomized variant:
Pick the next candidate center with probability proportional to its minimum distance squared from the current set of clusters.
Papers by Arthur and Vassilvitski, and by Ostrovksy, Rabani, Swamy and Schulman both show that this simple seeding strategy allows for provable guarantees on the quality of the solution returned by k-means. The actual results are slightly different, and give two different views of the behaviour of the algorithm:
  1. The Ostrovsky et al paper starts with the assumption that the point set is epsilon-separated: the intuition behind this definition is that the cost of a k-clustering is significantly better than the cost of a (k-1)-clustering (controlled by a parameter epsilon). Under these conditions, they show that the initial seeding gives a PTAS for the clustering problem.

    I'm eliding a number of details about their algorithm. One interesting feature of their algorithm is what they do after the initial seeding: rather than run Lloyd's iterative step, they fix a radius around each center (different for each center) and compute the centroid of the portion of the set within this ball. They also use a more complicated procedure to convert this into a PTAS, and in my mind that makes their approach a lot less practical.

  2. The Arthur/Vassilvitski paper starts with the same initialization procedure, but they assume no epsilon-separation result. Their algorithm is simply: run the initialization, and then run regular k-means. The guarantee they get is weaker (a log k approximation ratio), but they also provide empirical results showing the superiority of their approach compared to standard k-means. I think it'd be interesting to compare their approach with the 'ball k-means' approach by Ostrovsky et al to see which work better on benchmark data sets.


Another pleasing property of the k-means algorithm is its relationship to mixture modelling and the EM algorithm. I'll have more to say about EM later on, but the gist of the matter is this: The twin steps of finding new centroids and assigning points to nearest neighbours have direct analogs in the world of maximum likelihood concepts and distribution-generated data. There is a general procedure that takes a given exponential family (Gaussian, Poisson and the like), and generates a distance function d such that the density described by the distribution can be expressed as exp(-d(x, mu)), where mu is the mean of the distribution, and x is the point whose probability we're evaluating. ML folks will recognize the Bregman distances here, and the punchline is that for Gaussian distributions, the corresponding distance function is squared Euclidean distance (the exact function used in the k-means evaluation).

What does this all mean ? It means that the k-means process has a semantic meaning in the context of clustering data drawn from a mixture of Gaussians: what's even neater is that a k-means-like algorithm works for data drawn from other distributions, as long as you replace squared Euclidean distance by the appropriate (Bregman) distance function.

Apart from the simplicity of the algorithm, there are some aspects that make it attractive, regardless of lack of guaranteed bounds. From a heuristic-design standpoint, k-means is a particularly atractive example of a technique that's often called alternating optimization.

Notes:
  • I was asked if there were standard data sets for benchmarking clustering algorithms. I don't often see papers doing this in a systematic manner, but a good source of data sets for experimentation is the UCI Machine Learning Repository: there aren't too many clustering data sets, but they can be found here (look for clustering under the default task).
Next up: hierarchical methods...

Monday, June 29, 2009

SODA 2010 submission server

Remember: abstracts are due today at 5pm ET.

Sunday, June 28, 2009

Clustering: The "I don't like you" view

(this is part of an occasional series of essays on clustering: for all posts in this topic, click here)

Any clustering problem starts with a data set. And you can basically tell what algorithm you're going to need by asking, 'what structure do I have on the data ?'. The bare minimum you need is some way of taking two items and returning a number that tells you how similar they are, or more usefully, how far apart they are.

Of course, if I know that A and B are a certain distance apart, and B and C are a certain distance apart, I might like to infer something about the distance between A and C. This is what the triangle inequality gives you, and so the most elementary form of clustering takes items and a metric defined on these items (humans actually DON'T think about distances in this manner, as many psychological studies have shown: more on that later on).

I find it convenient to view clustering algorithms from a pop-psychology self-help perspective, and so I'll deem algorithms that operate with a distance metric the 'I don't like you' methods. The general goal with such methods is to group the items into "clusters" where there isn't too much hatred :).

There are many ways to quantify this "lack of hatred". The one most natural to theoreticians is the 'minimize the worst-case angst' viewpoint: in other words, find k clusters such that over all clusters, the maximum pairwise distance between points in a cluster is minimized. This is the well-known k-center problem.

If you don't like the lack of robustness (one point can mess up the cost), you could instead sum up all the pairwise distances to assign the cost for a cluster. This gives a variant of what's normally considered the k-median problem.

A third way of measuring cluster cost is to sum the squares of distances, rather than the distances themselves. This gives you a cost function that looks a lot like what k-means attempts to solve.

It's often useful to define a representative of a cluster: in the k-center view of the world, the center of a cluster is a point whose maximum distance to all points in the cluster is minimum. Note that by triangle inequality, this is within a factor of two of the k-center cluster cost, so it's often convenient to use this as the definition of the cluster cost (the cluster radius, instead of diameter, if you will). In the sum-of-all-costs view, the center correspondingly can be defined as as the point whose sum of all distances to all other points is minimized. Similarly for the case of sum-of-square-costs.

Remarks on complexity:
There is an elegant 2-approximation for the k-center problem by Gonzalez. The algorithm itself is the essence of simplicity: pick any point in the space, take its furthest neighbor, take the point furthest from these two, and so on. The first k points picked in this manner yield the desired cluster centers, and a simple triangle inequality argument yields the 2-approximation.

The k-median problem is a little harder to solve. The best known bound in a general metric space is a 3+eps approximation, and it's known that you can't get a PTAS. The approximation algorithm itself is not too hard: it's a local search heuristic that gets closer and closer to 3 as you permit larger and larger sets of centers to be swapped out.

Going to Euclidean space, or what's in a name ?
General metrics come either from some kind of similarity function between pairs of elements or from the induced shortest path metric on a graph (there are other metrics one can induce from a graph, and we'll talk about them later). But what if you have more information ? There's a general principle of clustering that's worth enunciating here: more structure on the data can lead you to more targeted algorithms that will probably work better, so resist the urge to be too general.

The most natural space to consider is a Euclidean space, and it is this space that explains the names 'median' and 'mean'. It's well known that on the line, the point minimizing the sum of distances to a set of points (the 1-median) is given by the actual median of the numbers. This explains why the problem of minimizing sum of distances is called the k-median problem. Similarly, the point that minimizes the sum of squared Euclidean distance is the centroid or the 'mean', giving rise to the k-means problem.

There's a long line of results on clustering under various measures in Euclidean space. In brief, for most of the above formulations, you can get a PTAS in Euclidean space, but you end up playing whack-a-mole with the parameters n, d, epsilon, and k (that is to say, something ends up exhibiting exponential dependence, and which parameter it is depends on what you care most about).

The Voronoi trick
But there's a general principle that governs much of the algorithm design for these problems. It's the so-called Voronoi trick, and can be summarized by pointing out that you can always improve the clustering cost by making sure that each point is assigned to its nearest neighbor. Effectively this means that the clusters define a Voronoi partition, with the center as a site, and all points within a Voronoi cell being assigned to the corresponding site. First of all, this means that the number of possible outcomes reduces from k^n (the number of ways of partitioning n things into k groups) to n^{kd} (since the number of Voronoi partitions is much smaller than the number of partitions). Secondly, it suggests a simple heuristic for improvement: Take a given clustering: reassign points so that each point is assigned to its nearest cluster center, and then recompute the center for the cluster associated with all points in the (erstwhile) Voronoi cell. Rinse and repeat....

This heuristic is the basis of the k-means algorithm, and is also the basis for a heuristic for the k-median problem clumsily named the k-medoids algorithm.

Notes:
  • all the above algorithms have the parameter k for the number of clusters. It's obvious why: all the above cost measures can be minimized by placing each item in a cluster by itself, but the resulting answer is meaningless, and "no man is an island"
  • In a general metric space defined over a finite set of points, the cluster centers will by default by elements of the input. This is often a desirable property, if you want representatives to be from the input. But in a smooth space like a Euclidean space, there's no reason to limit yourselves to data points, and this creates a dichotomy between the so-called discrete and continuous variants of a clustering problem. Often, you can use triangle inequality to argue that the answers don't differ significantly in cost, but it's important to ask yourself which variant you care about. The discrete problem can often be "easier" to solve, because you can enumerate over the choices, but this "easier" can be expensive: for example, minimizing the sum of squares is easy in a (continuous) Euclidean space, but is more expensive in a (discrete) metric space.
  • For some unknown reason, it's common to describe clustering problems by the algorithm used to "solve" them, which causes no end of confusion. I lost count of the number of times I had to point out that k-means is an algorithm that attempts to optimize a specific cost function. This is more than just pedantry, because unless you realize the difference between a problem and a specific solution, you'll never be able to conceive of an alternate approach, and you'll never even try to abstract out what's special about your problem. To theoreticians, this might be obvious, but it isn't really obvious IRL.
Next up: everything you ever wanted to know about k-means.

p.s this is written in a much more breezy style than I'd use for any formal document, but the organization and thinking reflects how I'd like to structure the material. Comments, as always, are welcome.

Friday, June 26, 2009

"Bloggers", the media and science reporting.

Sometimes I read an article that I just cannot understand. The words make sense, and I can understand the logical flow, but the entire premise just escapes me. Consider the following exhibit: an article in Nature News on the "problem" of blogging from a conference. A snippet from the article (which I recommend reading in its entirety):
By disseminating scientific results far beyond the lecture hall, blogging and social networking blurs the line between journalists and researchers. Scientists in competitive fields may be more reluctant to discuss new findings if they can be posted on the Internet within seconds.
and then later:
MacArthur's comprehensive postings were read by many scientists but they irked journalists attending the meeting. The meeting rules stated that reporters had to seek permission from speakers before publishing material on their work, rules that Cold Spring Harbor instituted in part because some journals, such as Nature, discourage scientists from talking to the press before their work is published. But those rules didn't apply to scientist-bloggers like MacArthur and, after he posted details from a few talks, reporters contacted Stewart for clarification on the policies. The complaint was a wake-up call: "For the first time, I became aware that people were blogging about the data at the meeting," Stewart says.
If I understand this correctly, the premise of the article is that blogging from a conference is bad because it
  • "scoops" the poor journalists under embargo ?
  • disseminates information faster than someone might like ?
to which my only response is "HUH " ? Isn't the whole point of a PUBLIC presentation to disseminate results to - you kn0w- THE PUBLIC ? and how on earth does it matter how fast this happens ? And although I feel for the journalists who agree to weird embargo rules by control-freak journals, why on earth should I, as a researcher who happens to blog from conferences, care about this ?

Sometimes I think the media needs to get over itself....

Thursday, June 25, 2009

Clustering: An Occasional Series

Clustering is one of those areas that epitomizes the 'ooze' principle in data mining: it's 3 miles wide and an inch deep. Every year, copious ink get spilt on some clustering method or another, and yet it's not really clear what key ideas are driving the field forward.

I ran a seminar this spring on clustering. I wanted to see if there was a way of organizing material on clustering in a way that was broad without necessarily being exhaustive, but that hit the "eigenvectors" of the field, so to speak. I was surprised that for a topic as important and broad-based as clustering, there was little out there in the way of textbook-level source material to work from. Most texts that I did find were too specialized for my taste, and didn't have the kind of broad perspective of clustering that I felt was critical to really understanding the area.

So I organized my material, and was surprised to find that there actually was a way of collecting the base material on the topic in a way that could be covered in a semester format without leaving crucial pieces out. What I did not try to do:
  • Cover the details of the best/greatest/fastest algorithm for each problem. In other words, I made a conscious decision to shy away from theoretical overkill.
  • Try to cover all the variants of different clustering methods. I felt that as long as one knew what k-means was, and what spectral clustering was, there was no real need to study papers that tried to put them together in strange ways.
What I did try to do was:
  • Where possible, try to evaluate the practical significance of the methods (so a linear time clustering algorithm that actually had a running time doubly exponential in epsilon would not necessarily make the cut, unless it had some other redeeming features)
  • Try to view the methods from the perspective of the user: given some data, with some structure, what are your options in terms of clustering algorithms ?
The results of this organization can be found here. I was relatively happy with the outcome, and I think the students were too. In fact, many of the discussions in class were interesting enough that I put them together in a collection of notes.

What I want to do in this occasional series is sketch out the "top-level" view of clustering as I see it, and as inspired by our seminar. My vague idea was that these notes might eventually be converted into some kind of monograph, or at the very least a web document that others might find useful. So I'm going to "test-drive" the ideas here, in the hope that my very well-informed readers will add to the discussion, and improve it (as usually happens).

First up: the basic trinity - k-(center/median/means).