Friday, June 03, 2005

Sums of squares in a stream

It's time to talk a little about theorems. Fortunately, these are fun theorems. Over the last few years I've worked a lot in the area of data streams. In this model, the input is provided to you in some arbitrary order, and you must compute some function of this input using space, and time per item, that are sublinear in (preferably polynomial in the logarithm of) the size of the input. Suresh already talked about some of the problems after the "AMS" paper very deservedly won the Gödel Prize this year. Let me give some insight into the solutions.

Consider the (very abstract, but surprisingly widely applicable) problem of computing the sum of the squares of the input. If the input is given simply as a sequence of values, then this problem is easy: keep a counter, and for every new update x, add x*x to the counter.
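
For concreteness, here's a minimal sketch of that easy case in Python (the function name and the sample values are just for illustration):

    def sum_of_squares(stream):
        total = 0
        for x in stream:
            total += x * x   # constant space, constant work per update
        return total

    print(sum_of_squares([3, 1, 4, 1, 5]))   # 9 + 1 + 16 + 1 + 25 = 52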

But now suppose the stream defines the input in a different way: we see a sequence of items, and we want to track the sum of the squares of the number of occurrences of each item. So now every update is an item i, and we want to compute the sum, over all distinct items, of the square of the number of times that item appears. For example, in the stream a, b, a, c, a, b, the item a appears three times, b twice and c once, so the answer is 9 + 4 + 1 = 14.

If the number of different items isn't too large, then again we don't have a problem: just keep a counter for each item, and increment it when we see a new copy of that item in the stream. But in many settings (IP networks being the motivating example) both the size of the data and the domain of the items can be very large, much bigger than the space we are willing to use.
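
When the domain is small enough, that exact approach is equally short; here's a rough sketch (again in Python, with names of my own choosing):

    from collections import defaultdict

    def exact_sum_of_squared_counts(stream):
        counts = defaultdict(int)          # one counter per distinct item
        for item in stream:
            counts[item] += 1
        return sum(c * c for c in counts.values())

    # a appears 3 times, b twice, c once: 9 + 4 + 1 = 14
    print(exact_sum_of_squared_counts(["a", "b", "a", "c", "a", "b"]))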

Think about this problem without knowing that there is a neat solution, and you might conclude that it can't be done. When you see a new item, it seems that you have to know how many times it has been seen before in order to make a good estimate of the sum of squares. This turns out not to be the case.

We can build a good estimator for the sum of squares using a beautiful randomized technique due to Alon, Matias and Szegedy, which has since been adapted for a wide variety of other problems. And we still only need a single counter! (Well, sort of...)

Keep a counter z. For every item i that arrives in the stream, compute h(i), where h is a hash function that maps each item onto either +1 or -1. Add this value to z. When we want to estimate the sum of the squares, compute z*z.
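
A single copy of this estimator might look like the following rough sketch. I'm using Python's built-in hash combined with a random seed as a stand-in for the +1/-1 hash function; the real construction needs a hash family with particular independence properties, which I'll come back to below.

    import random

    class AMSCounter:
        """One copy of the basic estimator: a single counter z."""

        def __init__(self):
            # Choosing a random seed plays the role of choosing a random h.
            self.seed = random.randrange(2**31)
            self.z = 0

        def _h(self, item):
            # Stand-in for a hash function mapping each item to +1 or -1.
            return 1 if hash((item, self.seed)) & 1 else -1

        def update(self, item):
            self.z += self._h(item)

        def estimate(self):
            return self.z * self.z

    sketch = AMSCounter()
    for item in ["a", "b", "a", "c", "a", "b"]:
        sketch.update(item)
    print(sketch.estimate())   # an unbiased, but noisy, estimate of 14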

How does this work? Well, the final value of z after seeing the whole stream is equal to the sum, over all items i, of the frequency of i times h(i). When we square this, we get, for each i, the frequency of i squared times h(i) squared, plus cross terms with factors of h(i)h(j) for i not equal to j.

Because h(i) is always +1 or -1, the h(i) squared terms are all 1. Meanwhile, if h is "sufficiently random", then each cross term has expectation zero. So, in expectation, the estimate has exactly the right value!
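
Written out (in LaTeX notation, with f_i denoting the frequency of item i), the calculation in the last two paragraphs is just:

    z = \sum_i f_i h(i)

    z^2 = \sum_i f_i^2 h(i)^2 + \sum_{i \neq j} f_i f_j h(i) h(j)

    E[z^2] = \sum_i f_i^2 + \sum_{i \neq j} f_i f_j E[h(i) h(j)] = \sum_i f_i^2

since h(i)^2 = 1 always, and E[h(i)h(j)] = 0 for distinct i and j whenever h is (at least) pairwise independent.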

To turn this into a good estimator, and to ensure that it can be implemented in small space, one needs to do some more work. Firstly, a short calculation shows that the variance of this estimator is bounded in terms of the square of its expectation. Then, uttering the magic words "Markov, Chernoff, Chebyshev!", we can argue that with enough repetitions of this procedure a very accurate estimate can be built. We can also argue that the hash functions h() do not have to be "fully independent" (since representing a truly random hash function would require linear space, defeating the point), but need only very weak independence (four-wise independence suffices), meaning they can be drawn from a family of hash functions with a small representation.
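
To make that concrete, here is a rough end-to-end sketch: each hash function is drawn from a four-wise independent family (a random degree-3 polynomial over a prime field), estimates are averaged within groups to reduce the variance, and a median is taken across the groups to boost the confidence. The overall structure follows the standard AMS recipe, but the parameter choices and the code itself are my own illustration, not a tuned implementation.

    import random
    import statistics
    from collections import Counter

    PRIME = (1 << 61) - 1    # a Mersenne prime, convenient for modular hashing

    class FourWiseHash:
        """Maps an integer to +1 or -1 via a random degree-3 polynomial
        over a prime field (a standard four-wise independent family)."""

        def __init__(self):
            self.coeffs = [random.randrange(PRIME) for _ in range(4)]

        def __call__(self, x):
            v = 0
            for c in self.coeffs:            # evaluate the polynomial at x
                v = (v * x + c) % PRIME
            return 1 if v & 1 else -1        # keep only the low bit as the sign

    class AMSSketch:
        """Estimates the sum of squared frequencies of an integer stream."""

        def __init__(self, groups=5, copies_per_group=16):
            self.hashes = [[FourWiseHash() for _ in range(copies_per_group)]
                           for _ in range(groups)]
            self.counters = [[0] * copies_per_group for _ in range(groups)]

        def update(self, item):
            # One +1/-1 update to every counter (items assumed to be integers).
            for hrow, crow in zip(self.hashes, self.counters):
                for j, h in enumerate(hrow):
                    crow[j] += h(item)

        def estimate(self):
            # Mean within each group reduces the variance; the median across
            # groups drives up the probability that the answer is accurate.
            means = [statistics.mean(z * z for z in row) for row in self.counters]
            return statistics.median(means)

    stream = [random.randrange(100) for _ in range(10000)]
    exact = sum(c * c for c in Counter(stream).values())
    sketch = AMSSketch()
    for x in stream:
        sketch.update(x)
    print("exact:", exact, "estimate:", round(sketch.estimate()))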

This algorithm is not only beautiful for its simplicity and elegance; it is also the archetype for many other streaming algorithms that came after it. Although a variety of techniques have been developed for this small-space, one-pass model of computation, the ideas of linear transformations (since this is essentially computing a linear transformation of the input), hash functions with limited independence, and careful probabilistic analysis have been vital to many algorithms.

The approach is also surprisingly practical. The number of repetitions needed to get a very accurate summary is not too large, and with some more implementation tricks and fast implementations of hash functions, the data structures can be updated at very high rates -- on the order of millions of updates per second.
