The Geomblog: Streaming Algorithms

Monday, May 23, 2005

Streaming Algorithms

This year's Gödel Prize went to Noga Alon, Yossi Matias and Maro Szegedy for their 1996 paper titled 'The space complexity of approximating the frequency moments". Why ? What was so great about this paper (which we shall call the 'AMS' paper) ?

If you study databases for a living, or work with large data sets, you will know that the amounts of data that computers are being asked to process is growing astronomically. Google crawls over 8 billion pages on a regular basis, and if you start monitoring data on networks, you are looking at gigabytes of fresh data streaming in every day. If you're Walmart for example, you might have sales data showing up every minute, and if you're an astronomer observing on a telescope, again you get humungous (ed: 'humungous' is a technical term that translates as 'really a lot of stuff, and I am too lazy to figure out exactly how much') amounts of data a constant basis.

Most of this data cannot be stored; it is just infeasible, even with storage as cheap as it is, to store everything you get. More often than not, you want to compute a small summary of the data and throw the rest away. Even if you are willing to archive large portions of your data, it is too large to do careful processing over. Most often you will scan the data ONCE, and retain whatever you can maintain usefully. As database folks know well, accessing a disk is a lot cheaper if you do it in sequential order.

Remember though, that memory in your computer is nowhere near as large as your disk. A state of the art PC might have 4GB of RAM, and over 500 GB of disk, and servers have even more disk space, with not necessarily much more RAM. So here's the problem:

If I can't even read my data into memory to process, or I am drinking at this hose that's blasting data at me, how can I compute summaries efficiently with only a small amount of main memory ?

This is the prime motivating question for the area of streaming algorithms, an area of frenetic research in algorithms and databases over the past 10 years. For a survey, read this tech report by S. Muthukrishnan.

The AMS paper looked at the following simple problems on a stream of numbers:

Can I count the number of unique items ?
Can I estimate the variance of the items ?

In general, if I associate the number f_i with the frequency of the number i, can I compute functions of the form F_k = ∑ f_i^k, where F₀ thus denotes the number of unique items, F₁ the sum, F₂ the sum of squares (which can be used to compute variance) and so on.

You might think it's easy to count F₀ and F₁. You'd be partly right. I can count F₁ by adding up all the items as they come in. I need only space to store the sum. But if I want to count the number of unique items, I have to maintain one memory location for each new item that comes in, and that could mean needing n memory locations for n items, which violates the rules of this game (memory used has to be MUCH LESS than the size of the data).

In general, for frequency counts other than F₁, I am in a bind. I can't maintain all the frequencies f_i, because that would need me to maintain as much space as there were distinct items (and in a worst case analysis this is proportional to the size of the data). So what can I do ?

This is the space of problems attacked by AMS: their main contributions were:

Good news: F₂ can be approximately computed in logarithmic space; in other words, proportional to the space required to store a single element of the stream
Bad news: if you are not willing to toss coins (use a randomized algorithm) and be content with an approximate solution, you will need space linear in the input stream.
Tight results for different k.

Their paper was the first to use communication complexity to prove lower bounds on stream algorithms. This is too long a topic to get into now, but it is one of the most powerful methods for proving limits on streaming, and most current lower bounds use this idea.

This paper, along with later work by Henzinger, Rajagopalan and Raghavan and Feigenbaum, Kannan, Strauss and Vishwanathan, laid the foundations for the entire field of streaming algorithms. The time was right in the middle 90s for streaming to emerge as a real field with real applications. Interestingly, some of the nascent ideas date back much further (to a paper by Flajolet and Martin in 1983 for computing F₀, and a paper by Morris on computing F₁ in log log n space (!) in 1978, in the CACM-as-a-good-technical-publication era).

The field has expanded greatly since AMS, but the problems they studied are still of great interest. In fact at this very STOC, a paper by Indyk and Woodruff closes the final open problems left behind by the AMS paper.

Tags: streaming, frequency, moments

2 comments:

Anonymous5/31/2005 10:24:00 PM
Streaming algorithms also have applications in network intrusion detection systems (IDS). There's been some interesting work in the area recently. I suspect there's more work waiting if the right mapping between IDS problems and problems with good streaming algorithms can be worked out. I wrote a short overview here
http://www.livejournal.com/users/ephermata/71782.html

The key is finding a problem that IDS consumers need solved quickly, though. That seems harder to pick out without more knowledge of IDS.

Posted by David Molnar
ReplyDelete
Replies
Anonymous5/31/2005 10:38:00 PM
There is some interesting work combining streaming algorithms with network intrusion detection systems. See
http://www.livejournal.com/users/ephermata/71782.html

for some overview.

Posted by David Molnar
ReplyDelete
Replies

The Geomblog

Pages

Monday, May 23, 2005

Streaming Algorithms

2 comments:

Disqus for The Geomblog