If you study databases for a living, or work with large data sets, you will know that the amounts of data that computers are being asked to process is growing astronomically. Google crawls over 8 billion pages on a regular basis, and if you start monitoring data on networks, you are looking at gigabytes of fresh data streaming in every day. If you're Walmart for example, you might have sales data showing up every minute, and if you're an astronomer observing on a telescope, again you get humungous (ed: 'humungous' is a technical term that translates as 'really a lot of stuff, and I am too lazy to figure out exactly how much') amounts of data a constant basis.
Most of this data cannot be stored; it is just infeasible, even with storage as cheap as it is, to store everything you get. More often than not, you want to compute a small summary of the data and throw the rest away. Even if you are willing to archive large portions of your data, it is too large to do careful processing over. Most often you will scan the data ONCE, and retain whatever you can maintain usefully. As database folks know well, accessing a disk is a lot cheaper if you do it in sequential order.
Remember though, that memory in your computer is nowhere near as large as your disk. A state of the art PC might have 4GB of RAM, and over 500 GB of disk, and servers have even more disk space, with not necessarily much more RAM. So here's the problem:
If I can't even read my data into memory to process, or I am drinking at this hose that's blasting data at me, how can I compute summaries efficiently with only a small amount of main memory ?This is the prime motivating question for the area of streaming algorithms, an area of frenetic research in algorithms and databases over the past 10 years. For a survey, read this tech report by S. Muthukrishnan.
The AMS paper looked at the following simple problems on a stream of numbers:
- Can I count the number of unique items ?
- Can I estimate the variance of the items ?
You might think it's easy to count F0 and F1. You'd be partly right. I can count F1 by adding up all the items as they come in. I need only space to store the sum. But if I want to count the number of unique items, I have to maintain one memory location for each new item that comes in, and that could mean needing n memory locations for n items, which violates the rules of this game (memory used has to be MUCH LESS than the size of the data).
In general, for frequency counts other than F1, I am in a bind. I can't maintain all the frequencies fi, because that would need me to maintain as much space as there were distinct items (and in a worst case analysis this is proportional to the size of the data). So what can I do ?
This is the space of problems attacked by AMS: their main contributions were:
- Good news: F2 can be approximately computed in logarithmic space; in other words, proportional to the space required to store a single element of the stream
- Bad news: if you are not willing to toss coins (use a randomized algorithm) and be content with an approximate solution, you will need space linear in the input stream.
- Tight results for different k.
This paper, along with later work by Henzinger, Rajagopalan and Raghavan and Feigenbaum, Kannan, Strauss and Vishwanathan, laid the foundations for the entire field of streaming algorithms. The time was right in the middle 90s for streaming to emerge as a real field with real applications. Interestingly, some of the nascent ideas date back much further (to a paper by Flajolet and Martin in 1983 for computing F0, and a paper by Morris on computing F1 in log log n space (!) in 1978, in the CACM-as-a-good-technical-publication era).
The field has expanded greatly since AMS, but the problems they studied are still of great interest. In fact at this very STOC, a paper by Indyk and Woodruff closes the final open problems left behind by the AMS paper.
Tags: streaming, frequency, moments