Wednesday, July 25, 2012

Data Streaming in Dortmund: Day II

Andrew McGregor and I are tag-blogging the workshop. Read his post on day 1 here.

Day II could roughly be summarized as:

Sitting by the stream recovering from a sparse attack of acronyms. 

There were a number of talks on sparse recovery (a topic that includes compressed sensing), which asks the question: how can we best reconstruct a signal from linear probes if we know the signal is sparse?
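
To fix notation (my paraphrase of the standard setup): the unknown signal $x \in \mathbb{R}^n$ has at most $k$ nonzero entries, and we observe

$$ y = Ax, \qquad A \in \mathbb{R}^{m \times n}, \quad m \ll n, $$

with the goal of recovering $x$ (exactly or approximately) from $y$, for instance via the convex relaxation $\min_{\hat{x}} \|\hat{x}\|_1$ subject to $A\hat{x} = y$.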

Eric Price led things off with a discussion of his recent breakthroughs on sparse FFTs. While the standard FFT takes $n \log n$ time, it's reasonable to ask whether we can do better if we know the signal has only $k$ nonzero Fourier coefficients. He talked about a sequence of papers that do this "almost optimally", in that they improve on the $n\log n$ running time as long as the sparsity parameter $k = o(n)$.
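
To make "$k$ nonzero Fourier coefficients" concrete, here's a tiny numpy sketch of my own (not from the talk, and certainly not the sparse FFT algorithm itself): build a length-$n$ signal whose spectrum is exactly $k$-sparse, and confirm it with the ordinary $n \log n$ FFT.

```python
import numpy as np

# Build a signal with exactly k nonzero Fourier coefficients, then verify via
# the ordinary O(n log n) FFT. The sparse FFT algorithms recover the k
# coefficients much faster; this just illustrates what "Fourier-sparse" means.
n, k = 1024, 5
rng = np.random.default_rng(0)
freqs = rng.choice(n, size=k, replace=False)                   # k on-grid frequencies
coeffs = rng.standard_normal(k) + 1j * rng.standard_normal(k)

t = np.arange(n)
x = sum(c * np.exp(2j * np.pi * f * t / n) for c, f in zip(coeffs, freqs))

spectrum = np.fft.fft(x) / n                                   # entry f holds the coefficient of frequency f
support = np.flatnonzero(np.abs(spectrum) > 1e-8)
print(len(support), set(support) == set(freqs))                # 5 True
```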

Anna Gilbert provided an interesting counterpoint to this line of work. She argued that for real analog signals, the process of sensing and recording them (even if the original signal is extremely sparse) can lead to discrete signals with $\Theta(n)$ nonzero Fourier coefficients, and that this is in some way intrinsic to the sensing process. This was part of an attempt to explain why a long line of sublinear methods (including Eric's work) doesn't do very well on real signals. This line of attack is called "off the grid" reconstruction, because you're working off the (discrete) grid of a digital signal.
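
Here's a small numpy illustration of the leakage phenomenon as I understand it: a single tone whose frequency sits exactly on the DFT grid is 1-sparse, but shift the frequency by half a grid cell and essentially every one of the $n$ coefficients becomes non-negligible.

```python
import numpy as np

# Spectral leakage: a tone at an off-grid frequency is "1-sparse" as an analog
# signal, but its sampled version has energy smeared across the whole spectrum.
n = 1024
t = np.arange(n)

on_grid  = np.exp(2j * np.pi * 50.0 * t / n)   # frequency exactly on the DFT grid
off_grid = np.exp(2j * np.pi * 50.5 * t / n)   # frequency halfway between grid points

def big_coeffs(x, rel_thresh=1e-3):
    s = np.abs(np.fft.fft(x))
    return np.count_nonzero(s > rel_thresh * s.max())

print(big_coeffs(on_grid))    # 1
print(big_coeffs(off_grid))   # ~n: nearly every coefficient is non-negligible
```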

In sparse recovery, there are two parameters of interest: the number of measurements you make of the signal, and the time needed for reconstruction. Obviously, you'd like both to be as small as possible, and information-theoretic arguments show that $\Omega(k \log(n/k))$ measurements are necessary. Martin Strauss focused on speeding up the reconstruction time while maintaining measurement-optimality, in a setting known as the "for-all" formulation, where the adversary is allowed to pick the signal after seeing the probe matrix (there's a related "foreach" model that is subtly different, and I'm still a little confused about the two).
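
For a point of reference, here's a hedged sketch of the textbook baseline (mine, not the measurement-optimal "for-all" construction from the talk): random Gaussian measurements, a constant factor more than $k \log(n/k)$ of them, decoded with orthogonal matching pursuit.

```python
import numpy as np

# Sketch: recover a k-sparse x from m ~ C k log(n/k) random Gaussian
# measurements using orthogonal matching pursuit (OMP).
rng = np.random.default_rng(1)
n, k = 1000, 10
m = 6 * k * int(np.log(n / k))            # a comfortable constant factor above k log(n/k)

x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.standard_normal(k)

A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x

def omp(A, y, k):
    """Greedily build the support, refitting by least squares at each step."""
    residual, S = y.copy(), []
    for _ in range(k):
        S.append(int(np.argmax(np.abs(A.T @ residual))))    # most correlated column
        coef, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)  # refit on current support
        residual = y - A[:, S] @ coef
    xhat = np.zeros(A.shape[1])
    xhat[S] = coef
    return xhat

print(np.allclose(omp(A, y, k), x, atol=1e-6))   # True (with high probability over the randomness)
```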

On a slightly different note (but not that different, as it turns out), Atri Rudra talked about a fun problem: given the "shadows" of a set of 3D points on three orthogonal planes, how large could the point set be? If each shadow has $n$ points, it's well known that the right bound is $n^{3/2}$. While this bound was known, it wasn't constructive, and part of Atri's work was providing an actual construction. There are all kinds of interesting connections between this problem, join queries, triangle counting in graphs, and even the scary-hairy 3SUM, but that's a topic for another time.
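
My understanding of where $n^{3/2}$ comes from: the Loomis-Whitney inequality says a set whose three shadows each have $n$ points has at most $n^{3/2}$ points, and the $\sqrt{n} \times \sqrt{n} \times \sqrt{n}$ grid shows this is tight. A few lines of Python make the tight example concrete:

```python
import itertools

# The tight example: a sqrt(n) x sqrt(n) x sqrt(n) grid has n^{3/2} points,
# yet each of its three axis-aligned shadows is a sqrt(n) x sqrt(n) grid of n points.
s = 32                                   # so each shadow has n = s*s = 1024 points
points = set(itertools.product(range(s), repeat=3))

xy = {(x, y) for x, y, z in points}
yz = {(y, z) for x, y, z in points}
xz = {(x, z) for x, y, z in points}

print(len(points))                       # s^3 = n^{3/2} = 32768
print(len(xy), len(yz), len(xz))         # 1024 1024 1024
```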

Other talks in the afternoon: Alex Andoni talked about finding the eigenvalues of a matrix efficiently on streams (specifically, finding the "heavy" eigenvalues). I learnt about the Cauchy interlacing theorem from his talk - it's a neat result about how the eigenvalues of a symmetric matrix and those of its principal submatrices interlace. Ely Porat talked about the problem of preventing evil entities (Hollywood, say) from poisoning a BitTorrent stream of packets, and presented ideas involving homomorphic signatures for packets via error-correcting codes.
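
The interlacing statement is easy to check numerically; here's a quick sketch of mine for a random symmetric matrix and the principal submatrix obtained by deleting its last row and column.

```python
import numpy as np

# Cauchy interlacing: if B is the (n-1) x (n-1) principal submatrix of a
# symmetric n x n matrix A, then (with eigenvalues sorted in ascending order)
#   lambda_1(A) <= mu_1(B) <= lambda_2(A) <= ... <= mu_{n-1}(B) <= lambda_n(A).
rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2                        # random symmetric matrix
B = A[:-1, :-1]                          # drop the last row and column

lam = np.linalg.eigvalsh(A)              # eigenvalues of A, ascending
mu = np.linalg.eigvalsh(B)               # eigenvalues of B, ascending

print(np.all(lam[:-1] <= mu) and np.all(mu <= lam[1:]))   # True
```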

Joshua Brody returned to the basic streaming setup. Most stream algorithms that estimate some quantity introduce two-sided error (the true value could be above or below the reported estimate). He asked whether this is necessary to stay within sublinear space bounds: it turns out that for some problems, limiting yourself to one-sided error worsens the space complexity needed to solve them (note that for sketches like the Count-Min sketch, the estimate is one-sided by design, so there's no deterioration).
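
To see why Count-Min estimates are one-sided, here's a minimal sketch implementation of mine (parameters made up for the example): hash collisions only ever add to a counter, so the reported count is never below the true count.

```python
import random

# Minimal Count-Min sketch: depth hash rows, each of the given width.
# Estimates are one-sided: collisions inflate counters, never deflate them.
class CountMin:
    def __init__(self, width=512, depth=4, seed=0):
        r = random.Random(seed)
        self.width, self.depth = width, depth
        self.salts = [r.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        return [(row, hash((salt, item)) % self.width)
                for row, salt in enumerate(self.salts)]

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))

cm = CountMin()
stream = ["a"] * 100 + ["b"] * 10 + [f"noise{i}" for i in range(5000)]
for item in stream:
    cm.add(item)
print(cm.estimate("a") >= 100, cm.estimate("b") >= 10)   # always True: never an undercount
```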

Coming up next: Andrew's summary of day three, and a report on our heroic tree-swinging adventures.
