Sunday, May 19, 2013

Coding, Complexity and Sparsity 2013 (SPARC) 2013.

Atri Rudra reminds us that the 3rd incarnation of SPARC is soon to be upon us (disclaimer: I'll be speaking at the event):
Efficient and effective transmission, storage, and retrieval of information on a large-scale are among the core technical problems in the modern digital revolution. The massive volume of data necessitates the quest for mathematical and algorithmic methods for efficiently describing, summarizing, synthesizing, and, increasingly more critical, deciding when and how to discard data before storing or transmitting it. Such methods have been developed in two areas: coding theory, and sparse approximation (SA) (and its variants called compressive sensing (CS) and streaming algorithms). 
Coding theory and computational complexity are both well established fields that enjoy fruitful interactions with one another. On the other hand, while significant progress on the SA/CS problem has been made, much of that progress is concentrated on the feasibility of the problems, including a number of algorithmic innovations that leverage coding theory techniques, but a systematic computational complexity treatment of these problems is sorely lacking. The workshop organizers aim to develop a general computational theory of SA and CS (as well as related areas such as group testing) and its relationship to coding theory. This goal can be achieved only by bringing together researchers from a variety of areas. 
The Coding, Complexity and Sparsity workshop (SPARC 13) will be held in Ann Arbor, MI on Aug 5-7. 
These will be hour-long lectures designed to give students an introduction to coding theory, complexity theory/pseudo-randomness, and compressive sensing/streaming algorithms. We will have a poster session during the workshop and everyone is welcome to bring a poster but graduate students and postdocs are especially encouraged to give a poster presentation. 
This is the third incarnation of the workshop and the previous two workshops were also held in Ann Arbor in August of 2011 and 2012. 
Confirmed speakers:
* Jin Yi Cai (University of Wisconsin, Madison)
* Shafi Goldwasser (MIT)
* Piotr Indyk (MIT)
* Swastik Kopparty (Rutgers University)
* Dick Lipton (Georgia Tech)
* Andrew McGregor (University of Massachusetts, Amherst)
* Raghu Meka (IAS)
* Jelani Nelson    (Harvard)
* Eric Price (MIT)
* Christopher RĂ© (University of Wisconsin, Madison)
* Shubhangi Saraf (Rutgers University)
* Suresh Venkatasubramanian (University of Utah)
* David Woodruff (IBM)
* Mary Wootters    (Michigan)
* Shuheng Zhou (Michigan)

We have some funding for graduate students and postdocs with preference given to those who will be presenting posters. For registration and other details, please look at the workshop webpage: http://eecs.umich.edu/eecs/SPARC2013/

Friday, May 10, 2013

On GPU algorithms

Lance talks about GPU algorithms in his latest post:
The theory community hasn't seem to catch on yet. There should be some nice theoretical model that captures the vector and other operations of a GPGPU and then we should search for algorithms that make the best use of the hardware. The theory world writes algorithms for non-existent quantum computers but not for the machines that we currently use.
Well.

As someone who was doing GPU algorithms back when this meant flipping state bits on the fixed function pipeline (this is the GPU analog of "I wrote video games in assembly"), maybe a little perspective is in order.

Aside: I actually gave a talk on GPU algorithms at Chicago when Lance was there back in 2004. Clearly I wasn't interesting enough to get him to take note :).

Back in the early 2000s, I got quite interested in GPU algorithms. This came about from a summer project that Nabil Mustafa was doing with Shankar Krishnan and myself on real-time map simplification. Shankar had this cool idea to try and accelerate the Douglas-Peuker algorithm by implementing it on a GPU, which at the time meant trying to retrofit the algorithm to a very strange OpenGL rendering pipeline.

One thing led to another, and it got us thinking more abstractly about what kinds of computations could be done on the GPU. This was pre-CUDA, which meant that essentially we could abstract the GPU machine model as a massively parallel SIMD architecture in which each "processor" ran a simple straight line program (no loops, limited conditionals, and small memory - much like a streaming algorithm). The parallelism lent itself nicely to geometric problems, and we wrote papers on map simplification, depth contours, and general geometric optimization. We also designed an algorithm for a CSG rendering problem that had buried inside it a simple streaming model for proving lower bounds: in fact we were able to show a exponential gap (in the number of streaming passes needed) between deterministic and randomized median finding in this model (the conference wasn't interested in the lower bounds though :)).

In a bout of perfect timing, I stopped working on GPUs a year or so before CUDA. At the time, my reasons were simple. I was tired of the computational model changing on me every six months, and didn't have the resources to keep buying the latest and greatest graphics cards (although AT&T was very generous in getting me a state of the art card originally). It was also annoying that in order to really exploit the power of the hardware, I needed secret sauce that was only revealed if you had NVIDIA inside connections (or went to the NVIDIA U events).

Then CUDA came along and everyone went nuts. If you go to hgpu.org (originally gpgpu.org) you'll see the number of papers being published every year on GPU-accelerated methods. CUDA changed (quite radically) the way in which we thought about GPU algorithms: the memory model became more sophisticated, and the SIMD component was more powerful, although it was still a good idea to avoid excessive looping and conditionals.

Eventually I got dragged back into the GPU world. I collaborated on a  paper on graph coloring which was very instructive in demonstrating the differences between a GPU and straight up parallel machine. I also did a series of lectures on GPU algorithms at a 2012 MADALGO summer school on new models of computing. Incidentally, folks at Utah spend a lot of time working on issues relating to GPUs: Mary Hall's group has some nifty autotuning tools for mapping algorithms to the GPU, and Ganesh Gopalakrishnan has been thinking about memory consistency issues for GPUs.

Is there a good "GPU model" that theoreticians can study ? I think this is the wrong question. I think the GPU is one point (multi-core CPUs are another) in a space of tradeoffs between local computation, memory latency, and communication. I've been saying for a while now that communication appears to be the key resource to understand theoretically when dealing with our new networked/distributed/multiprocessor world. GPU algorithms are interesting to study as one example of how communication affects a computation. But the "right" way to deal with this involves focusing on communication, and not the GPU model per se. The latter has many idiosyncracies and what I'd consider distractions that take away from a clean formal understanding.

I should add that I'm not even the only person from theory-land who's been thinking about GPU-derived models. There's a great body of work on multicore models (see Phil Gibbons' lecture notes at the same summer school), and I recall Michael Mitzenmacher having recent work on GPU-accelerated hashing. Uzi Vishkin had a STOC workshop a few years ago on multicore models as well. (Turing Award winner) Valiant's bridging model clearly needs to get a lot more attention as well so that people don't think no one's paying attention to the models.

For more on this, also see my colleague Jeff Phillips' lecture notes on GPUs from his class on new models of computation.

Saturday, March 23, 2013

Free, Freemium, and Paid

There was a time when I'd bridle at the idea of having to pay for software or services. But I browse the iTunes app store now, and see people pleading to have the chance to pay for an app that they like, so that the authors won't stop updating it. The whole kerfuffle with Google Reader, Keep and Evernote is another example of how people have begun to prefer to pay for products, rather than rely on something free.

It feels like the end of an era where open source and free software (not the same thing, but often referred to in the same breath) were the default. Maybe we've come to the realization that nothing is really ever free, and that it's more realistic to get the costs out in the open rather than "being the product".

Friday, March 15, 2013

The SIGACT CG column

As you might have just discovered, I'm the second half of the two-headed monster that's taken over the SIGACT Geometry Column after Joe O'Rourke stepped down (Adrian Dumitrescu, who hopefully does not mind being referred to as the head of a monster, is the other half). My first column is up, and it talks about some recent fascinating developments in nonnegative matrix factorization.

My hope with the pieces I write is to cover areas of geometry that may have not had sufficient representation in the past (especially things closer to problems I'm interested in). My next column is due in August, and apart from doing a wrapup on SoCG, other things that come to mind include Laplacians and graph geometry, reproducing kernels, or even Bregman geometry.

But everything is now mobile, crowdsourced and social networked, so I'm looking for your suggestions on interesting topics to cover, new emerging areas, results that I'm not tracking, and so on. So post away here or on G+.

Monday, February 25, 2013

Data analysis, interpretation and explanations

There was a recent data-related kerfuffle between the New York Times and the makers of the Tesla electric car. If you haven't read the articles, (and the NYT public editor's post on this has good links), the crude summary is this:

  • NYT reporter takes Tesla on long test drive, and reports problems with running out of charge.
  • Tesla Motors CEO Elon Musk writes a long takedown of the reporter's review, complete with graphs generated from data the car recorded during the drive
  • NYT comes back with their own interpretation of data, rebutting Musk's claims.
  • Others attempt to reproduce the reporter's experience and fail, but arguably in different weather conditions that might or might not make a difference.
In an insightful meta-analysis of the dustup, Taylor Owen of the Tow Center for Digital Journalism discusses the implications for journalism in a data-driven world. He also references an article by David Brooks that makes the point:
People are really good at telling stories that weave together multiple causes and multiple contexts. Data analysis is pretty bad at narrative and emergent thinking...
I've felt for a while now that it's time to design mechanisms for providing context and narrative to data analysis*. Some of the research my student Parasaran does is on metaclustering: essentially the meta-analysis of different perspectives (clusterings) of data to draw out a larger meaning. We've also just submitted a paper on how to derive local markers of prediction validity in clustering (rather than just relying on global measures of quality). And my seminar this semester is about the larger problems of explanations and accounting in data mining.

I think as computer scientists, we have a lot to offer in the realm of data mining - not just in the design of tools for prediction, but in the design of tools to facilitate better understanding.


* None of this is surprising to experimental scientists, who will routinely attack a problem from multiple directions in order to acquire a larger understanding of a phenomenon rather than just the ability to predict. 


Friday, February 08, 2013

TCS+ hangout on the TSP polytope.

+Thomas Vidick  and the folks at the +TCS+ community have started a new experiment in G+ hangout talks. The first talk in this series was by Ronald de Wolf on the STOC 2012 paper that showed that there was no polynomial-sized lifted representation of the TSP polytope.

Overall, it was a very pleasant experience. I was able to reserve one of the 10 slots (a Hangout limit) for myself and my students at the University of Utah to attend and interact with the speaker, and from Thomas's post-review it seems that many more were signed on for "view-only" access to the stream.

There were very few technical hiccups, and +Oded Regev did a great job making sure that people were muted when not talking, and had a chance to ask questions in an orderly way. The chat sidechannel was a great help.

The talk itself was the best part: de Wolf did an excellent job conveying the main ideas of the proof without getting bogged down in details, and it felt as comfortable as listening to a talk live at a conference. Given the number of people listening in, this was already approaching medium-sized-workshop levels.

I'm looking forward to more of these events, and I'm glad that the +TCS+  folks are doing this. I also hope they can try more experiments with Google Hangout. For example, two ideas come to mind:

  • A reddit-style AMA ("Ask Me Anything"). One way to make this work is that the speaker would do a short presentation (maybe 5-10 minutes) and then would open up the floor for questions. To keep things manageable, people could write in questions on chat, and the moderator could filter them and ask the questions live. With sufficient preparation, some questions could be submitted ahead of time.
  • A live panel discussion with a moderator and a few participants, which again could have questions from the audience moderated by the moderator.

Monday, January 21, 2013

Accountability in Data Mining: A new seminar

Glencora Borradaile makes a number of interesting points about the implications of Aaron Swartz's life and work for us academic computer scientists. 
As computer science academics we are in a very powerful position. We are trusted with shaping the next generation that will make very important decisions that will have far-reaching social implications. Decisions like those over Facebook’s privacy defaults, motivating technology that enables autonomous private vehicles at the expense of the public interest, defining ownership of electronic media. We make those decisions ourselves in our research; what we research, how we allow our research to be used.
 Whether or not you agree with her particular policy preferences, the point remains that the technologies we develop can have lasting consequences for the "shape" of the world to come. And if we don't speak up (either through our work, or through our own advocacy), others will take the technologies we develop and find their own ways of using or misusing them.

All of this leads up to my current interests in data mining. I've been thinking about the problems of accountability in data mining for a while now: initially mostly in private, but now more and more in public (along with +Graham Cormode and +Andrew McGregor) as I see the importance of the issue.

What is accountability in data mining ? It means many things really, but for me, for now, it means this:

The fruits of data mining pervade every aspect of our life. We are recommended books and movies; given differential pricing for insurance; screened for potential terror threats; diagnosed with various diseases; and targeted for political advertising. The ability to sift through massive data sets with sophisticated algorithms has resulted in applications with impressive predictive power. 
Since the internals of a learning process can be complex and opaque, it is important to verify that the results satisfy the properties claimed by the algorithm. The importance of this goes beyond merely checking the algorithm itself. Validation mechanisms also provide a way for users to demand accountability from authorities who might use the results of mining to affect users in some way (by changing their insurance rates, or even putting them on a terrorism watchlist). As the results of data mining affect more and more of our lives, the more crucial it is that the user be able to validate decisions made on their behalf and that affect them. 
The above is an introduction to a seminar I'm running this semester on this topic. I'm a little nervous about it, because the topic is vast and unstructured, and almost anything I see nowadays on data mining appears to be "in scope". I encourage you to check out the outline, and comment on topics you think might be missing, or on other things worth covering. Given that it's a 1-credit seminar that meets once a week, I obviously can't cover everything I'd like, but I'd like to flesh out the readings with related work that people can peruse later.

It's entirely possible that we don't care about our privacy any more (as Facebook seems to think*). But we will care about the consequences of the use of our data. My perspective is that a better understanding of what is technically possible (and not possible) to demand and get accountability will be critical to informing larger policy discussions on this topic.

* In an earlier version of this post I wrongly attributed this sentiment to an individual. Apologies.

Disqus for The Geomblog