Tuesday, July 24, 2007

STOC/FOCS/SODA: The Cage Match (with data!)

(Ed: This post, and the attendant web page and data, were initiated as a joint effort of Piotr Indyk and myself. Many others have helped improve the data, presentation and conclusions.)

Inspired by Michael Mitzenmacher's flamebait post on SODA/STOC/FOCS, we decided to roll up our sleeves and resolve the biggest outstanding issue in Theoretical Computer Science, namely the great "STOC/FOCS vs. SODA" debate ("P vs. NP" is a tainted second, sullied by all that money being offered to solve it). We have some interesting preliminary observations, and there are many interesting problems left open by our work ;)

The hypothesis:
There is a significant difference in citation patterns between STOC/FOCS and SODA.
The plan:

First, we obtained the list of titles of conference papers appearing in STOC, FOCS and SODA in the last 10 years (1997-2006). We deliberately excluded 2007 because FOCS hasn't happened yet. We got this list from DBLP. (Note: DBLP does not make any distinction between actual papers and tutorials/invited articles; we decided to keep all titles because there weren't that many tutorials/invited papers in any case.)

For each title, we extracted the citation count from Google Scholar, using a process that we henceforth refer to as "The Extractor". Life is too short to describe what "The Extractor" is. Suffice it to say that its output, although not perfect, has been verified to be somewhat close to the true distribution (see below).

The results, and methodology, can be found at this link. The tables and graphs are quite self-explanatory. All source data used to generate the statistics are also available; you are welcome to download the data and make your own inferences. We'll be happy to post new results here and at the webpage.

OBSERVATIONS:

The main conclusion is that the hypothesis is valid: a systematic discrepancy between citation counts of SODA vs. STOC/FOCS does appear to exist. However, the discrepancy varies significantly over time, with years 1999-2001 experiencing the highest variations. It is interesting to note that 1999 was the year when SODA introduced four parallel sessions as well as the short paper option.

Although most of the stats for STOC and FOCS are quite similar, there appears to be a discrepancy at the end of the tail. Specifically, the 5 highest citation counts per year for STOC (years 1997-2001) are all higher than the highest citation count for FOCS (year 2001). (Note: the highest cited STOC article in 2001 was Christos Papadimitriou's tutorial paper on algorithms and game theory.) The variation between SODA and STOC/FOCS in the 1999-2001 range shows up here too, between STOC and FOCS themselves. So maybe it's just something weird going on in these years. Who knows :)

Another interesting observation comes from separating the SODA cites into long and short paper groups (for the period 1999-2005). Plotting citations for short vs long papers separately indicates that the presence of short papers caused a net downward influence on SODA citation counts, but as fewer and fewer shorts were accepted, this influence decreased.

There are other observations we might make, especially in regard to what happens outside the peak citations, but for that we need more reliable data. Which brings us to the next point.

VALIDATION OF THE DATA:

To make sure that the output makes sense, we performed a few "checks and balances". In particular:
  • we sampled 10 random titles from each of FOCS, STOC and SODA for each of the 10 years, and for each title we checked the citation count by hand. Results: there were 7 mistakes in FOCS, 9 in STOC, and 11 in SODA, indicating a current error rate in the 9-10% range.
  • for each of FOCS, STOC, SODA, we verified (by hand) the values of the top 10 citation numbers, as reported by The Extractor
  • we compared our stats for the year 2000 with the stats obtained by Michael. The results are pretty close:





        Median (Mike's/Ours)    Total (over all 10 years) (Mike's/Ours)
FOCS    38/38                   3551/3315
STOC    21/21                   3393/2975
SODA    14/13                   2578/2520
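
For the curious, the arithmetic behind the sampling check in the first bullet can be sketched in a few lines of Python. This is only an illustration: the names (`mistakes`, `error_rate`, and so on) are ours and have nothing to do with The Extractor, and the only inputs are the figures quoted above.

```python
# Figures from the hand check: 10 sampled titles per venue per year, 10 years.
SAMPLES_PER_YEAR = 10
YEARS = 10

# Mistakes found by hand in each venue's sample (as reported in the post).
mistakes = {"FOCS": 7, "STOC": 9, "SODA": 11}


def error_rate(num_mistakes, sample_size):
    """Fraction of sampled titles whose citation count was wrong."""
    return num_mistakes / sample_size


sample_size = SAMPLES_PER_YEAR * YEARS  # 100 titles per venue
rates = {venue: error_rate(m, sample_size) for venue, m in mistakes.items()}

# Pooled rate across all three venues: total mistakes over total samples.
overall = sum(mistakes.values()) / (sample_size * len(mistakes))

for venue, rate in rates.items():
    print(f"{venue}: {rate:.0%}")
print(f"overall: {overall:.0%}")
```

Pooling the three samples gives 27 mistakes in 300 titles, i.e. 9%, which is where the "9-10%" estimate comes from.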

A CALL FOR HELP:
Warning: the data displayed here is KNOWN to contain errors (our estimate is that around 10% of citation counts are incorrect). We would very much appreciate any efforts to reduce the error rate. If you would like to help:
  1. choose a "random" conference/year pair (e.g., STOC 1997)
  2. check if this pair has been already claimed in this blog; if yes, go to (1)
  3. post a short message claiming your pair (e.g., "CLAIMING STOC 1997") on the blog.
  4. follow the links to check the citations. For each incorrect citation, provide two lines: (1) paper title (2) a Google Scholar link to the correct paper
  5. email the resulting file of corrections to Suresh Venkatasubramanian.
(and yes, we know that this algorithm has halting problems). Many thanks in advance. Of course, feel free to send us individual corrections as well.

Citation guidelines: The world of automatic citation engines is obviously quite messy, and sometimes it is not immediately clear what the "right" citation count of a paper is. The common "difficult case" is when you see several (e.g., conference and journal) versions of the same paper. In this case, our suggestion is that you ADD all the citation counts that you see, and send us the links to ALL the papers that you accounted for.
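
To make the ADD rule concrete, here is a minimal Python sketch of what a correction for a multi-version paper amounts to. Everything in it is hypothetical: the `Version` record, the `combined_count` helper, the sample counts and the example.org links are placeholders for illustration, not real data or part of our scripts.

```python
from dataclasses import dataclass


@dataclass
class Version:
    """One Google Scholar entry for a paper (hypothetical record layout)."""
    kind: str       # e.g. "conference" or "journal"
    citations: int  # citation count shown for this entry
    link: str       # link to this entry


def combined_count(versions):
    """Add up the counts of all versions and collect every link used."""
    total = sum(v.citations for v in versions)
    links = [v.link for v in versions]
    return total, links


# Made-up example: a conference version and a journal version of one paper.
versions = [
    Version("conference", 30, "https://example.org/scholar/conf-version"),
    Version("journal", 12, "https://example.org/scholar/journal-version"),
]

total, links = combined_count(versions)
print(total)
for link in links:
    print(link)
```

The point of returning the links along with the total is that we can then recheck exactly which entries went into the sum.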


Acknowledgements: Thanks to Adam Buchsbaum for suggesting the idea of, and generating data for, the short-long variants of SODA, and to David Johnson, Graham Cormode, and Sudipto Guha for bug fixes, useful suggestions and ideas for further plots (which we can look into after the data is cleaned up).