+John Regehr and +Shriram Krishnamurthi have discussions (John, Shriram) arising from a new paper on reproducibility in computer systems research from people at the U. of Arizona. You should read their posts/G+ discussions for the specifics.
But it got me thinking: is there anything intrinsically different about reproducibility in my neck of the woods ? Many concerns are shared:
- The need for a standard computing environment to build and test code: MATLAB and scikit-learn make that a lot easier for us: we don't need to worry too much about details of the computing environment/hardware just to get the code to run.
- Discrepancies between what is stated in the paper and what the code actually does: that's a problem with any experimental work: John in fact mentions one such issue in one of his recent papers.
- Code rot: otherwise known as 'graduation' :). I think if more people used public repos like github from the beginning, some of these problem might go away. I would be remiss if I didn't also plug +Robert Ricci's new system AptLab for helping distribute research.
But here's one that may not be shared: data preparation.
It's no secret that data preparation takes most of the time in a data mining pipeline. This includes
- data cleaning (removing errors, reformatting data)
- feature engineering (deciding how to truncate, modify or retain certain features)
- training (cross-validation levels, what data is used for model building and so on)
As with any of the above, a proper specification can ensure true reproducibility, but there are lots of things we might do to data without necessarily even realizing it (if there are bugs in the transformation pipeline) or without thinking we need to mention it (truncating ranges, defining null values away).
Feature engineering can also be a problem, especially with models that use many features (for example, deep learning systems have lots of "knobs" to tune, and they can be quite sensitive to the knobs).
So one thing I often look for when reviewing such papers is sensitivity: how well can the authors demonstrate robustness with respect to the parameter/algorithm choices. If they can, then I feel much more confident that the result is real and is not just an artifact of a random collection of knob settings combined with twirling around and around holding one's nose and scratching one's ear.