"Whatever comes in sufficiently large quantities commands the general admiration." Trurl the Constructor, from Stanislaw Lem's Cyberiad.I've been reading Malcolm Gladwell's masterful article on the Enron scandal, and he frames it with the device of 'puzzles' vs 'mysteries':
There is a fundamental problem that comes up when you start messing with "data". Our training in algorithms makes us instinctively define a "problem" when working with data, or any kind of applied domain. Many of the problems in clustering, like k-center, k-median, k-means, or what-have-you, are attempts to structure and organize a domain so we can apply precise mathematical tools.
The national-security expert Gregory Treverton has famously made a distinction between puzzles and mysteries. Osama bin Laden’s whereabouts are a puzzle. We can’t find him because we don’t have enough information. The key to the puzzle will probably come from someone close to bin Laden, and until we can find that source bin Laden will remain at large.
The problem of what would happen in Iraq after the toppling of Saddam Hussein was, by contrast, a mystery. It wasn’t a question that had a simple, factual answer. Mysteries require judgments and the assessment of uncertainty, and the hard part is not that we have too little information but that we have too much. The C.I.A. had a position on what a post-invasion Iraq would look like, and so did the Pentagon and the State Department and Colin Powell and Dick Cheney and any number of political scientists and journalists and think-tank fellows. For that matter, so did every cabdriver in Baghdad. [....]If things go wrong with a puzzle, identifying the culprit is easy: it’s the person who withheld information. Mysteries, though, are a lot murkier: sometimes the information we’ve been given is inadequate, and sometimes we aren’t very smart about making sense of what we’ve been given, and sometimes the question itself cannot be answered. Puzzles come to satisfying conclusions. Mysteries often don’t.
In a sense, we treat these problems like puzzles to be solved. The game is then to find the best solution, the fastest, the most accurate; but the structure of the puzzle has been set. We can change the game (and we often do), but once again, the goal is to crack the puzzle.
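To make the "puzzle" framing concrete, here is a minimal sketch of Lloyd's algorithm for the k-means problem mentioned above: given a set of points and a number k, find k centers minimizing the sum of squared distances from each point to its nearest center. The function names and toy data below are my own illustrative choices, not anything from the post.

```python
# A minimal sketch of Lloyd's algorithm for the k-means "puzzle".
# Once the objective is fixed, the game is purely algorithmic:
# alternate assignment and update steps until the centers settle.
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k random data points
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            clusters[nearest].append(p)
        # Update step: each center moves to its cluster's mean.
        for j, cl in enumerate(clusters):
            if cl:  # leave a center in place if its cluster is empty
                dim = len(cl[0])
                centers[j] = tuple(
                    sum(p[d] for p in cl) / len(cl) for d in range(dim)
                )
    return centers

# Two well-separated toy clusters in the plane.
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers = kmeans(pts, k=2)
```

On data this clean, the algorithm quickly recovers the two cluster means; the point of the post, of course, is that real data rarely tells you what k should be, or whether squared distance is the right objective at all.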
But when you get down and dirty with data, you start seeing the problems that Gladwell describes. If your goal is to "understand" the data, then more data is not necessarily better; it can just cause more confusion. What you need are interpretive skills, rather than number-crunching or even problem-solving skills.
This is what makes data mining so hard and exasperating, and yet so important. The need is clearly there, and there are mysteries to mine. But we've been attacking data mining problems as puzzles, only to realize fairly quickly that solving the puzzle doesn't reveal the mystery in the data.
I've often likened data mining research to an ooze: it is thin and spreads horizontally, without much depth. But I think that's because the puzzles we solve are of limited range, and not terribly deep. What we seem to need are interpretive frames rather than algorithmic frames; frames that tell us about invariances in the data, rather than about quirks of representation.