Wednesday, April 04, 2007

Inventor of 'idf' passes on

Karen Sparck-Jones, professor at Cambridge University, winner of the Lovelace Medal and the Allen Newell Award (among other awards), died today. She was a pioneer in NLP and information retrieval, and is also the inventor of one of the most fundamental notions in information retrieval: the concept of IDF (or inverse document frequency).

Information retrieval methods are at the heart of web search today, even in the age of link structures and PageRank. One of the most elementary and yet most enduring ideas in information retrieval is the concept of a document as a "bag of words"; namely that even if we encoded a document as a (weighted) set of terms occuring in it, we can extract meaningful information from it. The 'bag-of-words' model itself dates back to at least to Gerald Salton's IR book in 1968; it is probably much older.

Once you think of documents as bags of "terms", you need a way of weighting these terms in a way that allows "similar" documents to have a similar set of weights. Term frequency (TF) is one way of doing this: weight a term with a number proportional to how often it appears. This makes some sense; a document about soccer might use the word 'football' or 'soccer' fairly often, as compared with a document on cricket.

However, words like 'a', 'an' and 'the' also appear very often in a document. How does one prevent these terms from washing away any semantic context one might have ? One idea is to weight the term frequency by something called 'Inverse Document Frequency' (IDF). For a given collection of documents, count in how many documents a fixed term appears. This number is the term's document frequency. Clearly, a term that appears in all documents can't be a useful distinguisher of a document and so the larger this number is, the less relevant the term. Therefore, by multiplying the term frequency by a function inversely related to the document frequency (if p is the fraction of documents a term appears in, the IDF is often set equal to log(1/p)), you get a more accurate estimate of the importance of the term.

This to me is a very modern way of dealing with documents, and yet the idea itself was proposed by Karen Sparck-Jones back in 1972, well before we could imagine search engines, and fast computers that could crunch these numbers in a jiffy. I'm not an expert on IR methods, but I dare say that a simple TF-IDF computation is, even today, a first preprocessing step when organizing document collections.

(Via Hal Daume)
Post a Comment

Disqus for The Geomblog