Sunday, July 17, 2005

The Good-Turing Estimator

Suppose you're sampling from a distribution, say of English words. You see a word, update its count, and repeat for a number of trials. After some trials, you will have counts that give you some idea of the frequency of words, and the simple process of normalizing the word counts by dividing by the total number of samples yields a distribution, the maximum likelihood estimator, that represents your best guess of the original distribution.

But is it ? The problem with this process is that it assigns all probabilities to things you have already seen. Any element you haven't seen is given a probability of zero, which seems wrong, as well as implying that you will never predict seeing a new element if you rely on the estimated distribution to predic tthe next item.

This seemingly dry problem has a rather colorful history. Back during World War II, the Allied forces had a crack team of cryptographers dedicated to decoding Enigma, the German code used to transmit messages to German commanders in the field. Alan Turing was heavily involved with this effort, which was fictionalized quite dramatically in Neal Stephenson's book, "Cryptonomicon".

At the time, British forces had captures the codebook used by the Germans to generate keys for their messages, and the problem that I. J. Good and Turing faced was how to determine the distribution governing which pages of the books were chosen. Messages that they decrypted told them which page a particular message had come from ("the sample").

I'll get to their solution in a second. But firstly, it's worth noting that this problem was examined long before Good and Turing. Laplace suggested adding one to each count before normalizing, thus guaranteeing that elements not seen had a non-zero probability. More precisely, the probability of seeing a new element was then given by

Pr(new) = (n - n>=1)/(n + s),

where n is the number of elements in the universe, s is the number of samples, and n>=1 is the number of elements seen thus far.

This is called the "add-one" solution for obvious reasons, and other constants were suggested as well (Krichevsky and Trofimov suggested an "add-1/2" solution that has some optimality properties). But in general, these methods were really only suitable for small universe sizes, where the number of samples was likely to be larger than the number of elements.

Good and Turing proposed an estimator that, in a bizarre twist, not only include terms relating to the frequency of elements, but also include terms that relate to the frequency of frequency of elements. A general form of the estimator they use is given as follows.

For a given word X, let r be the number of times you've seen X, and let Nr be the number of different words seen exactly r times. Then the probability of seeing X in the next sampling step (after seeing N samples) is

Pr(X) = [(r+1)/N] * [E(Nr)/E(Nr+1)].

Notice how the expected values of the frequencies Nr appear in the expression. Also, glossed over is a rather nontrivial issue of how to compute E(Nr), in a process called smoothing that accounts for many of the Good-Turing variants.

Good later wrote that Turing had some "intuitive demonstration" for the validity of this formula. It has been very successful in natural language processing since then, but a clear understanding of how it works has eluded researchers. The amount of work on this estimator is too voluminous to detail here; it's probably easiest to start backwards, with a recent paper in FOCS by Orlitsky, Santhanam and Zhang, and a slightly older paper in COLT by McAllester and Schapire.

p.s This was prompted by some delightful conversations I had about estimators with Alon Orlitsky and Suhas Diggavi when I was at EPFL visiting Suhas and giving this talk.
Post a Comment

Disqus for The Geomblog