Sunday, April 30, 2006

Inadvertent Hashing and CDDB

Most people use the CDDB either knowingly or unknowingly. It's the backend to most computer CD players that downloads track/title information from a central site when you're playing a CD. Most player software allows you to add in information if CDDB doesn't return a hit, and I've done this every now and then. But overall I have been amazed by the hit rate of CDDB.

I assumed that CDs contained all the track information, and many do. But apparently, CDDB exploits an inadvertent hashing mechanism: the sequence of track times.
The CDDB database has information that allows your computer to identify a particular music CD in the CD drive and list its album title and track titles. Their service is used by RealJukebox, MusicMatch, WinAmp, and others. The title information is not stored on most CDs. The only information in the CD data is the number of tracks (songs) and the length of each. This is the information your CD player displays. What CDDB does is let the software on your PC take that track information, send a CD signature to CDDB through Internet protocols (if you're connected) and get back the titles. It works because songs are of relatively random length. The chances are good almost all albums are unique. (Figure there are about 10 songs on an album, and they each run from a minute and a half or so to three and a half minutes long, so the times vary by 100 seconds. There are 100x100x...x100 = 100**10 = 10**11 = 1 hundred billion = an awful lot of possible combinations.) An album is identified by a signature that is a special arithmetic combination of the times of all the tracks.

You'd figure that CDDB just bought a standard database with all the times and titles. Well, there wasn't one. What they did was accept Internet-relayed postings with the track timing information and the titles typed in by a volunteer.


Categories
Post a Comment

Disqus for The Geomblog