The following is a summary of a talk I gave to undergraduates interested in going on to graduate school. It's targeted at the layperson, and tries to convey a sense of the interplay between data mining and geometry. I gave this talk partly because I realized that things that we take utterly for granted in the rarified world of high dimensional data mining are completely foreign to people who don't think about this for a living.
tl;dr: High dimensional geometry (and non standard geometry) is the unifying language that binds data mining problems together.
We're trying to find patterns all over the place, with very different kinds of data. A search engine is trying to find patterns in web pages that might indicate that they have similar content. Brain researchers are throwing MRIs of patients with diseases into an algorithm that attempts to infer patterns in brain scans that might yield clues about pathology and diagnosis. Genome-wide analysis takes what are essentially long strings of letters and tries to explain why certain populations might be susceptible to certain diseases.
Pandora, Shazam and other music sites analyze fragments of music to find related artists, or even just match a tune. While an infinite gangnam-style video might be a frivolous use of data mining on video streams, video scans are being analyzed by robots trying to drive unmanned cars or even just play soccer. Social networks are all the rage: Facebook, Twitter and Google are desperate to understand your circle of friends in order to sell things to you more effectively.
How are we able to find patterns, clusters and trends in such different kinds of data ? The key step in all of this is the idea of features. The trick is to describe each object we are studying as a sequence (or set) of features. For example, a feature set for a document could be the number of times each particular word appears. The feature set for an image could be the count of different colors. For a tune, it might be a collection of features identified by hiring artists to list out which of more than 700 characteristics a piece of music has.
And so on, and so on. What this allows us to do is represent each object as a set of features, whether it's a web page, a brain scan, a video, or even a social graph. Once we do that, we no longer have to worry about the original data (well, kind of !), and different types of data are all on an equal footing.
But what now ? Here's where a cool trick that dates back to the 1600s comes in.
I doubt that René Descartes ever heard of the term "data mining". But in a strange way, he's the father of the field. One of Descartes' claim to fame was establishing a link between geometry and algebra. He said that if we wanted to represent points, lines and other shapes, we could do so not abstractly as Euclid and other classical geometers did, but using algebra. A "point" in the plane could be described by two coordinates (x,y), and a line in the plane could be described by the equation y = mx + c.
This is a very simple idea - children learn it in middle school. And yet like most simple ideas, it was revolutionary, and completely changed our perspective on geometry. The unification of algebra and geometry is so powerful and so natural, we use it almost unconsciously today.
But what does Descartes have to do with data mining ?
Remember those features I was telling you about ? That every object can be represented by a sequence of numbers, each number describing some aspect of the object.
Let's do the opposite of what Descartes proposed ! In other words, let's pretend that the objects are actually points in a space. Now this space is not the plane (unless we had only two features). It's a high dimensional space, with one feature for each dimension.
This process is entirely artificial. There is no real reason why we must think of objects as points in a high dimensional space. But here's the cool thing that happens. We can now express all basic mining primitives as geometric concepts, by translating the language of the primitive to this high dimensional space.
A clustering of data becomes a grouping of points so that "nearby" points are grouped together. A classification of reviews into positive and negative statements becomes a way to separate "positive" points and "negative" points by a line. Finding trends in data becomes the problem of fitting a straight line to a collection of points.
It is hard to emphasize how utterly bizarre this is. There is no underlying "geometry" in the problem we're solving. We're essentially creating a castle out of thin air in order to understand the problem. And yet it works, and is the bedrock of how we think about data mining.
But wait ! there's more. What exactly does it mean to say that points are "nearby" or they are "separated" ? To answer this question, it's not enough to view objects as points in a high dimensional space. You have to give this space a shape - a geometry (and also a topology, but I'll skip that for now).
For example, if I have two feature lists, how do I measure the distance between them ? If they were points on a map, I could do the usual straight line distance. But does that really capture how far apart they are ? After all, if you're in downtown Salt Lake with its grids, a "crow flies" distance doesn't help estimate how far things are. If you're on the surface of the earth, you can't really tunnel through the center to get from one point to another.
So we have to create the shape of the space by defining how far apart two points are. And this is one of the trickiest parts of data mining. Either we have to use some domain knowledge to estimate which features are significant and control more of the distance between objects, or we have to try and learn what seems like the right distance between objects based on user estimates or other information.
The good thing is that we have a huge collection of shapes to play with, and different shapes tend to work well for different classes of objects. Some are easy to interpret, others are easy to compute, and so on. So a good part of the "art" in data mining is understanding the data and estimating the right geometry for it, so that our tasks (expressed geometrically) give us meaningful answers.
Or as Dorothy famously said to her dog, "Toto, I don't think we're in the plane any more!"