Tuesday, May 20, 2014

On beauty and truth in science.

Philip Ball writes a thought-provoking article in Aeon with the thesis that the kind of beauty that scientists describe does not necessarily match the aesthetic notions of art, and is not even consistent among scientists.

It was hard for me to get beyond the casual conflating of beauty in mathematics (the four-color theorem, the proof of Fermat's theorem, and proofs in general) and beauty in scientific theories (relativity, evolution, and so on). But if one goes beyond the artificial duality constructed by the author, the idea of beauty as a driver in science (and mathematics) is a rich one to explore.
A particular example: for a long time (and even codified in books) it was taught that there were five natural classes of approximation hardness: PTAS, constant factor-hard, log-hard, label-cover (superlogarithmic) hard, and near-linear hard. There were even canonical members of each class. 

Of course, this nice classification no longer exists. There are even problems that are $\log^* n$-hard to approximate, and can also be approximated to that factor. And to be fair, I'm not sure how strong the belief was to begin with. 

But it was such a beautiful idea. 

At least in mathematics, the search for the beautiful result can be quite fruitful. It spurs us on to find better, simpler proofs, or even new ideas that connect many different proofs together. That notion of connection doesn't appear to be captured in the article above: that beauty can arise from the way a concept ties disparate areas together. 

MADALGO Summer School on Learning At Scale

I'm pleased to announce that this year's MADALGO summer school (continuing a long line of summer programs on various topics in TCS) will be on algorithms and learning. The formal announcement is below, and registration information will be posted shortly. 

Save the date ! Aug 11-14, 2014.
MADALGO Summer School on 
   August 11- 14, 2014, Aarhus University, Denmark

The MADALGO Summer School 2014 will introduce attendees to the latest developments in learning at scale. The topics will include high dimensional inference,  algorithmic perspectives on learning and optimization, and challenges in learning with huge data. 
The school will be taught by experts in learning:
  • Amr Ahmed (Google)
  • Mikhail Belkin (Ohio State)
  • Stefanie Jegelka (Berkeley)
  • Ankur Moitra (MIT)
The summer school will take place on August 11-14, 2014 at Center for Massive Data Algorithmics (MADALGO) at the Department of Computer Science, Aarhus University, Denmark. The school is targeted at graduate students, as well as researchers interested in an in-depth introduction to Learning. Registration will open soon at the school webpage. Registration is free on a first-come-first serve basis - handouts, coffee breaks, lunches and a dinner will be provided by MADALGO and Aarhus University. 
  • Suresh Venkatasubramanian (University of Utah)
  • Peyman Afshani (MADALGO, Aarhus University)
  • Lars Arge (MADALGO, Aarhus University)
  • Gerth S. Brodal (MADALGO, Aarhus University)
  • Kasper Green Larsen (MADALGO, Aarhus University)
  • Trine Ji Holmgaard (MADALGO, Aarhus University)
  • Katrine Ƙstergaard Rasmussen (MADALGO, Aarhus University)
Center for Massive Data Algorithmics is a major basic research center funded by the Danish National Research Foundation. The center is located at the Department of Computer Science, Aarhus University, Denmark, but also includes researchers at CSAIL, Massachusetts Institute of Technology in the US, and at the Max Planck Institute for Informatics and at Frankfurt University in Germany. The center covers all areas of the design, analysis and implementation of algorithms and data structures for processing massive data (interpreted broadly to cover computations where data is large compared to the computational resources), but with a main focus on I/O-efficient, cache-oblivious and data stream algorithms.

Thursday, May 01, 2014

The history of the vector space model

Gerald Salton is generally credited with the invention of the vector space model: the idea that we could represent a document as a vector of keywords and use things like cosine similarity and dimensionality reduction to compare documents and represent them.

But the path to this modern interpretation was a lot twistier than one might think. David Dubin wrote an article in 2004 titled 'The Most Influential Paper Gerard Salton Never Wrote'. In it, he points out that most citations that refer to the vector space model refer to a paper that doesn't actually exist (hence the title). Taking that as a starting point, he then traces the lineage of the ideas in Salton's work.

The discoveries he makes are quite interesting. Among them,

  • Salton's original conception of the vector space model was "operational" rather than mathematical. In other words, his earliest work really uses 'vector space' to describe a collection of tuples, each representing a document. In fact, the earliest terminology used was the 'vector processing model'. 
  • In later papers, he did talk about things like orthogonality and independence, as well as taking cosines for similarity, but this was done in an intuitive, rather than formal manner. 
  • It was only after a series of critiques in the mid 80s that researchers (Salton included) started being more direct in their use of the vector space model, with all its attendant algebraic properties. 
Of course today the vector space model is one of the first things we learn when doing any kind of data analysis. But it's interesting to see that it didn't start as this obvious mathematical representation (that I've taken to calling the reverse Descartes trick). 

Disqus for The Geomblog