Is there a proof of the Johnson-Lindenstrauss Lemma that can be explained to an undergraduate?
While there are different "simple" proofs of the JL Lemma (the Dasgupta-Gupta proof is one such), it's not clear whether these are "undergraduate" level. So instead of answering his original question, I decided to change it to
Is there a proof of the JL Lemma that isn't "geometric"?
It's a little odd to even ask the question, considering the intrinsic geometric nature of the lemma. But there's a reasonably straightforward way of seeing how the bound emerges without needing to worry too much about random rotations, matrices of Gaussians or the Brunn-Minkowski theorem.
Warning: what follows is a heuristic argument that helps suggest why the bound is in the form that it is: it should not be confused for an actual proof.
In its original form, the JL Lemma says that any set of $n$ points in $R^d$ can be embedded in $R^k$ with $k = O(\log n/\epsilon^2)$ such that all distances are preserved to within a $1+\epsilon$ factor. But the real result at the core of this is that there is a (random) linear mapping that, with constant probability, takes a unit vector in $R^d$ to a vector of norm in the range $1\pm \epsilon$ in $R^k$, where $k = O(1/\epsilon^2)$ (the rest follows by scaling and an application of the union bound).
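To spell out the union-bound step under an explicit assumption: suppose a single fixed unit vector has its norm distorted by more than $\epsilon$ with probability at most $\delta = \exp(-c\epsilon^2 k)$, where $c$ is an unspecified constant introduced here purely for illustration. Because the mapping is linear, the distance between two points $u$ and $w$ is the norm of the image of $u - w$, which can be rescaled to a unit vector. A union bound over all pairs then gives

$$\Pr[\text{some pair is distorted}] \le \binom{n}{2}\,\delta < \frac{n^2}{2}\exp(-c\epsilon^2 k) \le \frac{1}{2} \quad \text{whenever} \quad k \ge \frac{2\ln n}{c\,\epsilon^2},$$

which is where the $\log n$ in $k = O(\log n/\epsilon^2)$ comes from.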
Trick #1: Take a set of values $a_1, \ldots, a_n$ and set $Y = \sum_i a_i r_i$, where each $r_i$ is chosen independently to be +1 or -1 with equal probability. Then $E[Y^2] = \sum_i a_i^2$. This can be verified by an easy calculation.
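For completeness, the calculation can be written out as follows, using only that $r_i^2 = 1$ and that $E[r_i r_j] = E[r_i]E[r_j] = 0$ for $i \ne j$ by independence:

$$E[Y^2] = E\Big[\sum_{i,j} a_i a_j r_i r_j\Big] = \sum_i a_i^2\, E[r_i^2] + \sum_{i \ne j} a_i a_j\, E[r_i r_j] = \sum_i a_i^2.$$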
So now consider the vector $v$. Let's assume that $v$'s "mass" is roughly equally distributed among its coordinates. Take a random sample of $d/k$ of the coordinates of $v$ and apply the above trick to the values. Under the above assumption, the resulting $Y^2$ will have roughly $1/k$ of the total (squared) mass of $v$. Scale up by $k$.
This is one estimator of the squared norm of $v$. It is unbiased and it has a bounded maximum value because of the assumption. This means that we can apply a Chernoff bound to the average of $k$ such independent estimators. Roughly speaking, the probability that this average deviates from its mean by more than an $\epsilon$ fraction is $\exp(-\epsilon^2 k)$, giving the desired value of $k$.
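The sampling step can be sketched in a few lines of Python. This is only an illustration of the heuristic, not anyone's actual construction: the function names `sampled_sign_estimate` and `estimate_sq_norm` are made up here, NumPy is assumed, and the scale factor is written as $d/m$ (with $m$ the number of sampled coordinates), which equals $k$ when $k$ divides $d$.

```python
import numpy as np

def sampled_sign_estimate(v, k, rng):
    """One estimator of ||v||^2 (hypothetical helper, for illustration only).

    Samples m = d // k coordinates without replacement, combines them with
    random +/-1 signs (Trick #1), squares the result, and rescales.
    """
    d = v.shape[0]
    m = max(1, d // k)                      # number of sampled coordinates
    idx = rng.choice(d, size=m, replace=False)
    signs = rng.choice([-1.0, 1.0], size=m)
    y = np.dot(signs, v[idx])
    # E[y^2] = (m / d) * ||v||^2, so rescale by d / m (equal to k if k divides d).
    return (d / m) * y * y

def estimate_sq_norm(v, k, rng):
    """Average k independent estimators to reduce the variance."""
    return float(np.mean([sampled_sign_estimate(v, k, rng) for _ in range(k)]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 4096, 256
    v = np.ones(d) / np.sqrt(d)             # mass spread perfectly evenly
    print(estimate_sq_norm(v, k, rng))      # should be close to ||v||^2 = 1
```

On a vector whose mass sits in a handful of coordinates, the same code will usually miss those coordinates entirely and return a poor estimate, which is exactly why the equal-mass assumption matters.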
But how do we enforce the assumption? By applying a random Fourier transform (or actually, a random Hadamard transform). This "spreads" the mass of the vector out among the coordinates (technically by ensuring an upper bound on the $\ell_\infty$ norm).
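A minimal sketch of this spreading step, assuming the common form of the randomized Hadamard transform: flip the sign of each coordinate at random, then apply a normalized Walsh-Hadamard transform. The names `fwht` and `randomized_hadamard` are chosen here for illustration, NumPy is assumed, and the dimension is assumed to be a power of two.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    x = np.array(x, dtype=float)            # work on a copy
    n = x.shape[0]
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b               # butterfly: sums
            x[i + h:i + 2 * h] = a - b       # butterfly: differences
        h *= 2
    return x

def randomized_hadamard(v, rng):
    """Apply random sign flips, then a Hadamard transform scaled to preserve norms."""
    d = v.shape[0]
    signs = rng.choice([-1.0, 1.0], size=d)
    return fwht(signs * v) / np.sqrt(d)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    v = np.zeros(1024)
    v[5] = 1.0                               # worst case: all mass on one coordinate
    u = randomized_hadamard(v, rng)
    print(np.linalg.norm(u), np.max(np.abs(u)))
```

In this extreme example the norm is unchanged while every coordinate of the output has magnitude $1/\sqrt{d}$, so the mass is spread as evenly as possible. For a general vector the random signs are what make a small $\ell_\infty$ norm likely rather than guaranteed.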
That's basically it. Almost all the papers that follow the Ailon-Chazelle work proceed in this manner, with increasing amounts of cleverness to reuse samples, or only run the Fourier transform locally, or even derandomize the process. What distinguishes this presentation of the result from the earlier approaches (which basically boil down to: populate a matrix with entries drawn from a distribution having sub-Gaussian tails) is that it separates the "spreading" step (called the preconditioner) from the latter, more elementary step (the sampling of coordinates). It turns out that in practice the preconditioner can often be omitted without incurring too much error, yielding an extremely efficient (and sparse) linear transform.