There’s this really cool idea for learning nonlinear functions with simple techniques, launched quite a while ago by Rahimi and Recht (2007), and which I haven’t fully grasped. Yet, the idea of random projections is becoming more popular these days; the problem of learning higher order functions simply comes up in so many places.

Suppose there are $N$ data points $\{x_i,y_i\}$ where $x_i$ has $d$ covariates. In one case of the problem, the task is to train a classifier $f$ via

where $f$ has the functional form

where the second equality is true by the Representer Theorem, where $\phi:X\times\Omega\to\mathbb{R}$ is known as a feature function such that $\phi(x)^T\phi(y) = k(x,y)$.

This says that one can avoid training the kernel machine directly, which requires computing a $n\times n$ matrix comprised of kernel entries $k(x_i,x_j)$; instead, train in the feature space $\phi(\cdot)$. This feature space is infinite-dimensional, so we approximate it with one of finite dimension $k$:

Thus, given $w_i$’s and function $z$, the task reduces to performing linear regression over the $z$’s. We form this finite-dimensional representation using random features to generate $w_i$’s and whose distribution induces a functional form for the $z$’s.

Random Fourier features.

• Generate a random $k\times d$ matrix $W$, e.g., for each entry $W_{i,j}\sim\mathcal{N}(0,1)$.
• Compute the $k\times N$ feature matrix $Z$, where entry $Z_{i,j}$ is the $i^{th}$ feature map on the $j^{th}$ data point

This implies

• Perform linear regression: $\mathrm{Regress}(y \sim Z)$, e.g., $\alpha = (Z^TZ)^{-1}Zy\in\mathbb{R}^k$.

The result is an approximation to the classifier with the Gaussian RBF kernel.

As confused as I am why this works? There’s an old 2009 video of Rahimi presenting the idea in more generality, and also a short overview of the paper in context of previous literature available here.

To motivate this from a statistics and/or signal processing point of view, Guhaniyog and Dunson (2013) have this really cool paper which uses it for regression in Bayesian statistics.

A table of training time and storage, and test time and storage for this procedure is found in Le et al. (2014; Table 1), which itself is a recent extension which makes the procedure even more easy to store and approximate the Gaussian RBF kernel within some $\epsilon$ accuracy. For a more direct connection to Gaussian processes, a related paper (and the most recent paper related to these ideas) is Yang et al. (2015).

• Rajarshi Guhaniyogi and David B. Dunson. Bayesian Compressed Regression. arXiv preprint: arXiv:1303.0642, 2013.
• Quoc Viet Le, Tamas Sarlos, and Alexander Johannes Smola. Fastfood: Approximate Kernel Expansions in Loglinear Time. In International Conference on Machine Learning, 2013.
• Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Neural Information Processing Systems, 2007.
• Ali Rahimi and Benjamin Recht. Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning. In Neural Information Processing Systems, pages 1–8, October 2009.
• Zichao Yang, Alex Smola, Song Le, and Andrew Gordon Wilson. A la Carte—Learning Fast Kernels. In Artificial Intelligence and Statistics, 2015.