There’s this really cool idea for learning nonlinear functions with simple techniques, launched quite a while ago by Rahimi and Recht (2007), and which I haven’t fully grasped yet. The idea of random projections is becoming more popular these days; the problem of learning higher-order functions simply comes up in so many places.

Suppose there are $N$ data points $\{x_i, y_i\}$, where $x_i$ has $d$ covariates. In one case of the problem, the task is to train a classifier $f$ via

$$\min \mathcal{L}(f; X, y) \equiv \min \frac{1}{N}\sum_{i=1}^N (f(x_i) - y_i)^2$$

where $f$ has the functional form

$$f(x) = \sum_{i=1}^N \alpha_i k(x, x_i) = \sum_{i=1}^\infty \alpha(w_i)\,\phi(x; w_i)$$

where the first equality holds by the Representer Theorem, and the second expands the kernel using $\phi: X\times\Omega\to\mathbb{R}$, known as a feature function, which satisfies $\phi(x)^T\phi(y) = k(x, y)$.
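For intuition, here is a standard finite-dimensional example of a feature function (my own illustration, not taken from the papers above): in $d = 2$, the polynomial kernel $k(x, y) = (x^T y)^2$ is reproduced exactly by the three-dimensional map

$$\phi(x) = \big(x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2\big)^T, \qquad \phi(x)^T\phi(y) = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x^T y)^2.$$

Kernels like the Gaussian RBF admit no such finite map, which is what makes the randomized approximation below useful.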

This says that one can avoid training the kernel machine directly, which requires computing an $N\times N$ matrix comprised of the kernel entries $k(x_i, x_j)$, and instead train in the feature space $\phi(\cdot)$. This feature space is infinite-dimensional, so we approximate it with one of finite dimension $k$:

$$f(x) \approx \sum_{i=1}^k \alpha(w_i)\, z(x; w_i), \qquad k(x, y) = \phi(x)^T\phi(y) \approx z(x)^T z(y)$$
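One way to read the right-hand approximation (a standard Monte Carlo argument, filled in here under the assumption that the kernel can be written as an expectation $k(x, y) = \mathbb{E}_{w\sim p}\big[\phi(x; w)\,\phi(y; w)\big]$ over some distribution $p$): draw $w_1, \dots, w_k$ i.i.d. from $p$ and scale the features as $z(x; w_i) = \phi(x; w_i)/\sqrt{k}$, so that

$$z(x)^T z(y) = \frac{1}{k}\sum_{i=1}^k \phi(x; w_i)\,\phi(y; w_i) \;\longrightarrow\; \mathbb{E}_{w\sim p}\big[\phi(x; w)\,\phi(y; w)\big] = k(x, y)$$

as $k\to\infty$, by the law of large numbers.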

Thus, given the $w_i$’s and the function $z$, the task reduces to performing linear regression over the $z$’s. We form this finite-dimensional representation by drawing the $w_i$’s at random; the choice of their distribution induces the functional form of the $z$’s.

Random Fourier features.

  • Generate a random $k\times d$ matrix $W$, e.g., with each entry $W_{i,j}\sim\mathcal{N}(0,1)$.
  • Compute the $k\times N$ feature matrix $Z$, where entry $Z_{i,j}$ is the $i^{th}$ feature map evaluated at the $j^{th}$ data point:
$$Z_{ij} = z_i(x_j; w_i) \equiv \exp(i\, w_i^T x_j)$$

This implies

$$\min \mathcal{L}(f; X, y) \approx \min_{\alpha\in\mathbb{R}^k} \frac{1}{N}\sum_{j=1}^N \big(z(x_j)^T\alpha - y_j\big)^2, \qquad k(x, y) \approx z(x)^T z(y)$$
  • Perform linear regression: $\mathrm{Regress}(y \sim Z)$, e.g., $\alpha = (Z Z^T)^{-1} Z y\in\mathbb{R}^k$.

The result is an approximation to the classifier with the Gaussian RBF kernel.
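Here is a minimal NumPy sketch of the three steps above (my own illustration, not code from any of the papers). It swaps the complex feature $\exp(i\, w^T x)$ for the real-valued version commonly used in practice, $\sqrt{2/k}\,\cos(w^T x + b)$ with $b\sim\mathrm{Unif}[0, 2\pi]$, which has the same expected inner product, and it adds a tiny ridge term so that the $k\times k$ system is well conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N points with d covariates and scalar targets.
N, d, k = 500, 5, 200
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

# Step 1: random k x d matrix W with N(0, 1) entries, plus random phases b,
# so we can use the real feature cos(w^T x + b) in place of exp(i w^T x).
W = rng.normal(size=(k, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=k)

def features(X):
    """Return the k x N matrix Z of random Fourier features."""
    return np.sqrt(2.0 / k) * np.cos(W @ X.T + b[:, None])

# Step 2: the feature matrix on the training data.
Z = features(X)  # shape (k, N)

# Step 3: linear regression over the features, alpha = (Z Z^T)^{-1} Z y,
# with a small ridge term added for numerical stability.
alpha = np.linalg.solve(Z @ Z.T + 1e-6 * np.eye(k), Z @ y)

# Predictions at new points are z(x)^T alpha.
X_test = rng.normal(size=(10, d))
y_hat = features(X_test).T @ alpha

# Sanity check: feature inner products approximate the Gaussian RBF kernel
# exp(-||x - x'||^2 / 2), since the rows of W are draws from N(0, I_d).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
print(np.abs(np.exp(-0.5 * sq_dists) - Z.T @ Z).max())
```

Note that with $k \ll N$ the regression only ever touches the $k\times k$ matrix $Z Z^T$, never the $N\times N$ kernel matrix, which is the computational point of the whole exercise.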

Confused about why this works? There’s an old 2009 video of Rahimi presenting the idea in more generality, and also a short overview of the paper in the context of previous literature available here.

To motivate this from a statistics and/or signal processing point of view, Guhaniyogi and Dunson (2013) have this really cool paper which uses it for regression in Bayesian statistics.

A table of the training time and storage, and the test time and storage, for this procedure can be found in Le et al. (2013; Table 1), which is itself a more recent extension that makes the procedure even easier to store and that approximates the Gaussian RBF kernel to within some $\epsilon$ accuracy. For a more direct connection to Gaussian processes, a related paper (and the most recent on these ideas) is Yang et al. (2015).

References

  • Rajarshi Guhaniyogi and David B. Dunson. Bayesian Compressed Regression. arXiv preprint arXiv:1303.0642, 2013.
  • Quoc Viet Le, Tamas Sarlos, and Alexander Johannes Smola. Fastfood: Approximating Kernel Expansions in Loglinear Time. In International Conference on Machine Learning, 2013.
  • Ali Rahimi and Benjamin Recht. Random Features for Large-Scale Kernel Machines. In Neural Information Processing Systems, 2007.
  • Ali Rahimi and Benjamin Recht. Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning. In Neural Information Processing Systems, 2009.
  • Zichao Yang, Alexander J. Smola, Le Song, and Andrew Gordon Wilson. A la Carte—Learning Fast Kernels. In Artificial Intelligence and Statistics, 2015.