There’s this really cool idea for learning nonlinear functions with simple techniques, launched quite a while ago by Rahimi and Recht (2007), which I haven’t fully grasped. Still, the idea of random projections is becoming more popular these days; the problem of learning higher-order functions simply comes up in so many places.

Suppose there are $N$ data points $(x_n, y_n)$, where each $x_n \in \mathbb{R}^D$ has $D$ covariates. In one case of the problem, the task is to train a classifier $f$ via

$$
f^* = \arg\min_f\; \frac{1}{N} \sum_{n=1}^N L\big(f(x_n), y_n\big) + \lambda \|f\|^2,
$$

where $f$ has the functional form

$$
f(x) = \langle w, \phi(x) \rangle = \sum_{n=1}^N \alpha_n\, k(x_n, x),
$$

where the second equality is true by the Representer Theorem, and $\phi$ is known as a feature function such that $k(x, x') = \langle \phi(x), \phi(x') \rangle$.

This says that one can avoid training the kernel machine directly, which requires computing an $N \times N$ matrix comprised of kernel entries $k(x_n, x_m)$; instead, train in the feature space given by $\phi$. This feature space is infinite-dimensional, so we approximate it with one of finite dimension $R$:

$$
k(x, x') = \langle \phi(x), \phi(x') \rangle \approx z(x)^\top z(x'), \qquad z(x) \in \mathbb{R}^R.
$$

Thus, given the $z(x_n)$’s and the function $z$, the task reduces to performing linear regression over the $z(x_n)$’s. We form this finite-dimensional representation using random features: generate random weights $w_r$ whose distribution induces a functional form for the $z$’s and determines which kernel is being approximated.
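
To spell out why this is linear regression (a short derivation, assuming only the kernel-machine form and the approximation above): substituting $z$ for $\phi$ collapses the sum over data points into a single coefficient vector,

$$
f(x) = \sum_{n=1}^N \alpha_n\, k(x_n, x) \approx \sum_{n=1}^N \alpha_n\, z(x_n)^\top z(x) = \beta^\top z(x), \qquad \beta = \sum_{n=1}^N \alpha_n\, z(x_n) \in \mathbb{R}^R,
$$

so instead of $N$ coefficients $\alpha_n$ attached to kernel evaluations, there are $R$ coefficients $\beta$ of an ordinary linear model in $z(x)$.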

Random Fourier features.

  • Generate a random matrix $W \in \mathbb{R}^{R \times D}$, e.g., $W_{rd} \sim \text{Normal}(0, 1)$ for each entry.
  • Compute the feature matrix $Z \in \mathbb{R}^{N \times R}$, where entry $Z_{nr} = z_r(x_n)$ is the $r$-th random feature evaluated on the data point $x_n$,

    $$
    z_r(x_n) = \sqrt{\tfrac{2}{R}}\, \cos\big(w_r^\top x_n + b_r\big), \qquad b_r \sim \text{Uniform}(0, 2\pi),
    $$

    where $w_r$ is the $r$-th row of $W$. This implies

    $$
    Z Z^\top \approx K, \qquad K_{nm} = k(x_n, x_m) = \exp\!\big(-\tfrac{1}{2}\|x_n - x_m\|^2\big),
    $$

    the Gram matrix of the Gaussian RBF kernel.
  • Perform linear regression: regress $y$ onto $Z$, e.g., $\hat\beta = (Z^\top Z + \lambda I)^{-1} Z^\top y$ for ridge regression with penalty $\lambda$.

The result is an approximation to the classifier with the Gaussian RBF kernel.
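
For concreteness, here is a minimal NumPy sketch of the three steps above, under the assumption of a unit-bandwidth Gaussian RBF kernel $k(x, x') = \exp(-\tfrac{1}{2}\|x - x'\|^2)$ and a ridge (squared-error) objective; the toy data, the penalty `lam`, and the feature count `R` are arbitrary choices for illustration, not anything from the papers.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, R = 500, 5, 1000            # data points, covariates, random features
lam = 1e-3                        # ridge penalty (arbitrary choice)

# Toy data: a nonlinear target with noise.
X = rng.normal(size=(N, D))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

# 1. Random weights W (R x D) with W_rd ~ Normal(0, 1), plus random phases,
#    as in Rahimi and Recht (2007).
W = rng.normal(size=(R, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=R)

# 2. Feature matrix Z (N x R): Z_nr = sqrt(2/R) * cos(w_r . x_n + b_r).
Z = np.sqrt(2.0 / R) * np.cos(X @ W.T + b)

# 3. Linear (ridge) regression of y on Z.
beta = np.linalg.solve(Z.T @ Z + lam * np.eye(R), Z.T @ y)

# Approximate kernel-machine prediction: f(x) ~= z(x)' beta.
y_hat = Z @ beta
print("train MSE:", np.mean((y - y_hat) ** 2))

# Sanity check: Z Z' approximates the Gaussian RBF Gram matrix
# K_nm = exp(-||x_n - x_m||^2 / 2).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)
print("max |Z Z' - K|:", np.abs(Z @ Z.T - K).max())
```

As $R$ grows the last number shrinks, and the fit only ever touches the $N \times R$ matrix $Z$ rather than the $N \times N$ Gram matrix $K$, which is the whole point of the method.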

Are you as confused as I am about why this works? There’s an old 2009 video of Rahimi presenting the idea in more generality, and also a short overview of the paper in the context of previous literature available here.

To motivate this from a statistics and/or signal processing point of view, Guhaniyogi and Dunson (2013) have this really cool paper which uses the idea for compressed regression in Bayesian statistics.

A table of training time and storage, and test time and storage, for this procedure is found in Le et al. (2013; Table 1), which is itself a recent extension that makes the procedure even cheaper to store while still approximating the Gaussian RBF kernel to within a given accuracy. For a more direct connection to Gaussian processes, a related paper (and the most recent one on these ideas) is Yang et al. (2015).

References

  • Rajarshi Guhaniyogi and David B. Dunson. Bayesian Compressed Regression. arXiv preprint: arXiv:1303.0642, 2013.
  • Quoc Viet Le, Tamas Sarlos, and Alexander Johannes Smola. Fastfood: Approximate Kernel Expansions in Loglinear Time. In International Conference on Machine Learning, 2013.
  • Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Neural Information Processing Systems, 2007.
  • Ali Rahimi and Benjamin Recht. Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning. In Neural Information Processing Systems, pages 1–8, October 2009.
  • Zichao Yang, Alexander J. Smola, Le Song, and Andrew Gordon Wilson. A la Carte – Learning Fast Kernels. In Artificial Intelligence and Statistics, 2015.