The elastic net [3] provides a regularized objective function that strikes a compromise between the two extremes of Lasso [2] and ridge regression. It takes into account both the Bayesian properties of ridge regression and the need for sparse parameters via the $$\ell_1$$ penalty from Lasso.

Let $$\{(x_i,y_i):~i=1,\ldots,N\}$$ be our observed data set, where the responses $$y_i\in\mathbb{R}$$ and the predictors $$x_i\in\mathbb{R}^p$$. Simply put, elastic net regularization is defined by solving

$$\min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} R_\lambda(\beta_0, \beta) = \min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} \left[ \frac{1}{2N} \sum_{i=1}^N (y_i - \beta_0 - x_i^T\beta)^2 + \lambda P_\alpha(\beta) \right],$$

where the penalty function is defined as

$$P_\alpha(\beta) = (1-\alpha)\frac{1}{2}\|\beta\|_{\ell_2}^2 + \alpha\|\beta\|_{\ell_1}.$$
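As a sanity check on the notation, here is the objective written out directly in R (a minimal sketch; the function name is ours and not part of any package):

```r
# Elastic net objective R_lambda(beta0, beta) for response vector y and
# N x p predictor matrix x. alpha = 1 gives the Lasso penalty, alpha = 0 gives ridge.
elastic_net_objective <- function(beta0, beta, x, y, lambda, alpha) {
  N <- length(y)
  resid <- y - beta0 - drop(x %*% beta)
  loss <- sum(resid^2) / (2 * N)
  penalty <- (1 - alpha) * sum(beta^2) / 2 + alpha * sum(abs(beta))
  loss + lambda * penalty
}
```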

This was applied to generalized linear models (GLMs) by Friedman, Hastie, and Tibshirani, who also implemented it as an R package with optimized calculations [1]. In particular, one can take advantage of the many sparse entries underlying the computation: with the predictors stored in sparse column format, the effective dimension of the corresponding matrices is smaller, and the inner product operations sum only over the nonzero terms, reducing their runtime to $$O(pm)$$ ($$m$$ being the number of nonzero terms, $$p$$ the dimension of the predictors).
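For instance, fitting an elastic net with glmnet on a sparse predictor matrix looks roughly like this (a quick sketch with arbitrary data; we assume the Matrix and glmnet packages are installed):

```r
library(Matrix)
library(glmnet)

set.seed(42)
N <- 100; p <- 500
x <- rsparsematrix(N, p, density = 0.1)   # predictors stored in sparse column format
beta <- c(rep(1, 10), rep(0, p - 10))
y <- as.numeric(x %*% beta) + rnorm(N)

# alpha mixes the two penalties (alpha = 0.5 here); glmnet fits the whole lambda path.
fit <- glmnet(x, y, family = "gaussian", alpha = 0.5)
coef(fit, s = 0.1)                        # coefficients at lambda = 0.1
```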

In one example, Friedman et al. [1] show that this method of computing elastic nets is both more efficient and more accurate than LARS, another regression algorithm designed for high-dimensional problems. Statisticians tend to keep to their own methods for computation, so let's try something different: let's see what happens when, under the same example, we run standard stochastic gradient descent (SGD) and its implicit variant as a comparison.

We simulate data according to the example offered in the paper by Friedman et al. [1]: generate $$N$$ Gaussian observations with $$p$$ predictors, where each pair of predictors $$(X_j,~X_{j'})$$ has the same population correlation $$\rho$$. We range $$\rho$$ from $$0$$ to $$0.95$$ and generate outcomes

$$Y = \sum_{j=1}^p X_j\beta_j + k\cdot Z,$$

where $$\beta_j=(-1)^j\exp(-2(j-1)/20)$$, $$Z\sim N(0,1)$$, and $$k$$ is chosen so that the signal-to-noise ratio is $$3.0$$.
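A sketch of this simulation in R: one way to get equal pairwise correlation is the shared-factor construction $$X_j = Z_j + bW$$ with $$b^2=\rho/(1-\rho)$$ (this $$b$$ reappears below). The function name and the exact construction are our own choices.

```r
# Simulate N observations of p equicorrelated Gaussian predictors and the outcome Y.
simulate_data <- function(N, p, rho, snr = 3.0) {
  b <- sqrt(rho / (1 - rho))
  Z <- matrix(rnorm(N * p), N, p)
  W <- rnorm(N)                                   # shared factor inducing correlation rho
  X <- Z + b * W
  beta <- (-1)^(1:p) * exp(-2 * ((1:p) - 1) / 20)
  # Var(X beta) = beta' (I + b^2 * ones) beta; pick k so that Var(X beta) / k^2 = snr.
  signal_var <- sum(beta^2) + b^2 * sum(beta)^2
  k <- sqrt(signal_var / snr)
  y <- drop(X %*% beta) + k * rnorm(N)
  list(X = X, y = y, beta = beta)
}
```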

We use the R package glmnet to emulate the times displayed in Table 1 of the paper, with both the "naive" and "covariance" variants of the algorithm. As for SGD and implicit SGD, let $$b^2=\rho/(1-\rho)$$, so that the covariance matrix of the predictors can be written as $$\mathbb{E}[xx^T] = b^2U+I$$, where $$U$$ is the $$p\times p$$ matrix of ones. For standard SGD, we use the learning rate

$$\alpha_n = \frac{\alpha}{\frac{\alpha}{\gamma_0} + n},$$

where $$\gamma_0=1/\mathrm{trace}(A) = 1/((1+b^2)p)$$ for $$A=\mathbb{E}[xx^T]$$, and $$\alpha$$ is the minimal eigenvalue $$\lambda_0=1$$. (The latter holds because all eigenvalues of $$A$$ are $$1$$ except for one, which is $$\lambda=1+pb^2$$: writing $$(b^2U+I)x=\lambda x$$ as $$(1-\lambda)x = -b^2 S\cdot\mathbf{1}$$, where $$S=\sum_i x_i$$, shows that $$x$$ is either orthogonal to $$\mathbf{1}$$ with $$\lambda=1$$, or proportional to $$\mathbf{1}$$ with $$\lambda=1+pb^2$$.) As for implicit SGD, we'll just use $$\alpha_n=\alpha/n$$.
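Concretely, the two update rules we have in mind are the following (a minimal sketch, not necessarily the exact implementation behind the timings below; for squared loss the implicit update has a closed form, which we use here):

```r
# One pass of standard or implicit SGD for the linear model y ~ x' theta (no intercept).
# gamma0 = 1 / trace(E[x x']) and alpha is taken to be the smallest eigenvalue (1 here).
sgd_linear <- function(X, y, gamma0, alpha = 1, implicit = FALSE) {
  p <- ncol(X)
  theta <- rep(0, p)
  for (n in seq_len(nrow(X))) {
    xn <- X[n, ]; yn <- y[n]
    an <- if (implicit) alpha / n else alpha / (alpha / gamma0 + n)
    resid <- yn - sum(xn * theta)
    if (implicit) {
      # Implicit update theta_n = theta_{n-1} + a_n * x_n * (y_n - x_n' theta_n),
      # solved in closed form for squared loss:
      theta <- theta + (an / (1 + an * sum(xn^2))) * xn * resid
    } else {
      theta <- theta + an * xn * resid
    }
  }
  theta
}

# Example (using simulate_data from above):
# dat <- simulate_data(N = 1000, p = 100, rho = 0.5)
# b2 <- 0.5 / (1 - 0.5)
# theta_sgd <- sgd_linear(dat$X, dat$y, gamma0 = 1 / ((1 + b2) * 100))
```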

The following is run on a 2.6 GHz Intel Core i5 processor.
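Roughly, the glmnet timings below can be reproduced with calls of the following form (a sketch of the harness, not the exact one used; `type.gaussian` switches between the "covariance" and "naive" updates, and we leave the choice of $$\lambda$$ for the reported errors unspecified):

```r
library(glmnet)

# Elapsed time for one glmnet fit over its default lambda path.
time_glmnet <- function(X, y, variant = c("covariance", "naive")) {
  variant <- match.arg(variant)
  system.time(glmnet(X, y, family = "gaussian", type.gaussian = variant))["elapsed"]
}

# Example (using simulate_data from above):
# dat <- simulate_data(N = 1000, p = 100, rho = 0.5)
# time_glmnet(dat$X, dat$y, "covariance")
# time_glmnet(dat$X, dat$y, "naive")
```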

Runtime (s)

| dimensions | method | $$\rho=0$$ | $$\rho=0.10$$ | $$\rho=0.20$$ | $$\rho=0.50$$ | $$\rho=0.90$$ | $$\rho=0.95$$ |
|---|---|---|---|---|---|---|---|
| $$N=1000,~p=100$$ | covariance (glmnet) | $$0.011$$ | $$0.010$$ | $$0.010$$ | $$0.011$$ | $$0.015$$ | $$0.017$$ |
| | naive (glmnet) | $$0.032$$ | $$0.033$$ | $$0.037$$ | $$0.048$$ | $$0.134$$ | $$0.177$$ |
| | SGD | $$0.020$$ | $$0.021$$ | $$0.021$$ | $$0.023$$ | $$0.019$$ | $$0.020$$ |
| | implicit SGD | $$0.020$$ | $$0.020$$ | $$0.019$$ | $$0.019$$ | $$0.019$$ | $$0.018$$ |
| $$N=5000,~p=100$$ | covariance (glmnet) | $$0.035$$ | $$0.033$$ | $$0.034$$ | $$0.037$$ | $$0.039$$ | $$0.040$$ |
| | naive (glmnet) | $$0.119$$ | $$0.121$$ | $$0.139$$ | $$0.179$$ | $$0.490$$ | $$0.834$$ |
| | SGD | $$0.118$$ | $$0.124$$ | $$0.113$$ | $$0.112$$ | $$0.113$$ | $$0.112$$ |
| | implicit SGD | $$0.105$$ | $$0.104$$ | $$0.105$$ | $$0.104$$ | $$0.105$$ | $$0.104$$ |
| $$N=100,~p=1000$$ | covariance (glmnet) | $$0.041$$ | $$0.041$$ | $$0.045$$ | $$0.061$$ | $$0.098$$ | - |
| | naive (glmnet) | $$0.023$$ | $$0.020$$ | $$0.025$$ | $$0.032$$ | $$0.046$$ | - |
| | SGD | $$0.007$$ | $$0.007$$ | $$0.006$$ | $$0.008$$ | $$0.006$$ | - |
| | implicit SGD | $$0.007$$ | $$0.007$$ | $$0.006$$ | $$0.006$$ | $$0.006$$ | - |
| $$N=100,~p=5000$$ | covariance (glmnet) | $$0.204$$ | $$0.238$$ | $$0.243$$ | $$0.314$$ | $$0.517$$ | $$0.615$$ |
| | naive (glmnet) | $$0.075$$ | $$0.093$$ | $$0.071$$ | $$0.082$$ | $$0.098$$ | $$0.145$$ |
| | SGD | $$0.031$$ | $$0.031$$ | $$0.032$$ | $$0.031$$ | $$0.029$$ | $$0.033$$ |
| | implicit SGD | $$0.029$$ | $$0.028$$ | $$0.029$$ | $$0.028$$ | $$0.028$$ | $$0.030$$ |
| $$N=100,~p=20000$$ | covariance (glmnet) | $$0.923$$ | $$0.906$$ | $$0.955$$ | $$1.313$$ | $$2.082$$ | $$2.927$$ |
| | naive (glmnet) | $$0.381$$ | $$0.377$$ | $$0.376$$ | $$0.390$$ | $$0.405$$ | $$0.736$$ |
| | SGD | $$0.152$$ | $$0.147$$ | $$0.148$$ | $$0.156$$ | $$0.147$$ | $$0.153$$ |
| | implicit SGD | $$0.140$$ | $$0.170$$ | $$0.146$$ | $$0.143$$ | $$0.137$$ | $$0.145$$ |
| $$N=100,~p=50000$$ | covariance (glmnet) | $$2.280$$ | $$2.321$$ | $$2.322$$ | $$2.693$$ | $$4.901$$ | $$6.176$$ |
| | naive (glmnet) | $$0.745$$ | $$0.770$$ | $$0.753$$ | $$0.775$$ | $$0.996$$ | $$1.735$$ |
| | SGD | $$0.421$$ | $$0.392$$ | $$0.379$$ | $$0.383$$ | $$0.390$$ | $$0.393$$ |
| | implicit SGD | $$0.363$$ | $$0.370$$ | $$0.366$$ | $$0.374$$ | $$0.368$$ | $$0.370$$ |

Mean Squared Error

| dimensions | method | $$\rho=0$$ | $$\rho=0.10$$ | $$\rho=0.20$$ | $$\rho=0.50$$ | $$\rho=0.90$$ | $$\rho=0.95$$ |
|---|---|---|---|---|---|---|---|
| $$N=1000,~p=100$$ | covariance (glmnet) | $$0.045$$ | $$0.045$$ | $$0.043$$ | $$0.048$$ | $$0.054$$ | $$0.075$$ |
| | naive (glmnet) | $$0.045$$ | $$0.045$$ | $$0.043$$ | $$0.048$$ | $$0.054$$ | $$0.075$$ |
| | SGD | $$0.054$$ | $$0.057$$ | $$0.059$$ | $$0.062$$ | $$0.209$$ | $$0.228$$ |
| | implicit SGD | $$0.054$$ | $$0.057$$ | $$0.058$$ | $$0.057$$ | $$0.091$$ | $$0.114$$ |
| $$N=5000,~p=100$$ | covariance (glmnet) | $$0.026$$ | $$0.028$$ | $$0.028$$ | $$0.031$$ | $$0.051$$ | $$0.072$$ |
| | naive (glmnet) | $$0.026$$ | $$0.028$$ | $$0.028$$ | $$0.031$$ | $$0.053$$ | $$0.072$$ |
| | SGD | $$0.020$$ | $$0.020$$ | $$0.021$$ | $$0.022$$ | $$0.146$$ | $$0.206$$ |
| | implicit SGD | $$0.020$$ | $$0.020$$ | $$0.021$$ | $$0.022$$ | $$0.036$$ | $$0.046$$ |
| $$N=100,~p=1000$$ | covariance (glmnet) | $$0.063$$ | $$0.060$$ | $$0.061$$ | $$0.057$$ | $$0.066$$ | $$0.071$$ |
| | naive (glmnet) | $$0.063$$ | $$0.060$$ | $$0.061$$ | $$0.057$$ | $$0.066$$ | $$0.071$$ |
| | SGD | $$0.071$$ | $$0.072$$ | $$0.072$$ | $$0.073$$ | $$0.074$$ | $$0.074$$ |
| | implicit SGD | $$0.071$$ | $$0.071$$ | $$0.071$$ | $$0.072$$ | $$0.073$$ | $$0.073$$ |
| $$N=100,~p=5000$$ | covariance (glmnet) | $$0.030$$ | $$0.031$$ | $$0.031$$ | $$0.033$$ | $$0.031$$ | $$0.033$$ |
| | naive (glmnet) | $$0.030$$ | $$0.031$$ | $$0.031$$ | $$0.033$$ | $$0.031$$ | $$0.033$$ |
| | SGD | $$0.033$$ | $$0.032$$ | $$0.033$$ | $$0.033$$ | $$0.033$$ | $$0.033$$ |
| | implicit SGD | $$0.033$$ | $$0.032$$ | $$0.033$$ | $$0.033$$ | $$0.033$$ | $$0.033$$ |
| $$N=100,~p=20000$$ | covariance (glmnet) | $$0.016$$ | $$0.015$$ | $$0.015$$ | $$0.015$$ | $$0.016$$ | $$0.016$$ |
| | naive (glmnet) | $$0.016$$ | $$0.015$$ | $$0.015$$ | $$0.015$$ | $$0.016$$ | $$0.016$$ |
| | SGD | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ |
| | implicit SGD | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ |
| $$N=100,~p=50000$$ | covariance (glmnet) | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ |
| | naive (glmnet) | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ |
| | SGD | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ |
| | implicit SGD | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ |


Stochastic gradient descent is clearly faster for large $$p$$. In particular, it pulls far ahead in the $$p\geq 1000$$ cases, with competitive mean squared error whenever $$N$$ or $$p$$ is large ($$N\geq 1000$$ or $$p\geq 1000$$). Empirically speaking, it's quite safe to conclude that stochastic gradient descent is the more scalable algorithm. Moreover, SGD's error is largely insensitive to the correlation, so while elastic net is faster for small $$N$$ or $$p$$, we still recommend SGD in the case of highly correlated variables and small $$N$$ or $$p$$. The better estimate provided by SGD may be worth more than the slightly slower runtime.

We can also run the same simulation, varying $$N$$ and $$p$$, with averaged SGD (ASGD), which takes roughly the same time as standard SGD but gives an improved estimate. The results would be essentially the same, except that the cutoff at which ASGD gives a better estimate than elastic net would occur at a smaller $$N$$ than for standard SGD.
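For reference, iterate averaging is a one-line change to the SGD sketch above (again our own illustrative code, with the usual caveat that averaging is typically paired with a more slowly decaying learning rate):

```r
# Averaged SGD: run the standard updates but report the running average of the iterates.
asgd_linear <- function(X, y, gamma0, alpha = 1) {
  p <- ncol(X)
  theta <- rep(0, p)
  theta_bar <- rep(0, p)
  for (n in seq_len(nrow(X))) {
    xn <- X[n, ]; yn <- y[n]
    an <- alpha / (alpha / gamma0 + n)
    theta <- theta + an * xn * (yn - sum(xn * theta))
    theta_bar <- theta_bar + (theta - theta_bar) / n   # running average
  }
  theta_bar
}
```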

The takeaway message? SGD is awesome.

[1] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

[2] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B, 58:267–288, 1996.

[3] H. Zou and T. Hastie. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society B, 67(2):301–320, 2005.