The elastic net [3] provides a regularized objective function that strikes a compromise between the two extremes of Lasso [2] and ridge regression. It takes into account both the Bayesian properties of ridge regression and the need for sparse parameters via the $$\ell_1$$ penalty from Lasso.

Let $$\{(x_i,y_i):~i=1,\ldots,N\}$$ be our observed data set, where the responses $$y_i\in\mathbb{R}$$ and the predictors $$x_i\in\mathbb{R}^p$$. Simply put, elastic net regularization is defined by solving

$$\min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} R_\lambda(\beta_0, \beta) = \min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} \left[ \frac{1}{2N} \sum_{i=1}^N (y_i - \beta_0 - x_i^T\beta)^2 + \lambda P_\alpha(\beta) \right],$$

where the penalty function is defined as

$$P_\alpha(\beta) = (1-\alpha)\frac{1}{2}\|\beta\|_{\ell_2}^2 + \alpha\|\beta\|_{\ell_1}.$$
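As a sanity check on the notation, here is the objective written out directly in R (a minimal sketch; the function name is ours and not part of any package):

```r
# Elastic net objective R_lambda(beta0, beta) for response vector y and
# N x p predictor matrix x. alpha = 1 gives the Lasso penalty, alpha = 0 gives ridge.
elastic_net_objective <- function(beta0, beta, x, y, lambda, alpha) {
  N <- length(y)
  resid <- y - beta0 - drop(x %*% beta)
  loss <- sum(resid^2) / (2 * N)
  penalty <- (1 - alpha) * sum(beta^2) / 2 + alpha * sum(abs(beta))
  loss + lambda * penalty
}
```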

This was applied to generalized linear models (GLMs) by Friedman, Hastie, and Tibshirani, who also implemented it as an R package with optimized calculations [1]. In particular, one can take advantage of the many sparse entries underlying the computation: with the predictors stored in sparse column format, the effective dimension of the corresponding matrices is smaller, and the inner product operations sum only over the nonzero terms, reducing their runtime to $$O(pm)$$ ($$m$$ being the number of nonzero terms, $$p$$ the dimension of the predictors).
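For instance, fitting an elastic net with glmnet on a sparse predictor matrix looks roughly like this (a quick sketch with arbitrary data; we assume the Matrix and glmnet packages are installed):

```r
library(Matrix)
library(glmnet)

set.seed(42)
N <- 100; p <- 500
x <- rsparsematrix(N, p, density = 0.1)   # predictors stored in sparse column format
beta <- c(rep(1, 10), rep(0, p - 10))
y <- as.numeric(x %*% beta) + rnorm(N)

# alpha mixes the two penalties (alpha = 0.5 here); glmnet fits the whole lambda path.
fit <- glmnet(x, y, family = "gaussian", alpha = 0.5)
coef(fit, s = 0.1)                        # coefficients at lambda = 0.1
```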

In one example, Friedman et al. [1] show that this method of computing elastic nets is both more efficient and more accurate than LARS, another regression algorithm designed for high-dimensional problems. Statisticians tend to keep to their own methods for computation, so let's try something different: let's see what happens when, under the same example, we run standard stochastic gradient descent (SGD) and its implicit variant as a comparison.

We simulate data according to the example offered in the paper by Friedman et al. [1]: generate $$N$$ Gaussian observations with $$p$$ predictors, where each pair of predictors $$(X_j,~X_{j'})$$ has the same population correlation $$\rho$$. We range $$\rho$$ from $$0$$ to $$0.95$$ and generate outcomes

$$Y = \sum_{j=1}^p X_j\beta_j + k\cdot Z,$$

where $$\beta_j=(-1)^j\exp(-2(j-1)/20)$$, $$Z\sim N(0,1)$$, and $$k$$ is chosen so that the signal-to-noise ratio is $$3.0$$.
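A sketch of this simulation in R: one way to get equal pairwise correlation is the shared-factor construction $$X_j = Z_j + bW$$ with $$b^2=\rho/(1-\rho)$$ (this $$b$$ reappears below). The function name and the exact construction are our own choices.

```r
# Simulate N observations of p equicorrelated Gaussian predictors and the outcome Y.
simulate_data <- function(N, p, rho, snr = 3.0) {
  b <- sqrt(rho / (1 - rho))
  Z <- matrix(rnorm(N * p), N, p)
  W <- rnorm(N)                                   # shared factor inducing correlation rho
  X <- Z + b * W
  beta <- (-1)^(1:p) * exp(-2 * ((1:p) - 1) / 20)
  # Var(X beta) = beta' (I + b^2 * ones) beta; pick k so that Var(X beta) / k^2 = snr.
  signal_var <- sum(beta^2) + b^2 * sum(beta)^2
  k <- sqrt(signal_var / snr)
  y <- drop(X %*% beta) + k * rnorm(N)
  list(X = X, y = y, beta = beta)
}
```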

We use the R package glmnet to emulate the times displayed in Table 1 of the paper, with both the "naive" and "covariance" variants of the algorithm. As for SGD and implicit SGD, let $$b^2=\rho/(1-\rho)$$, so that the covariance matrix of the predictors can be written as $$\mathbb{E}[xx^T] = b^2U+I$$, where $$U$$ is the $$p\times p$$ matrix of ones. For standard SGD, we use the learning rate

$$\alpha_n = \frac{\alpha}{\frac{\alpha}{\gamma_0} + n},$$

where $$\gamma_0=1/\mathrm{trace}(A) = 1/((1+b^2)p)$$ for $$A=\mathbb{E}[xx^T]$$, and $$\alpha$$ is the minimal eigenvalue $$\lambda_0=1$$. (The latter holds because all eigenvalues of $$A$$ are $$1$$ except for one, which is $$\lambda=1+pb^2$$: writing $$(b^2U+I)x=\lambda x$$ as $$(1-\lambda)x = -b^2 S\cdot\mathbf{1}$$, where $$S=\sum_i x_i$$, shows that $$x$$ is either orthogonal to $$\mathbf{1}$$ with $$\lambda=1$$, or proportional to $$\mathbf{1}$$ with $$\lambda=1+pb^2$$.) As for implicit SGD, we'll just use $$\alpha_n=\alpha/n$$.
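Concretely, the two update rules we have in mind are the following (a minimal sketch, not necessarily the exact implementation behind the timings below; for squared loss the implicit update has a closed form, which we use here):

```r
# One pass of standard or implicit SGD for the linear model y ~ x' theta (no intercept).
# gamma0 = 1 / trace(E[x x']) and alpha is taken to be the smallest eigenvalue (1 here).
sgd_linear <- function(X, y, gamma0, alpha = 1, implicit = FALSE) {
  p <- ncol(X)
  theta <- rep(0, p)
  for (n in seq_len(nrow(X))) {
    xn <- X[n, ]; yn <- y[n]
    an <- if (implicit) alpha / n else alpha / (alpha / gamma0 + n)
    resid <- yn - sum(xn * theta)
    if (implicit) {
      # Implicit update theta_n = theta_{n-1} + a_n * x_n * (y_n - x_n' theta_n),
      # solved in closed form for squared loss:
      theta <- theta + (an / (1 + an * sum(xn^2))) * xn * resid
    } else {
      theta <- theta + an * xn * resid
    }
  }
  theta
}

# Example (using simulate_data from above):
# dat <- simulate_data(N = 1000, p = 100, rho = 0.5)
# b2 <- 0.5 / (1 - 0.5)
# theta_sgd <- sgd_linear(dat$X, dat$y, gamma0 = 1 / ((1 + b2) * 100))
```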

The following is run on a 2.6 GHz Intel Core i5 processor.
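Roughly, the glmnet timings below can be reproduced with calls of the following form (a sketch of the harness, not the exact one used; `type.gaussian` switches between the "covariance" and "naive" updates, and we leave the choice of $$\lambda$$ for the reported errors unspecified):

```r
library(glmnet)

# Elapsed time for one glmnet fit over its default lambda path.
time_glmnet <- function(X, y, variant = c("covariance", "naive")) {
  variant <- match.arg(variant)
  system.time(glmnet(X, y, family = "gaussian", type.gaussian = variant))["elapsed"]
}

# Example (using simulate_data from above):
# dat <- simulate_data(N = 1000, p = 100, rho = 0.5)
# time_glmnet(dat$X, dat$y, "covariance")
# time_glmnet(dat$X, dat$y, "naive")
```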

Runtime (s)

| dimensions | method | $$\rho=0$$ | $$\rho=0.10$$ | $$\rho=0.20$$ | $$\rho=0.50$$ | $$\rho=0.90$$ | $$\rho=0.95$$ |
|---|---|---|---|---|---|---|---|
| $$N=1000,~p=100$$ | covariance (glmnet) | $$0.011$$ | $$0.010$$ | $$0.010$$ | $$0.011$$ | $$0.015$$ | $$0.017$$ |
| | naive (glmnet) | $$0.032$$ | $$0.033$$ | $$0.037$$ | $$0.048$$ | $$0.134$$ | $$0.177$$ |
| | SGD | $$0.020$$ | $$0.021$$ | $$0.021$$ | $$0.023$$ | $$0.019$$ | $$0.020$$ |
| | implicit SGD | $$0.020$$ | $$0.020$$ | $$0.019$$ | $$0.019$$ | $$0.019$$ | $$0.018$$ |
| $$N=5000,~p=100$$ | covariance (glmnet) | $$0.035$$ | $$0.033$$ | $$0.034$$ | $$0.037$$ | $$0.039$$ | $$0.040$$ |
| | naive (glmnet) | $$0.119$$ | $$0.121$$ | $$0.139$$ | $$0.179$$ | $$0.490$$ | $$0.834$$ |
| | SGD | $$0.118$$ | $$0.124$$ | $$0.113$$ | $$0.112$$ | $$0.113$$ | $$0.112$$ |
| | implicit SGD | $$0.105$$ | $$0.104$$ | $$0.105$$ | $$0.104$$ | $$0.105$$ | $$0.104$$ |
| $$N=100,~p=1000$$ | covariance (glmnet) | $$0.041$$ | $$0.041$$ | $$0.045$$ | $$0.061$$ | $$0.098$$ | - |
| | naive (glmnet) | $$0.023$$ | $$0.020$$ | $$0.025$$ | $$0.032$$ | $$0.046$$ | - |
| | SGD | $$0.007$$ | $$0.007$$ | $$0.006$$ | $$0.008$$ | $$0.006$$ | - |
| | implicit SGD | $$0.007$$ | $$0.007$$ | $$0.006$$ | $$0.006$$ | $$0.006$$ | - |
| $$N=100,~p=5000$$ | covariance (glmnet) | $$0.204$$ | $$0.238$$ | $$0.243$$ | $$0.314$$ | $$0.517$$ | $$0.615$$ |
| | naive (glmnet) | $$0.075$$ | $$0.093$$ | $$0.071$$ | $$0.082$$ | $$0.098$$ | $$0.145$$ |
| | SGD | $$0.031$$ | $$0.031$$ | $$0.032$$ | $$0.031$$ | $$0.029$$ | $$0.033$$ |
| | implicit SGD | $$0.029$$ | $$0.028$$ | $$0.029$$ | $$0.028$$ | $$0.028$$ | $$0.030$$ |
| $$N=100,~p=20000$$ | covariance (glmnet) | $$0.923$$ | $$0.906$$ | $$0.955$$ | $$1.313$$ | $$2.082$$ | $$2.927$$ |
| | naive (glmnet) | $$0.381$$ | $$0.377$$ | $$0.376$$ | $$0.390$$ | $$0.405$$ | $$0.736$$ |
| | SGD | $$0.152$$ | $$0.147$$ | $$0.148$$ | $$0.156$$ | $$0.147$$ | $$0.153$$ |
| | implicit SGD | $$0.140$$ | $$0.170$$ | $$0.146$$ | $$0.143$$ | $$0.137$$ | $$0.145$$ |
| $$N=100,~p=50000$$ | covariance (glmnet) | $$2.280$$ | $$2.321$$ | $$2.322$$ | $$2.693$$ | $$4.901$$ | $$6.176$$ |
| | naive (glmnet) | $$0.745$$ | $$0.770$$ | $$0.753$$ | $$0.775$$ | $$0.996$$ | $$1.735$$ |
| | SGD | $$0.421$$ | $$0.392$$ | $$0.379$$ | $$0.383$$ | $$0.390$$ | $$0.393$$ |
| | implicit SGD | $$0.363$$ | $$0.370$$ | $$0.366$$ | $$0.374$$ | $$0.368$$ | $$0.370$$ |

Mean Squared Error

| dimensions | method | $$\rho=0$$ | $$\rho=0.10$$ | $$\rho=0.20$$ | $$\rho=0.50$$ | $$\rho=0.90$$ | $$\rho=0.95$$ |
|---|---|---|---|---|---|---|---|
| $$N=1000,~p=100$$ | covariance (glmnet) | $$0.045$$ | $$0.045$$ | $$0.043$$ | $$0.048$$ | $$0.054$$ | $$0.075$$ |
| | naive (glmnet) | $$0.045$$ | $$0.045$$ | $$0.043$$ | $$0.048$$ | $$0.054$$ | $$0.075$$ |
| | SGD | $$0.054$$ | $$0.057$$ | $$0.059$$ | $$0.062$$ | $$0.209$$ | $$0.228$$ |
| | implicit SGD | $$0.054$$ | $$0.057$$ | $$0.058$$ | $$0.057$$ | $$0.091$$ | $$0.114$$ |
| $$N=5000,~p=100$$ | covariance (glmnet) | $$0.026$$ | $$0.028$$ | $$0.028$$ | $$0.031$$ | $$0.051$$ | $$0.072$$ |
| | naive (glmnet) | $$0.026$$ | $$0.028$$ | $$0.028$$ | $$0.031$$ | $$0.053$$ | $$0.072$$ |
| | SGD | $$0.020$$ | $$0.020$$ | $$0.021$$ | $$0.022$$ | $$0.146$$ | $$0.206$$ |
| | implicit SGD | $$0.020$$ | $$0.020$$ | $$0.021$$ | $$0.022$$ | $$0.036$$ | $$0.046$$ |
| $$N=100,~p=1000$$ | covariance (glmnet) | $$0.063$$ | $$0.060$$ | $$0.061$$ | $$0.057$$ | $$0.066$$ | $$0.071$$ |
| | naive (glmnet) | $$0.063$$ | $$0.060$$ | $$0.061$$ | $$0.057$$ | $$0.066$$ | $$0.071$$ |
| | SGD | $$0.071$$ | $$0.072$$ | $$0.072$$ | $$0.073$$ | $$0.074$$ | $$0.074$$ |
| | implicit SGD | $$0.071$$ | $$0.071$$ | $$0.071$$ | $$0.072$$ | $$0.073$$ | $$0.073$$ |
| $$N=100,~p=5000$$ | covariance (glmnet) | $$0.030$$ | $$0.031$$ | $$0.031$$ | $$0.033$$ | $$0.031$$ | $$0.033$$ |
| | naive (glmnet) | $$0.030$$ | $$0.031$$ | $$0.031$$ | $$0.033$$ | $$0.031$$ | $$0.033$$ |
| | SGD | $$0.033$$ | $$0.032$$ | $$0.033$$ | $$0.033$$ | $$0.033$$ | $$0.033$$ |
| | implicit SGD | $$0.033$$ | $$0.032$$ | $$0.033$$ | $$0.033$$ | $$0.033$$ | $$0.033$$ |
| $$N=100,~p=20000$$ | covariance (glmnet) | $$0.016$$ | $$0.015$$ | $$0.015$$ | $$0.015$$ | $$0.016$$ | $$0.016$$ |
| | naive (glmnet) | $$0.016$$ | $$0.015$$ | $$0.015$$ | $$0.015$$ | $$0.016$$ | $$0.016$$ |
| | SGD | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ |
| | implicit SGD | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ |
| $$N=100,~p=50000$$ | covariance (glmnet) | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ |
| | naive (glmnet) | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ |
| | SGD | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ |
| | implicit SGD | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ |


Stochastic gradient descent is clearly faster for large $$p$$. In particular, it pulls far ahead in the $$p\geq 1000$$ cases, with competitive mean squared error whenever $$N$$ or $$p$$ is large ($$N\geq 1000$$ or $$p\geq 1000$$). Empirically speaking, it's quite safe to conclude that stochastic gradient descent is the more scalable algorithm. Moreover, SGD's error is largely insensitive to the correlation, so while elastic net is faster for small $$N$$ or $$p$$, we still recommend SGD in the case of highly correlated variables and small $$N$$ or $$p$$. The better estimate provided by SGD may be worth more than the slightly slower runtime.

We can also run the same simulation, varying $$N$$ and $$p$$, with averaged SGD (ASGD), which takes roughly the same time as standard SGD but gives an improved estimate. The results would be essentially the same, except that the cutoff at which ASGD gives a better estimate than elastic net would occur at a smaller $$N$$ than for standard SGD.
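For reference, iterate averaging is a one-line change to the SGD sketch above (again our own illustrative code, with the usual caveat that averaging is typically paired with a more slowly decaying learning rate):

```r
# Averaged SGD: run the standard updates but report the running average of the iterates.
asgd_linear <- function(X, y, gamma0, alpha = 1) {
  p <- ncol(X)
  theta <- rep(0, p)
  theta_bar <- rep(0, p)
  for (n in seq_len(nrow(X))) {
    xn <- X[n, ]; yn <- y[n]
    an <- alpha / (alpha / gamma0 + n)
    theta <- theta + an * xn * (yn - sum(xn * theta))
    theta_bar <- theta_bar + (theta - theta_bar) / n   # running average
  }
  theta_bar
}
```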

The takeaway message? SGD is awesome.

[1] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

[2] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B, 58:267–288, 1996.

[3] H. Zou and T. Hastie. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society B, 67(2):301–320, 2005.