The elastic net [3] provides a regularized objective function that meets a compromise between the two extremes of Lasso [2] and ridge regression. It takes into account both the Bayesian properties of ridge regression and also the need for sparse parameters via the penalty from Lasso.

Let be our observed data set where the responses and the predictors . Simply put, elastic net regularization is defined by solving

where the penalty function is defined as

This was applied to generalized linear models (GLMs) by Tibshirani, and also encoded as an R package for optimized calculations [1]. In particular, one can take advantage of the many sparse entries underlying the computation, as the dimension of the corresponding matrices written in sparse column format is effectively smaller; this leads to reduced runtimes of as the inner product operations sum only over the nonzero terms ( being the number of nonzero terms, the dimension of the predictors).

In one example, Friedman et. al [1] effectively show that this method to compute elastic nets is more efficient and more accurate than LARS, another regression algorithm meant for high-dimensional problems. Statisticians tend to keep to their own methods for computation, so let’s try something different; let’s see what happens when under the same example, we run standard stochastic gradient descent (SGD) and its implicit variant as a comparison.

We simulate data according to the example offered in the paper by Friedman et. al [1]: generate Gaussian observations with predictors, where each predictor pair has the same population correlation . Range from to , and generate outcomes

where , , and is chosen so that the signal-to-noise ratio is .

We use the R package `glmnet`

in order to emulate the times displayed on Table 1 of the paper, with both the “naive” and “covariance” variants of the algorithm. As for SGD and implicit SGD, let , so then the covariance matrix can be rewritten as
Then for computing standard SGD, we shall use the learning rate

where and is the minimal eigenvalue . (The latter is true because all eigenvalues are 1 excluding one which is . One can prove this easily by iff , where .) As for implicit SGD, we’ll just use .

The following is run on a 2.6 Ghz i5 core processor.

## Runtime (s)

dimension space | method | $$\rho=0$$ | $$\rho=0.10$$ | $$\rho=0.20$$ | $$\rho=0.50$$ | $$\rho=0.90$$ | $$\rho=0.95$$ |
---|---|---|---|---|---|---|---|

$$N=1000,~p=100$$ | covariance (glmnet) | $$0.011$$ | $$0.010$$ | $$0.010$$ | $$0.011$$ | $$0.015$$ | $$0.017$$ |

naive (glmnet) | $$0.032$$ | $$0.033$$ | $$0.037$$ | $$0.048$$ | $$0.134$$ | $$0.177$$ | |

SGD | $$0.020$$ | $$0.021$$ | $$0.021$$ | $$0.023$$ | $$0.019$$ | $$0.020$$ | |

implicit SGD | $$0.020$$ | $$0.020$$ | $$0.019$$ | $$0.019$$ | $$0.019$$ | $$0.018$$ | |

$$N=5000,~p=100$$ | covariance (glmnet) | $$0.035$$ | $$0.033$$ | $$0.034$$ | $$0.037$$ | $$0.039$$ | $$0.040$$ |

naive (glmnet) | $$0.119$$ | $$0.121$$ | $$0.139$$ | $$0.179$$ | $$0.490$$ | $$0.834$$ | |

SGD | $$0.118$$ | $$0.124$$ | $$0.113$$ | $$0.112$$ | $$0.113$$ | $$0.112$$ | |

implicit SGD | $$0.105$$ | $$0.104$$ | $$0.105$$ | $$0.104$$ | $$0.105$$ | $$0.104$$ | |

$$N=100,~p=1000$$ | covariance (glmnet) | $$0.041$$ | $$0.041$$ | $$0.045$$ | $$0.061$$ | $$0.098$$ | - |

naive (glmnet) | $$0.023$$ | $$0.020$$ | $$0.025$$ | $$0.032$$ | $$0.046$$ | - | |

SGD | $$0.007$$ | $$0.007$$ | $$0.006$$ | $$0.008$$ | $$0.006$$ | - | |

implicit SGD | $$0.007$$ | $$0.007$$ | $$0.006$$ | $$0.006$$ | $$0.006$$ | - | |

$$N=100,~p=5000$$ | covariance (glmnet) | $$0.204$$ | $$0.238$$ | $$0.243$$ | $$0.314$$ | $$0.517$$ | $$0.615$$ |

naive (glmnet) | $$0.075$$ | $$0.093$$ | $$0.071$$ | $$0.082$$ | $$0.098$$ | $$0.145$$ | |

SGD | $$0.031$$ | $$0.031$$ | $$0.032$$ | $$0.031$$ | $$0.029$$ | $$0.033$$ | |

implicit SGD | $$0.029$$ | $$0.028$$ | $$0.029$$ | $$0.028$$ | $$0.028$$ | $$0.030$$ | |

$$N=100,~p=20000$$ | covariance (glmnet) | $$0.923$$ | $$0.906$$ | $$0.955$$ | $$1.313$$ | $$2.082$$ | $$2.927$$ |

naive (glmnet) | $$0.381$$ | $$0.377$$ | $$0.376$$ | $$0.390$$ | $$0.405$$ | $$0.736$$ | |

SGD | $$0.152$$ | $$0.147$$ | $$0.148$$ | $$0.156$$ | $$0.147$$ | $$0.153$$ | |

implicit SGD | $$0.140$$ | $$0.170$$ | $$0.146$$ | $$0.143$$ | $$0.137$$ | $$0.145$$ | |

$$N=100,~p=50000$$ | covariance (glmnet) | $$2.280$$ | $$2.321$$ | $$2.322$$ | $$2.693$$ | $$4.901$$ | $$6.176$$ |

naive (glmnet) | $$0.745$$ | $$0.770$$ | $$0.753$$ | $$0.775$$ | $$0.996$$ | $$1.735$$ | |

SGD | $$0.421$$ | $$0.392$$ | $$0.379$$ | $$0.383$$ | $$0.390$$ | $$0.393$$ | |

implicit SGD | $$0.363$$ | $$0.370$$ | $$0.366$$ | $$0.374$$ | $$0.368$$ | $$0.370$$ |

## Mean Squared Error

dimension space | method | $$\rho=0$$ | $$\rho=0.10$$ | $$\rho=0.20$$ | $$\rho=0.50$$ | $$\rho=0.90$$ | $$\rho=0.95$$ |
---|---|---|---|---|---|---|---|

$$N=1000,~p=100$$ | covariance (glmnet) | $$0.045$$ | $$0.045$$ | $$0.043$$ | $$0.048$$ | $$0.054$$ | $$0.075$$ |

naive (glmnet) | $$0.045$$ | $$0.045$$ | $$0.043$$ | $$0.048$$ | $$0.054$$ | $$0.075$$ | |

SGD | $$0.054$$ | $$0.057$$ | $$0.059$$ | $$0.062$$ | $$0.209$$ | $$0.228$$ | |

implicit SGD | $$0.054$$ | $$0.057$$ | $$0.058$$ | $$0.057$$ | $$0.091$$ | $$0.114$$ | |

$$N=5000,~p=100$$ | covariance (glmnet) | $$0.026$$ | $$0.028$$ | $$0.028$$ | $$0.031$$ | $$0.051$$ | $$0.072$$ |

naive (glmnet) | $$0.026$$ | $$0.028$$ | $$0.028$$ | $$0.031$$ | $$0.053$$ | $$0.072$$ | |

SGD | $$0.020$$ | $$0.020$$ | $$0.021$$ | $$0.022$$ | $$0.146$$ | $$0.206$$ | |

implicit SGD | $$0.020$$ | $$0.020$$ | $$0.021$$ | $$0.022$$ | $$0.036$$ | $$0.046$$ | |

$$N=100,~p=1000$$ | covariance (glmnet) | $$0.063$$ | $$0.060$$ | $$0.061$$ | $$0.057$$ | $$0.066$$ | $$0.071$$ |

naive (glmnet) | $$0.063$$ | $$0.060$$ | $$0.061$$ | $$0.057$$ | $$0.066$$ | $$0.071$$ | |

SGD | $$0.071$$ | $$0.072$$ | $$0.072$$ | $$0.073$$ | $$0.074$$ | $$0.074$$ | |

implicit SGD | $$0.071$$ | $$0.071$$ | $$0.071$$ | $$0.072$$ | $$0.073$$ | $$0.073$$ | |

$$N=100,~p=5000$$ | covariance (glmnet) | $$0.030$$ | $$0.031$$ | $$0.031$$ | $$0.033$$ | $$0.031$$ | $$0.033$$ |

naive (glmnet) | $$0.030$$ | $$0.031$$ | $$0.031$$ | $$0.033$$ | $$0.031$$ | $$0.033$$ | |

SGD | $$0.033$$ | $$0.032$$ | $$0.033$$ | $$0.033$$ | $$0.033$$ | $$0.033$$ | |

implicit SGD | $$0.033$$ | $$0.032$$ | $$0.033$$ | $$0.033$$ | $$0.033$$ | $$0.033$$ | |

$$N=100,~p=20000$$ | covariance (glmnet) | $$0.016$$ | $$0.015$$ | $$0.015$$ | $$0.015$$ | $$0.016$$ | $$0.016$$ |

naive (glmnet) | $$0.016$$ | $$0.015$$ | $$0.015$$ | $$0.015$$ | $$0.016$$ | $$0.016$$ | |

SGD | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ | |

implicit SGD | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ | $$0.016$$ | |

$$N=100,~p=50000$$ | covariance (glmnet) | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ |

naive (glmnet) | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | |

SGD | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | |

implicit SGD | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ | $$0.010$$ |

Stochastic gradient descent is obviously faster for large . In particular, it achieves much faster performance after the cases, and with competitive mean squared error if or is large ( or ). Empirically speaking, it’s quite safe to conclude that stochastic gradient descent is the better scalable algorithm. Moreover, SGD’s error is independent of the correlation, so while elastic net is faster for small or , we still recommend using SGD in the case of highly correlated variables and small or . The better estimate provided by SGD may be worth more than the slightly slower runtime.

We can also run the same simulation varying and with averaged SGD (ASGD), which takes roughly the same time as standard SGD but with an improved estimate. The results would essentially be the same, only that the cutoff for a better estimate for ASGD over elastic net would be at a smaller than standard SGD’s.

The takeaway message? SGD is awesome.

[1] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear
models via coordinate descent. *Journal of statistical software*, 33(1):1, 2010.

[2] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. *Journal of the Royal
Statistical Society B*, 58:267–288, 1996.

[3] H. Zou H, T. Hastie. Regularization and Variable Selection via the Elastic Net. *Journal
of the Royal Statistical Society B*, 67(2):301–320, 2005.