Update (09/01/17): The post is written to be somewhat silly and numbers are not meant to be accurate. For example, there is a simplifying assumption that training time scales linearly with the # of bits to encode the output; and 5000 is chosen arbitrarily given only that the output’s range has 65K*3 dimensions and each takes one of 256 integers.
Discriminative models can take weeks to train. It was only until a breakthrough two months ago by Facebook (Goyal et al., 2017) that we could successfully train a neural net exceeding human accuracy (ResNet-50) on ImageNet in one hour. And this was with 256 GPUs and a monstrous batch size of 8192.
Contrast this with generative models. We’ve made progress in stability and sample diversity with generative adversarial networks, where, say, Wasserstein GANs with gradient penalty (Gulrajani, Ahmed, Arjovsky, Dumoulin, & Courville, 2017) and Cramer GANs (Bellemare et al., 2017) can get good results for generating LSUN bedrooms. But in communication with Ishaan Gulrajani, this took 3 days to train with 4 GPUs and 900,000 total iterations; moreover, LSUN has a resolution of 64x64 and is significantly less diverse than the 256x256 sized ImageNet. Let’s also not kid ourselves that we perfected density estimation to learn the true distribution of LSUN bedrooms yet.
Generative models for text are no different. The best results so far for the 1 billion language modeling benchmark are an LSTM with 151 million parameters (excluding embedding and softmax layers) which took 3 weeks to train with 32 GPUs (Jozefowicz, Vinyals, Schuster, Shazeer, & Wu, 2016) and a mixture of experts LSTM with 4.3 billion parameters (Shazeer et al., 2017).
This begs the question: how much compute should we expect in order to learn a generative model?
Suppose we restrict ourselves to 256x256 ImageNet as a proxy for natural images. A simple property in information theory says that the the entropy of the conditional is upper bounded by at most bits for classes. Comparing this to the entropy of the unconditional , whose number of bits is a function of pixels each of which take 3 values from , then a very modest guess would be that has 5000 times more bits. We also need to account for the difference in training methods. Let’s say that the method for generative models is only 6x slower than that of discriminative models (5 discriminative updates per generator update; we’ll forget the fact that GAN and MMD objectives are actually more expensive than maximum likelihood due to multiple forward and backward passes).
Finally, let’s take Facebook’s result as a baseline for learning in 1 hour with 256 GPUs and a batch size of 8192. Then the distribution would require 1 hour 5000 6 30,000 hours 3.4 years to train. And this is assuming we have the right objective, architecture, and hyperparameters to set it and forget it: until then, let’s hope for better hardware.
This short post is extracted from a fun conversation with Alec Radford today.
References
- Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., & Munos, R. (2017). The Cramer Distance as a Solution to Biased Wasserstein Gradients. ArXiv Preprint ArXiv:1705.10743.
- Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., … He, K. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. ArXiv Preprint ArXiv:1706.02677.
- Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved Training of Wasserstein GANs. ArXiv Preprint ArXiv:1704.00028.
- Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016). Exploring the Limits of Language Modeling. ArXiv Preprint ArXiv:1602.02410.
- Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In International Conference on Learning Representations.