A number of exciting things happened at ICML this past week in Lille, France.

Deep learning remains the primary source of research and excitement at ICML, and questions about it percolated even into the Bayesian nonparametrics and approximate inference sessions. Much of the community seems to be paying more attention to introducing uncertainty into neural networks. (Deep) generative models are gaining headway now that approximate Bayesian inference algorithms, variational inference especially, are more tractable. The buzzwords now are variational autoencoders, probabilistic backpropagation, and deep latent variable models.

On the strictly probabilistic side, work continues on computational gains through subsampling, distributed implementations, and sparse GPs. There’s also been a lot of interesting work on merging various approximate inference algorithms to obtain a more unified framework.

ICML had a running theme of generalizability following Leon Bottou’s keynote talk, which discussed the limitations of machine learning and the general inability of current algorithms to learn from small data sets as easily as humans do. Transfer learning, zero-shot learning, and approaches from cognitive science received more exposure.

Another theme was context and learning actual concepts: why should a picture of a car on a road have a higher probability of being classified as a car than a picture of a car in a swimming pool? It seems that no matter how powerful our computer vision algorithms get, they still do not *grok* what a car is. The statistical answer is that it’s nigh impossible to learn a true “test” distribution when it differs from the distribution that generated the training data; cars in swimming pools are simply something our algorithms haven’t seen much of, and we should somehow be able to weight the learning more toward the tail of the distribution.
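
One standard way to "weight the learning" when training and test distributions differ is importance weighting, where each training example is weighted by the ratio of test density to training density. The sketch below is only an illustration of that idea, not something presented at the conference: the two Gaussian densities, the squared-value stand-in loss, and the helper names `importance_weights` and `weighted_loss` are all assumptions made for the example, and in practice the density ratio is unknown and has to be estimated.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weights(xs, train_pdf, test_pdf):
    """Self-normalized weights, proportional to p_test(x) / p_train(x)."""
    raw = [test_pdf(x) / train_pdf(x) for x in xs]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_loss(losses, weights):
    """Weighted average of per-example losses (weights sum to one)."""
    return sum(l * w for l, w in zip(losses, weights))


# Hypothetical setup: training inputs come from N(0, 1), while the
# "test" inputs we care about come from N(1.5, 1), i.e. the tail.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
weights = importance_weights(
    xs,
    train_pdf=lambda x: normal_pdf(x, 0.0, 1.0),
    test_pdf=lambda x: normal_pdf(x, 1.5, 1.0),
)

# Stand-in per-example loss that is larger in the tail of the training data.
losses = [x * x for x in xs]
uniform = sum(losses) / len(losses)         # ordinary training average
reweighted = weighted_loss(losses, weights)  # emphasizes tail examples
```

Here the reweighted average puts more mass on the rare examples far from the training mean, so it comes out larger than the uniform one. The catch is variance: the few tail examples carry most of the weight, which is one reason this remains hard.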

## Favorite papers

- Variational Inference with Normalizing Flows: Danilo Rezende and Shakir Mohamed demonstrate how to increase the complexity of the variational approximation. Their approach applies a sequence of nonlinear transformations to a simple distribution, and the length of that sequence (analogous to the number of “layers” in deep learning) characterizes the tradeoff between accuracy and scalability.
- Gradient-based Hyperparameter Optimization through Reversible Learning: Dougal Maclaurin, David Duvenaud, and Ryan Adams show how to backpropagate gradients through the training procedure to optimize hyperparameters. It essentially reduces to performing automatic differentiation well, and the experiments they try this on are really cool, e.g., optimizing the learning rate schedule per layer of a NN, optimizing training data(!), and optimizing the initialization of SGD. How this will apply to Bayesian optimization, for instance, remains to be seen.
- The Fundamental Incompatibility of Scalable Hamiltonian Monte Carlo and Naive Data Subsampling: Michael Betancourt explains that scaling HMC by naively subsampling data introduces bias, so the resulting MCMC algorithm is no longer a principled procedure. He derives the biases analytically and explains the effect this should have on the practical usage of subsampling in MCMC.
- Markov Chain Monte Carlo and Variational Inference: Bridging the Gap: Tim Salimans, Durk Kingma, and Max Welling demonstrate how to run MCMC within variational inference to get better approximations. It’s an interesting combination of two inference algorithms that sit on the same tradeoff between accuracy and scalability, and it’s still an open question how well one can quantify that tradeoff and thus design (or choose) the appropriate algorithm for one’s setting.
- A trust-region method for stochastic variational inference with applications to streaming data: Lucas Theis and Matt Hoffman show how to do variational inference more robustly. Robustness has seen very little focus despite the crucial flaws of mean-field approximations: their sensitivity to local optima and the general lack of stability of the original stochastic gradient descent updates. Proximal stochastic gradient methods are what the optimization community has been focusing on for this reason, and it’s nice to see that knowledge applied to variational inference as well.
- Markov mixed membership models: Aonan Zhang and John Paisley extend LDA so that the topics form a graph! They use Markov transitions to model the latent graph structure. It’s a simple idea that is certainly worth pursuing if we ever aim to do topic modelling in a large-scale setting, in which there is so much more latent topic structure to uncover. Bayesian nonparametric graphs are the future, but we haven’t seen many applications of them yet.
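
To make the normalizing flows idea from the list above concrete, here is a minimal sketch of one family of transformations from that paper, the planar flow f(z) = z + u·tanh(wᵀz + b). Each layer adjusts the log density by the log-determinant of its Jacobian, so stacking more layers buys a richer approximation at more cost. The parameter values, the function names (`planar_flow`, `flow_sample`), and the pure-Python list representation are illustrative assumptions; real implementations learn `u`, `w`, and `b` by optimizing the variational objective, and this sketch also assumes uᵀw ≥ −1 so that each layer is invertible.

```python
import math
import random


def planar_flow(z, u, w, b):
    """Apply f(z) = z + u * tanh(w.z + b); return (f(z), log|det Jacobian|)."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    z_new = [zi + ui * h for zi, ui in zip(z, u)]
    # For a planar flow, det J = 1 + (u . w) * (1 - tanh(a)^2).
    uw = sum(ui * wi for ui, wi in zip(u, w))
    log_det = math.log(abs(1.0 + uw * (1.0 - h * h)))
    return z_new, log_det


def standard_normal_logpdf(z):
    """Log density of an isotropic standard Gaussian at z."""
    return sum(-0.5 * zi * zi - 0.5 * math.log(2.0 * math.pi) for zi in z)


def flow_sample(layers, dim, rng):
    """Draw z_0 ~ N(0, I), push it through the layers, and return
    the final point together with its log density under the flow."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    log_q = standard_normal_logpdf(z)
    for u, w, b in layers:
        z, log_det = planar_flow(z, u, w, b)
        log_q -= log_det  # change of variables
    return z, log_q


# Two hand-picked (not learned) layers in two dimensions.
layers = [
    ([0.5, -0.3], [1.0, 0.4], 0.1),
    ([0.2, 0.6], [-0.5, 1.0], 0.0),
]
z_k, log_q = flow_sample(layers, dim=2, rng=random.Random(0))
```

The pair `(z_k, log_q)` is what a variational objective needs: a sample from the approximation and its log density, both obtained in time linear in the number of layers.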