On Model Mismatch and Bayesian Analysis

One aspect I always enjoy about machine learning is that questions often go back to the basics. The field essentially goes into an existential crisis every dozen years—rethinking our tools and asking foundational questions such as “why neural networks” or “why generative models”.¹

This was a theme in my conversations during NIPS 2016 last week, where a frequent topic was on the advantages of a Bayesian perspective to machine learning. Not surprisingly, this appeared as a big discussion point during the panel at the Bayesian deep learning workshop, where many panelists were conciliatory to the use of non-Bayesian approaches. (Granted, much of it was Neil trolling them to admit when non-Bayesian approaches worked better in practice.)

One argument against Bayesian analysis went as follows:

While Bayesian inference can capture uncertainty about parameters, it relies on the model being correctly specified. However, in practice, all models are wrong. And in fact, this model mismatch can be often be large enough that we should be more concerned with calibrating our inferences to correct for the mismatch than to produce uncertainty estimates from incorrect assumptions.

A related complaint was on the separation of model and inference, a philosophical point commonly associated with Bayesians:

While in principle it is nice that we can build models separate from our choice of inference, we often need to combine the two in practice. (The whole naming behind the popular model-inference classes of “variational auto-encoders” (Kingma & Welling, 2014) and “generative adversarial networks” (Goodfellow et al., 2014) are one example.) That is, we often choose our model based on what we know enables fast inferences, or we select hyperparameters in our model from data. This goes against the Bayesian paradigm.

First, I’d like to say immediately that I think interpreting Bayesian analysis as a two-step procedure of setting up a probability model, then performing posterior inference is outdated. Certainly this was the prevailing perspective back in the 80s’ and 90s’ when Markov chain Monte Carlo was first popularized, and when statisticians started to take Bayesian analysis more seriously (Robert & Casella, 2011).

Quoting Gelman & Shalizi (2012) who summarize this perspective, “The expression $p(\theta\mid y)$ says it all, and the central goal of Bayesian inference is computing the posterior probabilities of hypotheses. Anything not contained in the posterior distribution $p(\theta\mid y)$ is simply irrelevant, and it would be irrational (or incoherent) to attempt falsification, unless that somehow shows up in the posterior.”

Like many statisticians before me (e.g., Box (1980), Good (1983), Rubin (1984), Jaynes (2003)), I believe this perspective is wrong. Bayesian analysis is no different in its testing and falsification of models than any other inferential paradigm (Fisher (1925), Neyman & Pearson (1933)).

An important third step to all empirical analyses is model criticism (Box (1980), O’Hagan (2001) ), also known as model validation, or model checking and diagnostics (Rubin (1984), Meng (1994), Gelman, Meng, & Stern (1996)). In criticizing our models after inference, we can either justify use of the model or find directions in which we can revise the model. By revising the model, we go back to the modeling step, thus forming a loop, called Box’s loop (Box (1976), Blei (2014), Gelman et al. (2013)). ²

From my perspective, this solves the perceived problem of conflating model and inference, whether it be to address model mismatch or to build the model from previous inferences or data. That is, while posterior inference is simply a mechanical step of calculating a conditional distribution, the step of model criticism is about the relevance of the model to future data—to put it in statistical terms, the relevance of the model with respect to a population distribution (Wasserman, 2006). As with data, the model is just a source of information, and posterior inference simply aggregates these two sources of information. Thus it makes sense that as we better understand properties of the data, we can revise our information to better formulate a model of it (Tukey, 1977).

This might sound like an awkward way to shoehorn Bayesian analysis to mimick frequentist properties, or no different from combining model and inference from the get-go. However, this loop is fundamental because it still emphasizes the importance of separating the two. We can continue to form hypothetico-deductive analyses—namely, a falsificationist view of the world where components of model, inference, and criticism interact—while still incorporating posterior probabilities.

For more details, I highly recommend Gelman & Shalizi (2012) and of course the classic, Rubin (1984).

¹ I take an optimistic viewpoint to the trend of cycling among tools for machine learning. The trend is based on what works best empirically, and I think that’s important.

² As a plug, I should also mention that this is what Edward is all about.

References

Blei, D. M. (2014). Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application.
Box, G. E. P. (1980). Sampling and Bayes’ inference in scientific modelling and robustness. Journal of the Royal Statistical Society. Series A. General, 143(4), 383–430.
Box, G. E. P. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799.
Fisher, R. A. (1925). Statistical Methods for Research Workers. Genesis Publishing Pvt Ltd.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (Third). CRC Press, Boca Raton, FL.
Gelman, A., Meng, X.-L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica.
Gelman, A., & Shalizi, C. R. (2012). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1), 8–38.
Good, I. J. (1983). Good thinking: The foundations of probability and its applications. U of Minnesota Press.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative Adversarial Nets. In Neural Information Processing Systems.
Jaynes, E. T. (2003). Probability theory: The logic of science. Washington University St. Louis, MO.
Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. In International Conference on Learning Representations.
Meng, X.-L. (1994). Posterior predictive p-values. The Annals of Statistics.
Neyman, J., & Pearson, E. S. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society A Mathematical, Physical and Engineering Sciences, 231, 289–337.
O’Hagan, A. (2001). HSSS model criticism. University of Sheffield, Department of Probability and Statistics.
Robert, C., & Casella, G. (2011). A short history of Markov Chain Monte Carlo: subjective recollections from incomplete data. Statistical Science.
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4), 1151–1172.
Tukey, J. W. (1977). Exploratory data analysis.
Wasserman, L. (2006). Frequentist Bayes is objective (comment on articles by Berger and by Goldstein). Bayesian Analysis, 1(3), 451–456.