I’ve been helping write tutorials that teach concepts such as black box variational inference in Edward. And as I’ve been editing, I’ve noticed that the majority of my suggestions are about the writing rather than the code. Communication is important, arguably more important in papers than the idea itself. Communication evokes different lines of thinking about how to approach problems, and it carries subtleties that depend on one’s culture and the set of problems one finds important. Here are a few subtleties that have come up recently and my stock answers to them.
-
“Bayesian models” framed as placing priors on parameters of a likelihood. This is a very frequentist-to-Bayesian line of thinking. I strongly believe models should simply be framed as a joint distribution p(x, z) for data x and latent variables z.1 The other line of thinking can lead to misunderstandings about Bayesian analysis. For example, a classical argument (and one I still hear among some of my statistician friends at Berkeley and Stanford) is that Bayesian analysis is subjective because the prior is subjective. But in fact the entire model is subjective, both likelihood and prior, as Andrew repeatedly states. Singling out one piece as the subjective part, apart from the fact that you’re specifying a model at all, is often meaningless. This relates to the old adage that “all models are wrong” (Box, 1976). Any model you work with is “subjective” in the sense of the assumptions you’re willing to make in it.
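To make this concrete, here is a minimal sketch of a model written as a single joint density, with the prior and likelihood appearing only as factors of that joint. It uses plain NumPy/SciPy rather than Edward, and the beta-Bernoulli model, the hyperparameters, and the function name are all illustrative choices of mine.

```python
# A minimal sketch: a beta-Bernoulli model specified as one joint density
# p(x, z) = p(x | z) p(z), rather than as a separate "prior" and "likelihood".
import numpy as np
from scipy import stats

def log_joint(x, z, alpha=1.0, beta=1.0):
    """log p(x, z) for binary data x and a latent probability z in (0, 1)."""
    log_prior = stats.beta.logpdf(z, alpha, beta)        # one factor of the joint
    log_likelihood = stats.bernoulli.logpmf(x, z).sum()  # the other factor
    return log_prior + log_likelihood                    # the model is the joint

x = np.array([1, 0, 1, 1])
print(log_joint(x, z=0.6))
```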
-
Bayesian versus frequentist models. The secret that no one wants to say publicly is that there’s no difference! A “latent variable model” or “hierarchical model” or “generative model” in the Bayesian literature is just a “random effects model” in the frequentist literature. I try to avoid attaching a statistical methodology to a model; they are all just “probabilistic models”. The statistical methodology, whether Bayesian, frequentist, fiducial, or whatever, is about how to reason with the model given data. That is, it is about “inference” and not the “model”. Why else do we do EM, a frequentist tool, on hierarchical models, typically labeled a Bayesian tool?
Frequentists may often use only the likelihood as the model, so it is trivially a joint distribution with a point mass distribution for z. But it’s useful to keep the joint distribution in mind, as it explicitly bakes in the modeling assumptions one makes.
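Concretely, one way to write this down (notation mine; z_0 stands for the fixed value of the parameters):

```latex
% A likelihood-only model is a degenerate joint: the "prior" is a point mass
% at a fixed value z_0, so specifying the likelihood specifies the joint.
p(x, z) = p(x \mid z) \, \delta_{z_0}(z)
```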
-
Latent variables vs parameters. These days I try to stick to calling z latent variables rather than parameters (and hence why I prefer to use z rather than θ). Parameters connote the idea of having only one setting, and the term brings up the whole frequentist-Bayesian debacle about whether parameters can be random. Calling them latent variables instead simply states that the model uses random variables that remain unobserved during inference. We may be interested in inferring the latent variables, and thus it is natural to think about what p(z | x) means, or what it means to approximate z with a point. Or we may not be directly interested in the latent variables, and thus it’s natural to integrate over them during inference.2 Somehow I find this less convincing when thinking about z as “parameters”. “Parameters” introduce a concept beyond random variables and their realizations, one which I can’t formally wrap my head around.3
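In symbols, the two uses of z correspond to the posterior and to marginalizing z out of the joint (a sketch in my notation):

```latex
% Infer the latent variables via the posterior ...
p(z \mid x) = \frac{p(x, z)}{p(x)}
% ... or integrate them out when they are not of direct interest.
p(x) = \int p(x, z) \, dz
```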
-
Prior belief vs prior information. I’ve started to follow Andrew in using the latter and avoiding the former. Prior belief makes Bayesian analysis sound like a religion. Prior information covers not only information about the latent variables but also information about the problem in general (e.g., the data’s likelihood).
1 For clarity, I’m ignoring the nuance with probabilistic programs, in the same way we typically don’t communicate everything using measures, or by first describing the category we’re working in.
2 Random effects (Eisenhart, 1947), incidental parameters (Neyman and Scott, 1948), nuisance parameters (Kalbfleisch and Sprott, 1970), and latent data (Rubin, 1976) are often explained in the frequentist literature as something to integrate over. But I don’t like these terms because they don’t explain what it means to treat the quantities as random variables.
3 In practice, we often work with “parameters” in the sense that we often work with point estimates during inference. This doesn’t mean we shouldn’t continue doing so; it’s just that, in principle, it’s better to think about the optimal thing (inferring the full distribution) and to view the point estimate as an approximation to it. That is, separate what we should do in principle from what we do in practice.
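To make this last point concrete, a small sketch continuing the beta-Bernoulli example above (plain SciPy; the data and hyperparameters are illustrative): the optimal thing is the full posterior p(z | x), and a MAP value is a one-number approximation to it.

```python
# Minimal sketch: the full posterior versus a point-estimate (MAP) approximation
# for the conjugate beta-Bernoulli model from the earlier sketch.
import numpy as np
from scipy import stats

x = np.array([1, 0, 1, 1])
alpha, beta = 1.0, 1.0

# Exact posterior p(z | x) is Beta(alpha + sum(x), beta + n - sum(x)).
post_a = alpha + x.sum()
post_b = beta + len(x) - x.sum()
posterior = stats.beta(post_a, post_b)

# A point estimate collapses that distribution to a single value (the MAP).
z_map = (post_a - 1.0) / (post_a + post_b - 2.0)

print("posterior mean:", posterior.mean())
print("MAP point estimate:", z_map)
```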