http://dustintran.com/blog/
Fri, 30 Sep 2016 11:55:29 -0400
NIPS 2016 Workshop on Approximate Inference
http://dustintran.com/blog/nips-2016-workshop-on-approximate-inference
<p>We’re organizing a NIPS workshop on approximate inference together with Tamara Broderick, Stephan Mandt, and James McInerney, alongside an incredible cast of seminal researchers: David Blei, Andrew Gelman, Mike Jordan, and Kevin Murphy. [<a href="http://approximateinference.org">Workshop homepage</a>]</p>
<p>This year, we set a theme based on what we believe are some of the most important challenges. In particular, there’s an emphasis on the practice of approximate inference, whether that means challenges that arise in applications or in software. Advances in both methodology and theory are of course crucial to this end goal; we highly encourage such work as well.</p>
<p><strong>Note</strong>: We have (quite a few!) travel awards. If you’re interested in applying, the travel
award and early application deadline is October 7.</p>
<p>Call for papers below.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>We invite researchers in machine learning and statistics to participate in the:
NIPS 2016 Workshop on Advances in Approximate Bayesian Inference
Friday 9 December 2016, Barcelona, Spain
www.approximateinference.org
Submission deadline: 1 November 2016
1. Call for Participation
We invite researchers to submit their recent work on the development, analysis, or application of approximate Bayesian inference. A submission should take the form of an extended abstract of 2-4 pages in PDF format using the NIPS style. Author names do not need to be anonymized and references may extend as far as needed beyond the 4 page upper limit. If authors' research has previously appeared in a journal, workshop, or conference (including the NIPS 2016 conference), their workshop submission should extend that previous work. Submissions may include a supplement/appendix, but reviewers are not responsible for reading any supplementary material.
Submissions will be accepted either as contributed talks or poster presentations. Extended abstracts should be submitted by 1 November; see website for submission details. Final versions of the extended abstract are due by 5 December, and will be posted on the workshop website.
2. Workshop Overview
Bayesian analysis has seen a resurgence in machine learning, expanding its scope beyond traditional applications. Increasingly complex models have been trained with large and streaming data sets, and they have been applied to a diverse range of domains. Key to this resurgence has been advances in approximate Bayesian inference. Variational and Monte Carlo methods are currently the mainstay techniques, where recent insights have improved their approximation quality, provided black box strategies for fitting many models, and enabled scalable computation.
In this year's workshop, we would like to continue the theme of approximate Bayesian inference with additional emphases. In particular, we encourage submissions not only advancing approximate inference but also regarding (1) unconventional inference techniques, with the aim to bring together diverse communities; (2) software tools for both the applied and methodological researcher; and (3) challenges in applications, both in non-traditional domains and when applying these techniques to advance current domains.
This workshop is a continuation of past years:
+ NIPS 2015 Workshop: Advances in Approximate Bayesian Inference
+ NIPS 2014 Workshop: Advances in Variational Inference
This workshop has been endorsed by the International Society for Bayesian Analysis (ISBA) and is supported by Disney Research.
3. Confirmed Speakers and Panelists
Invited speakers:
Barbara Engelhardt (Princeton University)
Surya Ganguli (Stanford University)
Jonathan Huggins (MIT)
Jeffrey Regier (UC Berkeley)
Matthew Johnson (Harvard University)
Panel: Software
TBA (Stan)
Noah Goodman (WebPPL; Stanford University)
Dustin Tran (Edward; Columbia University)
TBA (TensorFlow, BayesFlow; Google)
Michael Hughes (BNPy; Harvard University)
Panel: On the Foundations and Future of Approximate Inference
Ryan Adams (Harvard University, Twitter Cortex)
Barbara Engelhardt (Princeton University)
Philip Hennig (Max Planck Institute for Intelligent Systems)
Richard Turner (University of Cambridge)
Neil Lawrence (University of Sheffield)
4. Key Dates
Travel award application deadline: 7 October 2016
Early acceptance notification: 7 October 2016
Paper submission: 1 November 2016
Acceptance notification: 16 November 2016
Travel award notification: 16 November 2016
Final paper submission: 5 December 2016
Workshop organizers:
Tamara Broderick (MIT)
Stephan Mandt (Disney Research)
James McInerney (Columbia University)
Dustin Tran (Columbia University)
Advisory committee:
David Blei (Columbia University)
Andrew Gelman (Columbia University)
Michael Jordan (UC Berkeley)
Kevin Murphy (Google)
</code></pre>
</div>
Fri, 30 Sep 2016 00:00:00 -0400
Discussion of "Fast Approximate Inference for Arbitrarily Large Semiparametric Regression Models via Message Passing"
http://dustintran.com/blog/discussion-of-fast-approximate-inference
<p><em>This article is written with much help from David Blei. It is extracted from a discussion paper on “Fast Approximate Inference for Arbitrarily Large Semiparametric Regression Models via Message Passing”.</em> <a href="http://arxiv.org/abs/1609.05615">[link]</a></p>
<p>We commend <a href="#wand2016fast">Wand (2016)</a> for an excellent description of
message passing (<span style="font-variant:small-caps;">mp</span>) and for developing it to infer large semiparametric
regression models. We agree with the author in fully embracing the
modular nature of message passing, where one can define “fragments”
that enable us to compose localized algorithms. We believe this
perspective can aid in the development of new algorithms for automated
inference.</p>
<p><strong>Automated inference.</strong> The promise of automated algorithms is
that modeling and inference can be separated. A user can construct
large, complicated models in accordance with the assumptions he or she
is willing to make about their data. Then the user can use generic
inference algorithms as a computational backend in a “probabilistic
programming language,” i.e., a language for specifying generative
probability models.</p>
<p>With probabilistic programming, the user no longer has to write their
own algorithms, which may require tedious model-specific derivations
and implementations. In the same spirit, the user no longer has to
bottleneck their modeling choices in order to fit the requirements of
an existing model-specific algorithm. Automated inference enables
probabilistic programming systems, such as
Stan <a href="#carpenter2016stan">(Carpenter et al., 2016)</a>, through methods like
automatic differentiation variational inference (<span style="font-variant:small-caps;">advi</span>) <a href="#kucukelbir2016automatic">(Kucukelbir, Tran, Ranganath, Gelman, & Blei, 2016)</a> and
no U-turn sampler (<span style="font-variant:small-caps;">nuts</span>) <a href="#hoffman2014nuts">(Hoffman & Gelman, 2014)</a>.</p>
<p>Though they aim to apply to a large class of models, automated
inference algorithms typically need to incorporate modeling structure
in order to remain practical. For example, Stan assumes that one can
at least take gradients of a model’s joint density. (Contrast this
with other languages which assume one can only sample from the model.)
However, more structure is often necessary: <span style="font-variant:small-caps;">advi</span> and <span style="font-variant:small-caps;">nuts</span>
are not fast enough by themselves to infer very large models, such as
hierarchical models with many groups.</p>
<p>We believe <span style="font-variant:small-caps;">mp</span> and Wand’s work could offer fruitful avenues for
expanding the frontiers of automated inference. From our perspective,
a core principle underlying <span style="font-variant:small-caps;">mp</span> is to leverage structure when it
is available—in particular, statistical properties in the model—which provide useful computational benefits. In <span style="font-variant:small-caps;">mp</span>, two
examples are conditional independence and conditional conjugacy.</p>
<p><strong>From conditional independence to distributed computation.</strong>
As <a href="#wand2016fast">Wand (2016)</a> indicates, a crucial advantage of message
passing is that it modularizes inference; the computation can be
performed separately over conditionally independent posterior
factors. By definition, conditional independence separates a posterior
factor from the rest of the model, which enables <span style="font-variant:small-caps;">mp</span> to define a
series of iterative updates. These updates can be run asynchronously
and in a distributed environment.</p>
<center>
<img src="/blog/assets/2016-09-19-figure.png" style="width:200px;" />
</center>
<p><em>Figure 1.
A hierarchical model, with latent variables <script type="math/tex">\alpha_k</script> defined locally
per group and latent variables <script type="math/tex">\phi</script> defined globally to be shared across groups.</em></p>
<p>We are motivated by hierarchical models, which substantially benefit
from this property. Formally, let <script type="math/tex">y_{nk}</script> be the <script type="math/tex">n^{th}</script> data
point in group <script type="math/tex">k</script>, with a total of <script type="math/tex">N_k</script> data points in group <script type="math/tex">k</script> and
<script type="math/tex">K</script> many groups. We model the data using local latent variables
<script type="math/tex">\alpha_k</script> associated to a group <script type="math/tex">k</script>, and using global latent
variables <script type="math/tex">\phi</script> which are shared across groups. The model is depicted
in Figure 1.</p>
<p>The posterior distribution of local variables <script type="math/tex">\alpha_k</script> and global
variables <script type="math/tex">\phi</script> is</p>
<script type="math/tex; mode=display">p(\alpha,\phi\mid\mathbf{y}) \propto
p(\phi) \prod_{k=1}^K
\Big[ p(\alpha_k\mid \phi) \prod_{n=1}^{N_k} p(y_{nk}\mid\alpha_k,\phi) \Big].</script>
<p>The benefit of distributed updates over the independent factors is
immediate. For example, suppose the data consists of 1,000 data points
per group (with 5,000 groups); we model it with 2 latent variables per
group and 20 global latent variables. Passing messages, or
inferential updates, in parallel provides an attractive approach to
handling all 10,020 latent dimensions. (In contrast, consider a
sequential algorithm that requires taking 10,019 steps for all other
variables before repeating an update of the first.)</p>
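<p>The distributed computation described above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in rather than the paper's algorithm: the normal-normal conjugate mean update, the toy data, and the thread pool are made up for illustration only.</p>

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def local_update(y_k, phi):
    """Hypothetical message for group k's local variable alpha_k, given the
    current global estimate phi: a conjugate normal-normal mean update with
    unit variances, standing in for a real message passing fragment."""
    n_k = len(y_k)
    return (phi + y_k.sum()) / (1.0 + n_k)

# Toy data: 50 groups, each with 1,000 observations.
rng = np.random.default_rng(0)
data = [rng.normal(loc=2.0, size=1000) for _ in range(50)]
phi = 0.0

# Conditional independence lets every group's update run in parallel.
with ThreadPoolExecutor() as pool:
    alphas = list(pool.map(lambda y_k: local_update(y_k, phi), data))

# A (made-up) global update then aggregates the local messages.
phi = float(np.mean(alphas))
```

<p>Because each group's message depends only on its own data and the current global estimate, the map over groups can be parallelized or distributed freely.</p>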
<p>While this approach to leveraging conditional independence is
straightforward from the message passing perspective, it is not
necessarily immediate from other perspectives. For example, the
statistics literature has only recently come to similar ideas,
motivated by scaling up Markov chain Monte Carlo using divide and
conquer strategies <a href="#huang2005sampling">(Huang & Gelman, 2005; Wang & Dunson, 2013)</a>.
These methods first analyze data locally over a partition of the joint
density, and then aggregate the local inferences. In our work in
<a href="#gelman2014expectation">Gelman et al. (2014)</a>, we arrive at the continuation of this
idea. Like message passing, the process is iterated, so that local
information propagates to global information and global information
propagates to local information. In doing so, we obtain a scalable
approach to Monte Carlo inference, both from a top-down view which
deals with fitting statistical models to large data sets and from a
bottom-up view which deals with combining information across local
sources of data and models.</p>
<p><strong>From conditional conjugacy to exact iterative updates.</strong>
Another important element of message passing algorithms is conditional
conjugacy, which lets us easily calculate the exact distribution for a
posterior factor conditional on other latent variables. This enables
analytically tractable messages (cf. Equations (7)-(8) of
<a href="#wand2016fast">Wand (2016)</a>).</p>
<p>Consider the same hierarchical model discussed above, and set</p>
<script type="math/tex; mode=display">p(y_k,\alpha_k\mid \phi)
= h(y_k, \alpha_k) \exp\{\phi^\top t(y_k, \alpha_k) - a(\phi)\},</script>
<script type="math/tex; mode=display">p(\phi)
= h(\phi) \exp\{\eta^{(0)\top} t(\phi) - a(\eta^{(0)})\}
.</script>
<p>The local factor <script type="math/tex">p(y_k,\alpha_k\mid\phi)</script> has sufficient statistics
<script type="math/tex">t(y_k,\alpha_k)</script> and natural parameters given by the global latent
variable <script type="math/tex">\phi</script>. The global factor <script type="math/tex">p(\phi)</script> has sufficient
statistics <script type="math/tex">t(\phi) = (\phi, -a(\phi))</script> and fixed
hyperparameters <script type="math/tex">\eta^{(0)}</script>, which have two components: <script type="math/tex">\eta^{(0)} =
(\eta^{(0)}_1,\eta^{(0)}_2)</script>.</p>
<p>This exponential family structure implies that, conditionally, the
posterior factors are also in the same exponential families
as the prior factors <a href="#diaconis1979conjugate">(Diaconis & Ylvisaker, 1979)</a>,</p>
<script type="math/tex; mode=display">p(\phi\mid\mathbf{y},\alpha)
= h(\phi) \exp\{\eta(\mathbf{y},\alpha)^\top t(\phi) - a(\mathbf{y},\alpha)\},</script>
<script type="math/tex; mode=display">p(\alpha_k\mid y_k, \phi)
= h(\alpha_k) \exp\{\eta(y_k, \phi)^\top t(\alpha_k) - a(y_k, \phi)\}
.</script>
<p>The global factor’s natural parameter is <script type="math/tex">\eta(\mathbf{y},\alpha) =
(\eta^{(0)}_1 + \sum_{k=1}^K t(y_k, \alpha_k), \eta^{(0)}_2 + \sum_{k=1}^K N_k)</script>.</p>
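<p>As a concrete sketch of this update (with made-up scalar sufficient statistics; the function name and toy numbers are not from the paper), the global natural parameter is the prior's parameter plus the summed local statistics and counts:</p>

```python
import numpy as np

def global_natural_parameter(eta0_1, eta0_2, t_local, counts):
    """Conjugate update for the global factor: add each group's sufficient
    statistics t(y_k, alpha_k) and counts N_k to the prior's natural
    parameters (eta0_1, eta0_2)."""
    return eta0_1 + np.sum(t_local, axis=0), eta0_2 + np.sum(counts)

# Toy example with scalar sufficient statistics for K = 3 groups.
t_local = np.array([1.5, 2.0, 0.5])   # t(y_k, alpha_k) per group
counts = np.array([10, 20, 5])        # N_k per group
eta1, eta2 = global_natural_parameter(0.0, 1.0, t_local, counts)
# eta1 = 0.0 + 4.0; eta2 = 1.0 + 35
```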
<p>With this statistical property at play—namely that conjugacy gives
rise to tractable conditional posterior factors—we can derive
algorithms at a conditional level with exact iterative updates. This
is assumed for most of the message passing of semiparametric models in
<a href="#wand2016fast">Wand (2016)</a>. Importantly, this is not necessarily a
limitation of the algorithm. It is a testament to leveraging model
structure: without access to tractable conditional posteriors,
additional approximations must be made. <a href="#wand2016fast">Wand (2016)</a> provides
an elegant way to separate out these nonconjugate pieces from the
conjugate pieces.</p>
<p>In statistics, the most well-known example which leverages
conditionally conjugate factors is the Gibbs sampling algorithm. From
our own work, we apply the idea in order to access fast natural
gradients in variational inference, which accounts for the information
geometry of the parameter space <a href="#hoffman2013stochastic">(Hoffman, Blei, Wang, & Paisley, 2013)</a>. In
other work, we demonstrate a collection of methods for gradient-based
marginal optimization <a href="#tran2016gradient">(Tran, Gelman, & Vehtari, 2016)</a>. Assuming forms of
conjugacy in the model class leads to the classic ideas of
iteratively reweighted least squares and the EM algorithm. Such
structure in the model provides efficient algorithms—both
statistically and computationally—for their automated inference.</p>
<p><strong>Open Challenges and Future Directions.</strong> Message passing is a
classic algorithm in the computer science literature, which is ripe
with interesting ideas for statistical inference. In particular,
<span style="font-variant:small-caps;">mp</span> enables new advancements in the realm of automated inference,
where one can take advantage of statistical structure in the model.
<a href="#wand2016fast">Wand (2016)</a> makes great steps following this direction.</p>
<p>That said, important open challenges remain in realizing
this fusion.</p>
<p>The first concerns the design and implementation of probabilistic
programming languages. In order to implement <a href="#wand2016fast">Wand (2016)</a>’s
message passing, the language must provide ways of identifying local
structure in a probabilistic program. While that is enough to let
practitioners use <span style="font-variant:small-caps;">mp</span>, a much larger challenge is to
then automate the process of detecting local structure.</p>
<p>The second concerns the design and implementation of inference engines.
The inference must be extensible, so that users can not only employ
the algorithm in <a href="#wand2016fast">Wand (2016)</a> but easily build on top of it.
Further, its infrastructure must be able to encompass a variety of
algorithms, so that users can incorporate <span style="font-variant:small-caps;">mp</span> as one of many
tools in their toolbox.</p>
<p>Third, we think there are innovations to be made by taking the stance
of modularity to a further extreme. In principle, one can compose not
only localized message passing updates but localized inference
algorithms of any choice—whether it be exact inference, Monte Carlo,
or variational methods. This modularity will enable new
experimentation with inference hybrids and can bridge the gap among
inference methods.</p>
<p>Finally, while we discuss <span style="font-variant:small-caps;">mp</span> in the context of automation,
fully automatic algorithms are not possible. Associated with all
inference are statistical and computational
tradeoffs <a href="#jordan2013statistics">(Jordan, 2013)</a>. Thus we need algorithms along
the frontier, where a user can explicitly define a computational
budget and employ an algorithm achieving the best statistical
properties within that budget; or conversely, define desired
statistical properties and employ the fastest algorithm to achieve
them. We think ideas in <span style="font-variant:small-caps;">mp</span> will also help in developing some of
these algorithms.</p>
<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="carpenter2016stan">Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., … Riddell, A. (2016). Stan: A probabilistic programming language. <i>Journal of Statistical Software</i>.</span></li>
<li><span id="diaconis1979conjugate">Diaconis, P., & Ylvisaker, D. (1979). Conjugate Priors for Exponential Families. <i>The Annals of Statistics</i>, <i>7</i>(2), 269–281.</span></li>
<li><span id="gelman2014expectation">Gelman, A., Vehtari, A., Jylänki, P., Robert, C., Chopin, N., & Cunningham, J. P. (2014). Expectation propagation as a way of life. <i>ArXiv Preprint ArXiv:1412.4869</i>.</span></li>
<li><span id="hoffman2013stochastic">Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic Variational Inference. <i>Journal of Machine Learning Research</i>, <i>14</i>, 1303–1347.</span></li>
<li><span id="hoffman2014nuts">Hoffman, M. D., & Gelman, A. (2014). The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. <i>Journal of Machine Learning Research</i>, <i>15</i>, 1593–1623.</span></li>
<li><span id="huang2005sampling">Huang, Z., & Gelman, A. (2005). Sampling for Bayesian Computation with Large Datasets. <i>SSRN Electronic Journal</i>.</span></li>
<li><span id="jordan2013statistics">Jordan, M. I. (2013). On statistics, computation and scalability. <i>Bernoulli</i>, <i>19</i>(4), 1378–1390.</span></li>
<li><span id="kucukelbir2016automatic">Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2016). Automatic Differentiation Variational Inference. <i>ArXiv Preprint ArXiv:1603.00788</i>.</span></li>
<li><span id="tran2016gradient">Tran, D., Gelman, A., & Vehtari, A. (2016). Gradient-based marginal optimization. <i>Technical Report</i>.</span></li>
<li><span id="wand2016fast">Wand, M. P. (2016). Fast Approximate Inference for Arbitrarily Large Semiparametric Regression Models via Message Passing. <i>ArXiv Preprint ArXiv:1602.07412</i>.</span></li>
<li><span id="wang2013parallelizing">Wang, X., & Dunson, D. B. (2013). Parallelizing MCMC via Weierstrass sampler. <i>ArXiv Preprint ArXiv:1312.4605</i>.</span></li></ol>
Mon, 19 Sep 2016 00:00:00 -0400
Blog has migrated from Ghost to Jekyll
http://dustintran.com/blog/blog-has-migrated-from-ghost-to-jekyll
<p>In the past few days I spent time migrating the blog from
<a href="https://ghost.org">Ghost</a> to <a href="https://jekyllrb.com/">Jekyll</a>.</p>
<p>The theme builds off
<a href="https://rohanchandra.github.io/project/type/">Type Theme</a>, and is
heavily inspired by <a href="http://blog.otoro.net/">Otoro</a>, the <a href="http://www.nytimes.com/">New York
Times</a>, and the
<a href="http://the-rosenrot.com/">Rosenrot</a>. The annals-like frontpage takes
cue from Paul Graham’s <a href="http://paulgraham.com/articles.html">essays</a>
and Cosma Shalizi’s <a href="http://bactra.org/notebooks/">notebooks</a>.</p>
<p><img src="/blog/assets/2016-08-11-figure1.png" alt="" />
<img src="/blog/assets/2016-08-11-figure2.png" alt="" />
<em><center>Top: Old frontpage. Bottom: New frontpage.</center></em></p>
<p>Hooray for math! (using KaTeX)</p>
<script type="math/tex; mode=display">E(w_i,c_k,z) = -\sum_{j=1}^z w_{i,j} c_{k,j} + \Big[z\log a -\sum_{j=1}^z (w_{i,j}^2 + c_{k,j}^2)\Big].</script>
<p>Hooray for code snippets! (taken from <a href="http://edwardlib.org/getting-started">Edward</a>)</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">def</span> <span class="nf">set_seed</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="s2">"""Set seed for both NumPy and TensorFlow.
Parameters
----------
x : int, float
seed
"""</span>
<span class="n">np</span><span class="p">.</span><span class="nf">random</span><span class="p">.</span><span class="nf">seed</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">tf</span><span class="p">.</span><span class="nf">set_random_seed</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre>
</div>
<p>Hooray for bibliographies!
…
Okay I don’t quite have that one set up yet. But <a href="https://github.com/inukshuk/jekyll-scholar">it’s
doable</a>!</p>
<p>The blog’s permalinks are the same. The <a href="/blog/rss/">RSS feed</a> is the same.</p>
<h2 id="why-the-change">Why the change?</h2>
<p>I liked Ghost, having migrated from Wordpress with the impression of a
sleeker and more modern content management system (CMS). But CMSs have
always been a bother for me because I could never understand the
internals; I always had to rely on plugins in order to
accomplish things. Even with Ghost, some things were too automagical
and hidden from the user.</p>
<p>Time to go even more bare-bones with a static site generator.
<a href="https://github.com/dustinvtran/blog">All code and content for this blog is available</a>.</p>
<p>Here’s to hoping the redesign motivates me to write more often!</p>
Thu, 11 Aug 2016 00:00:00 -0400
Inference networks
http://dustintran.com/blog/tutorial-on-inference-networks-2
<p>I wrote a tutorial on <a href="http://edwardlib.org/tut_inference_networks">inference networks</a> on the Edward website. I normally don’t like writing outsourced blog posts, but I figure it’s better to link there than to write duplicated content that may later become outdated.</p>
<p>Thanks go to Kevin Murphy for motivating the tutorial as it is based on our discussions, and also related discussion with Jaan Altosaar.</p>
Thu, 21 Jul 2016 00:00:00 -0400
On Bayesian and frequentist, latent variables and parameters
http://dustintran.com/blog/on-bayesian-and-frequentist-latent-variables-and-parameters
<p>I’ve been helping write tutorials that teach concepts such as <a href="https://github.com/blei-lab/edward/pull/136">black box variational inference</a> in Edward. As I’ve been editing, I’ve noticed that the majority of my suggestions are about the writing rather than the code. Communication is important—arguably more important in papers than the idea itself. Communication evokes different lines of thinking about how to approach problems, and it always carries subtleties tied to one’s culture and the set of problems one finds important. Here are a few subtleties that have come up recently and my stock answers to them.</p>
<ol>
<li>
<p><strong>“Bayesian models” framed as placing priors on parameters of a likelihood.</strong> This is a very frequentist-to-Bayesian line of thinking. I strongly believe models should simply be framed as a joint distribution <script type="math/tex">p(x, z)</script> for data <script type="math/tex">x</script> and latent variables <script type="math/tex">z</script>.<sup>1</sup> The other line of thinking can lead to misunderstandings about Bayesian analysis. For example, a classical argument (and one I still hear among some of my statistician friends at Berkeley and Stanford) is that Bayesian analysis is subjective because the prior <script type="math/tex">p(z)</script> is subjective. But in fact the entire model is subjective—both likelihood and prior—as <a href="http://andrewgelman.com/2015/08/25/can-you-change-your-bayesian-prior/">Andrew repeatedly states</a>. Singling out the prior obscures the fact that you’re specifying a model in the first place, so the distinction is often meaningless. This relates to the old adage that “all models are wrong” (Box, 1976). Any model you work with is “subjective” according to the assumptions you’re willing to make in the model.</p>
</li>
<li>
<p><strong>Bayesian versus frequentist models.</strong> The secret that no one wants to say publicly is that there’s no difference! A “latent variable model” or “hierarchical model” or “generative model” in Bayesian literature is just a “random effects model” in frequentist literature. I try to avoid attaching specific statistical methodologies to models because there is no difference. They are all just “probabilistic models”. The statistical methodology—whether it be Bayesian, frequentist, fiducial, whatever—is about how to reason with the model given data. That is, they are about “inference” and not the “model”. Why else do we do EM, a frequentist tool, on hierarchical models, typically labeled a Bayesian tool?</p>
<p>Frequentists may often use only the likelihood <script type="math/tex">p(x \mid z)</script> as the model, so it is trivially a joint distribution <script type="math/tex">p(x, z)</script> with a point mass distribution for <script type="math/tex">z</script>. But it’s useful to keep in mind the joint distribution as it explicitly bakes in the modeling assumptions one makes.</p>
</li>
<li>
<p><strong>Latent variables vs parameters.</strong> These days I try to stick to calling <script type="math/tex">z</script> latent variables rather than parameters (and hence why I prefer to use <script type="math/tex">z</script> rather than <script type="math/tex">\theta</script>). Parameters connote the idea of having only one setting, and they bring up the whole frequentist-Bayesian debacle about whether parameters can be random. Calling them latent variables instead simply states that the model uses random variables that remain unobserved during inference. We may be interested in inferring the latent variables, and thus it is natural to think about what <script type="math/tex">p(z \mid x)</script> means, or what it means to approximate <script type="math/tex">p(z \mid x)</script> with a point. Or we may not be directly interested in the latent variables, and thus it’s natural to integrate over them during inference.<sup>2</sup> Somehow I find this less convincing when thinking about <script type="math/tex">z</script> as “parameters”. “Parameters” introduce a concept beyond random variables and their realizations, one I can’t formally wrap my head around.<sup>3</sup></p>
</li>
<li>
<p><strong>Prior belief vs prior information.</strong> I’ve started to follow Andrew in using the latter and avoiding the former. <a href="http://andrewgelman.com/2015/07/15/prior-information-not-prior-belief/">Prior belief makes Bayesian analysis sound like a religion</a>. Prior information comes not only in the context of prior information about the latent variables but prior information about the problem (e.g., data’s likelihood) in general.</p>
</li>
</ol>
<p><sup>1</sup> For clarity, I’m ignoring the nuance with probabilistic programs, in the same way we typically don’t communicate everything using measures, or by first describing the <a href="https://en.wikipedia.org/wiki/Category_theory">category</a> we’re working in.</p>
<p><sup>2</sup> Random effects (Eisenhart, 1947), incidental parameters (Neyman and Scott, 1948), nuisance parameters (Kalbfleisch and Sprott, 1970), and latent data (Rubin, 1976) are often explained in frequentist literature as something to integrate over. But I don’t like them because they don’t explain what it means to treat them as random variables.</p>
<p><sup>3</sup> In practice, we often work with “parameters” in the sense that we often work with point estimates during inference. This doesn’t mean that we shouldn’t continue working with them—it’s just that in principle it’s better to think about the optimal thing (inferring the full distribution) and viewing the point estimate as an approximation to the optimal thing. That is, what we should do in principle and what we should do in practice.</p>
Sun, 03 Jul 2016 00:00:00 -0400
Variational auto-encoders do not train complex generative models
http://dustintran.com/blog/variational-auto-encoders-do-not-train-complex-generative-models
<p>A <a href="http://arxiv.org/abs/1606.05908">tutorial on variational auto-encoders</a> popped up on my arXiv radar this week. Thanks to Carl Doersch for writing this tutorial! As a researcher in this area, I believe the field sorely needs an exposition of recent developments in variational inference. The application of variational inference to deep generative models has exploded in the past few years, and this tutorial is a great step in that direction. Inspired by it and by discussions with others during ICML, I’d like to make a few comments to clarify misunderstandings that are pervasive among newcomers to variational inference, Bayesian analysis, and/or auto-encoders.</p>
<p>Variational auto-encoders are <em>not</em> a way to train generative models. The generative model is <em>part</em> of the variational auto-encoder, which is typically a deep latent Gaussian model (Rezende et al., 2014). Training generative models is done by inference, typically variational inference (Hinton and Van Camp, 1993; Waterhouse et al., 1996) with an inference network. This model-inference combination is what defines a variational auto-encoder (Kingma and Welling, 2014) and more classically a Helmholtz machine (Dayan et al., 1995). Making precise the separate components of model and inference is crucial for extensibility. For example, how can we build on the inference, leveraging Monte Carlo methods (MacKay and Gibbs, 1999; Salimans et al., 2015), or build on the modeling, leveraging recurrent compositions (Gregor et al., 2015)? If I learned anything during my time at Harvard statistics with people like Don Rubin, it is that being precise about our language prevents our discourse from being ambiguous and our terms from becoming overloaded.</p>
<p>Generative models are usually motivated by their immediate application of generating data that look like the training data. Variational auto-encoders, like, say, generative adversarial networks, are most generally a way to both postulate and infer complex data generating processes. This can be for unsupervised, supervised, or semi-supervised tasks, or even causal inference. They are part of an underlying statistical methodology.<sup>1</sup> This methodology simply frames models as distributions over data and latent variables, allowing models to address a broad array of downstream tasks with any size of data. For example, deep generative models can achieve “state-of-the-art” for supervised learning on small amounts of data (Damianou and Lawrence, 2013) and generalization with few examples (Salakhutdinov et al., 2013).</p>
<p>The neural network used in the encoder (variational distribution) does not lead to any richer approximating distribution. It is a way to amortize inference such that the number of parameters does not grow with the size of the data (an incredible feat, but not one for expressivity!) (Stuhlmüller et al., 2013). This is better explained from the perspective of variational inference than from that of auto-encoders. For example, suppose we have a model with latent variables that are Gaussian a priori (as in VAEs). We may choose a fully factorized Gaussian as the variational distribution, where each latent variable is rendered independent. The inference network takes data as input and outputs the local variational parameters relevant to each data point. The optimal inference network outputs the set of Gaussian parameters which maximizes the variational objective. This means a variational auto-encoder with a perfect inference network can only do as well as a fully factorized Gaussian with no inference network.<sup>2</sup></p>
<p><sup>1</sup> The underlying statistical methodology of latent variable models is typically Bayesian. However, in practice the distinction between frequentist and Bayesian latent variable models is minimal; see, e.g., Harville (1977).</p>
<p><sup>2</sup> Much recent work, including my own, has tried to enrich this simple approximation in order to improve the inference. I invite you to look at the incredible developments of the past year: Rezende and Mohamed, 2015; Salimans et al., 2015; Tran et al., 2015; Tran et al., 2016; Burda et al., 2016; Ranganath et al., 2016; Maaløe et al., 2016; Mnih and Rezende, 2016; Louizos et al., 2016; Johnson et al., 2016; Dinh et al., 2016; Kingma et al., 2016.</p>
<h2 id="references">References</h2>
<ul>
<li>Burda, Y., Grosse, R., & Salakhutdinov, R. (2016). Importance Weighted Autoencoders. In <em>International Conference on Learning Representations.</em></li>
<li>Damianou, A. C., & Lawrence, N. D. (2013). Deep Gaussian Processes. In <em>Artificial Intelligence and Statistics.</em></li>
<li>Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz Machine. <em>Neural Computation</em>, 7(5), 889–904. http://doi.org/10.1162/neco.1995.7.5.889</li>
<li>Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using Real NVP. <em>arXiv.org.</em></li>
<li>Harville, D. A. (1977). Maximum likelihood approaches to variance component estimation and to related problems. <em>Journal of the American Statistical Association</em>, 72(358), 320–338.</li>
<li>Hinton, G. E., & Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In <em>Computational Learning Theory</em>, pp. 5–13. ACM.</li>
<li>Johnson, M. J., Duvenaud, D., Wiltschko, A. B., Datta, S. R., & Adams, R. P. (2016). Composing graphical models with neural networks for structured representations and fast inference. <em>arXiv.org</em>.</li>
<li>Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. In <em>International Conference on Learning Representations</em>.</li>
<li>Kingma, D. P., Salimans, T., & Welling, M. (2016). Improving Variational Inference with Inverse Autoregressive Flow. <em>arXiv.org</em>.</li>
<li>Louizos, C., & Welling, M. (2016). Structured and Efficient Variational Deep Learning with Matrix Gaussian Posteriors. In <em>International Conference on Machine Learning</em>.</li>
<li>Maaløe, L., Sønderby, C. K., Sønderby, S. K., & Winther, O. (2016). Auxiliary Deep Generative Models. In <em>International Conference on Machine Learning</em>.</li>
<li>MacKay, D. J., & Gibbs, M. N. (1999). Density networks. <em>Statistics and neural networks: advances at the interface</em>. Oxford University Press, Oxford, 129-144.</li>
<li>Mnih, A., & Rezende, D. J. (2016). Variational inference for Monte Carlo objectives. In <em>International Conference on Machine Learning</em>.</li>
<li>Ranganath, R., Tran, D., & Blei, D. M. (2016). Hierarchical Variational Models. In <em>International Conference on Machine Learning</em>.</li>
<li>Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In <em>International Conference on Machine Learning</em>.</li>
<li>Salakhutdinov, R., Tenenbaum, J. B., & Torralba, A. (2013). Learning with hierarchical-deep models. <em>IEEE Transactions on Pattern Analysis and Machine Intelligence</em>, 35(8), 1958–1971.</li>
<li>Salimans, T., Kingma, D. P., & Welling, M. (2015). Markov Chain Monte Carlo and Variational Inference: Bridging the Gap. In <em>International Conference on Machine Learning</em>.</li>
<li>Stuhlmüller, A., Taylor, J., & Goodman, N. (2013). Learning Stochastic Inverses. In <em>Neural Information Processing Systems</em>.</li>
<li>Tran, D., Blei, D. M., & Airoldi, E. M. (2015). Copula variational inference. In <em>Neural Information Processing Systems</em>.</li>
<li>Tran, D., Ranganath, R., & Blei, D. M. (2016). The Variational Gaussian Process. <em>International Conference on Learning Representations</em>.</li>
<li>Waterhouse, S., MacKay, D., & Robinson, T. (1996). Bayesian methods for mixtures of experts. In <em>Neural Information Processing Systems</em>.</li>
</ul>
Thu, 23 Jun 2016 00:00:00 -0400A quick update: Edward, and some motivations
http://dustintran.com/blog/a-quick-update-edward-and-some-motivations
http://dustintran.com/blog/a-quick-update-edward-and-some-motivations<p>As you may (or may not) know, I’ve been busy lately spearheading <a href="https://github.com/blei-lab/edward">Edward</a>, an open-source library for probabilistic modeling. It’s meant to help bridge the gap between what I view as two dichotomous approaches to statistical learning: one develops complex models in order to achieve the best results on a specific task; the other adheres to simple models in order to understand every component of the analysis, both empirically and theoretically. The former starts at the end goal; the latter starts at the foundation.</p>
<p>There are many names you can append to one of these approaches, and certainly those names imply contrasting motivations due to culture, and thus contrasting views and contrasting applications. But ultimately, the goal is still the same. The approaches are not orthogonal, but a lack of awareness connecting the two makes them seem to be. A neural network is a powerful approach for modeling non-linear functions (I mean this not tongue-in-cheek; it’s difficult to summarize many decades of innovation in a sentence). A Bayesian linear model is a powerful approach for incorporating parameter uncertainty during supervision, and for accessing a basis on which to validate our models.</p>
<p>How do we <a href="https://www.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf">assess model fit</a> from a complex neural network architecture, and <a href="http://arxiv.org/abs/1206.6471">generalize to any setting</a>, whether it be <a href="https://sites.google.com/site/dataefficientml/">small data</a>, <a href="https://sites.google.com/site/dlworkshop16/">simulation-based tasks</a>, or <a href="http://steinhardt.nyu.edu/priism/newsandevents/conferences">even causal inferences</a>? How do we build <a href="http://link.springer.com/book/10.1007/978-1-4612-0745-0">more expressive models</a> when <a href="http://authors.library.caltech.edu/13796/1/MACnc92d.pdf">our</a> <a href="http://www.stat.columbia.edu/~gelman/research/published/A6n41.pdf">tools</a> tell us the generalized linear model fits okay, but is not nearly as fine-grained as we’d like it to be, or let our inferences <a href="http://leon.bottou.org/publications/pdf/compstat-2010.pdf">scale</a> to data that no longer fits in memory? If we use the <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">many</a> <a href="http://www.gatsby.ucl.ac.uk/~dayan/papers/hm95.pdf">innovations</a> of deep learning in concert with statistical analysis, or conversely, the <a href="https://en.wikipedia.org/wiki/Statistical_Methods_for_Research_Workers">century of statistical foundations</a> in deep learning, we might just achieve something quite grand. (And I mean this one only a little tongue-in-cheek.)</p>
<p>Edward, broadly speaking, tries to combine efforts from both approaches. It’s a software library I’ve always wanted to develop but never had the right resources until now. Fast and distributed computation can be done using <a href="https://www.tensorflow.org">TensorFlow</a> as a rich symbolic framework. Neural networks can be easily constructed with high-level libraries like <a href="http://keras.io">Keras</a>. Flexible probability models can be specified using languages such as <a href="http://mc-stan.org">Stan</a>. And all of inference can be done using fast approximations via a built-in variational inference engine, with criticism techniques for both point prediction and distribution-based model assessments.</p>
<p>I gave an Edward talk a few days ago at Google Brain, which I quite positively butchered due to a lack of sleep and preparation. (If only I had known I would get a sizeable cast of the TensorFlow core developers, Geoff Hinton, Jeff Dean, and Kevin Murphy in one room!) But I think the work spoke for itself, and there was a lot of excitement about why Bayesian deep learning might be the right thing.</p>
<p>There are a number of necessary developments in Edward to even make small steps in connecting these two approaches. I finally got to refactoring the code for <a href="https://github.com/blei-lab/edward/blob/master/examples/convolutional_vae.py">variational auto-encoders</a>; and designing the <a href="https://github.com/blei-lab/edward/pull/107">criticism API</a> for posterior predictive checks; and getting <a href="http://cbonnett.github.io/MDN_EDWARD_KERAS_TF.html">much help from others</a> for explaining how to build neural network-based probabilistic models; and many open problems to <a href="http://dustintran.com/papers/TranRanganathBlei2016.pdf">try</a> <a href="http://arxiv.org/abs/1603.00788">to</a> <a href="http://arxiv.org/abs/1511.02386">solve</a>. It’s nowhere near getting statisticians, machine learners, and deep learners to agree with one another. But I think Edward is making progress.</p>
Mon, 30 May 2016 00:00:00 -0400Video resources for machine learning
http://dustintran.com/blog/video-resources-for-machine-learning
http://dustintran.com/blog/video-resources-for-machine-learning<p>One feature unique to our field is the sheer amount of online resources. It’s one of the reasons I personally got into machine learning over other disciplines, as self-learning is much more accessible.</p>
<p>This is especially true for video resources. Part of this is because as a field we are very well integrated with computer science, and thus more technologically inclined. (However, arguably no other computer science research domain has as many resources as us.) Whatever the reason, it’s important that we take advantage of all the information available in order to maximize learning.</p>
<p>I haven’t found a good collated list of machine learning videos, so I decided to compile my own. In general the following are excellent resources:</p>
<ul>
<li><a href="http://videolectures.net/">VideoLectures.NET</a> is the primary video archive for machine learning. Browse the homepage and you’ll see almost all machine learning videos featured on the frontpage, ranging from conferences, workshops, lectures, and even discussions.</li>
<li><a href="http://techtalks.tv/">TechTalks.tv</a> is the second most used
archive. Notably, many ICML videos are located here.</li>
<li><a href="http://youtube.com/">Youtube</a> contains a few, such as certain
AISTATS and ICLR years. It’s best for collecting lectures, such as by <a href="https://www.youtube.com/playlist?list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu">Nando de Freitas</a> and <a href="https://www.youtube.com/playlist?list=PLZSO_6-bSqHTTV7w9u7grTXBHMH-mw3qn">Alex Smola</a>.</li>
</ul>
<p>Here are all the video archives specific to conferences and workshops that I could find.<sup>1</sup> Over this winter I’ve been relaxing and marathoning these as if they were TV shows—it’s quite a lot of fun!</p>
<ul>
<li><a href="http://research.microsoft.com/apps/catalog/default.aspx?p=1&sb=no&ps=25&t=videos&sf=&s=&r=&vr=&ra=">2015: Neural Information Processing Systems</a></li>
<li><a href="https://www.youtube.com/playlist?list=PLhoHEZlJjdQJLRUSE9_55eXkwNXKTiNQF">2015: Gaussian Process Summer School, Sheffield</a></li>
<li><a href="http://videolectures.net/deeplearning2015_montreal/">2015: Deep Learning Summer School, Montreal</a></li>
<li><a href="https://www.youtube.com/channel/UCXDf7Y4KMcqPWHriorcMDNg/videos">2015: Uncertainty in Artificial Intelligence</a></li>
<li><a href="https://www.youtube.com/playlist?list=PLGJm1x3XQeK0NgFOX2Z7Wt-P5RU5Zv0Hv">2015: Arthur M. Sackler Colloquia: Drawing Causal Inference from Big Data</a></li>
<li><a href="https://www.youtube.com/playlist?list=PLhiWXaTdsWB8PnrVZquVyqlRFWXM4ijYz">2015: International Conference on Learning Representations</a></li>
<li><a href="http://videolectures.net/colt2015_paris/">2015: Conference on Learning Theory</a></li>
<li><a href="https://www.youtube.com/channel/UCT1k2e63pqm_VSXmaF21n6g/videos">2015: Machine Learning Summer School, Sydney</a></li>
<li><a href="http://www.fields.utoronto.ca/video-archive/event/316/2014">2014: UToronto Workshop on Big Data and Statistical Machine Learning</a></li>
<li><a href="https://nips.cc/Conferences/2014/Schedule">2014: Neural Information Processing Systems</a></li>
<li><a href="https://www.youtube.com/playlist?list=PLhoHEZlJjdQKgrpK70Ym04Ju3-9W-QXns">2014: Gaussian Process Summer School, Sheffield</a></li>
<li><a href="https://www.youtube.com/playlist?list=PLZSO_6-bSqHQCIYxE3ycGLXHMjK3XV7Iz">2014: Machine Learning Summer School, Pittsburgh</a></li>
<li><a href="http://techtalks.tv/icml2014/">2014: International Conference on Machine Learning</a></li>
<li><a href="http://videolectures.net/colt2014_barcelona/">2014: Conference on Learning Theory</a></li>
<li><a href="https://www.youtube.com/channel/UCQeS3L6d-S6ZeCQChFyK5Uw">2014: Artificial Intelligence and Statistics</a></li>
<li><a href="https://www.youtube.com/playlist?list=PLhoHEZlJjdQKI1cs5yPRUYdgcsE0HctoQ">2014: Gaussian Process Winter School, Sheffield</a></li>
<li><a href="https://www.youtube.com/channel/UC3ywjSv5OsDiDAnOP8C1NiQ">2014: Machine Learning Summer School, Iceland</a></li>
<li><a href="http://videolectures.net/nipsworkshops2013_laketahoe/">2013: Neural Information Processing Systems</a></li>
<li><a href="http://techtalks.tv/icml2013/">2013: International Conference on Machine Learning</a></li>
<li><a href="https://www.youtube.com/playlist?list=PLqJm7Rc5-EXFv6RXaPZzzlzo93Hl0v91E">2013: Machine Learning Summer School, Tuebingen</a></li>
<li><a href="http://videolectures.net/colt2013_princeton/">2013: Conference on Learning Theory</a></li>
<li><a href="https://www.youtube.com/playlist?list=PLLoKvRqQVbtI-ZhBpBrKT2KllP5dH8O0M">2013-14: Harvard Institute for Quantitative Social Science, Applied Statistics Workshop</a></li>
<li><a href="https://icerm.brown.edu/video_archive/">2012: Brown ICERM workshops</a> (the <a href="https://icerm.brown.edu/sp-f12-w1/">2012 Bayesian nonparametrics</a> is particularly good)</li>
<li><a href="http://videolectures.net/nips2012_laketahoe/">2012: Neural Information Processing Systems</a></li>
<li><a href="http://videolectures.net/uai2012_catalinaislands/">2012: Uncertainty in Artificial Intelligence</a></li>
<li><a href="http://techtalks.tv/search/results/?q=icml+2012">2012: International Conference on Machine Learning</a></li>
<li><a href="http://videolectures.net/aaai2012_toronto/">2012: AAAI Conference on Artificial Intelligence</a></li>
<li><a href="http://videolectures.net/nips2011_granada/">2011: Neural Information Processing Systems</a></li>
<li><a href="http://videolectures.net/uai2011_barcelona/">2011: Uncertainty in Artificial Intelligence</a></li>
<li><a href="http://videolectures.net/colt2011_budapest/">2011: Conference on Learning Theory</a></li>
<li><a href="http://videolectures.net/aistats2011_fortlauderdale/">2011: Artificial Intelligence and Statistics</a></li>
<li><a href="http://videolectures.net/nips2010_vancouver/">2010: Neural Information Processing Systems</a></li>
<li><a href="http://videolectures.net/icml2010_haifa/">2010: International Conference on Machine Learning</a></li>
<li><a href="http://videolectures.net/aistats2010_sardinia/">2010: Artificial Intelligence and Statistics</a></li>
<li><a href="http://videolectures.net/nips09_vancouver/">2009: Neural Information Processing Systems</a></li>
<li><a href="http://videolectures.net/mlss09uk_cambridge/">2009: Machine Learning Summer School, Cambridge</a></li>
<li><a href="http://videolectures.net/icml09_montreal/">2009: International Conference on Machine Learning</a></li>
<li><a href="http://videolectures.net/kdd08_las_vegas/">2008: Knowledge Discovery and Data Mining</a></li>
<li><a href="http://videolectures.net/uai08_helsinki/">2008: Uncertainty in Artificial Intelligence</a></li>
<li><a href="http://videolectures.net/icml08_helsinki/">2008: International Conference on Machine Learning</a></li>
<li><a href="http://videolectures.net/icml07_corvallis/">2007: International Conference on Machine Learning</a></li>
<li><a href="http://videolectures.net/nips06_whistler/">2006: Neural Information Processing Systems</a></li>
<li><a href="http://videolectures.net/icml05_bonn/">2005: International Conference on Machine Learning</a></li>
</ul>
<p>Hope this is of use. I wish I had time to go further and collect resources on specific ideas or techniques, such as David Blei’s 2009 lecture on <a href="http://videolectures.net/mlss09uk_blei_tm/">Topic models</a>, or Peter Orbanz’s 2009 lecture on <a href="http://videolectures.net/mlss09uk_orbanz_fnbm/">Foundations of Bayesian nonparametrics</a>. Perhaps someone other than me will do this in the future.</p>
<p><sup>1</sup> If you have links to others not listed here that you feel should be added, or if there is a broken link, please contact me at <a href="mailto:dustin@cs.columbia.edu">dustin@cs.columbia.edu</a>. Any help is appreciated and would also help the community at large. :-)</p>
Sun, 03 Jan 2016 00:00:00 -0500Stochastic Expectation Propagation
http://dustintran.com/blog/stochastic-expectation-propagation
http://dustintran.com/blog/stochastic-expectation-propagation<ul>
<li>Yingzhen Li, Jose Miguel Hernandez-Lobato, and Richard E. Turner. Stochastic Expectation Propagation. In <em><a href="http://papers.nips.cc/paper/5760-stochastic-expectation-propagation.pdf">Neural Information Processing Systems</a></em>, 2015.</li>
</ul>
<h2 id="summary">Summary</h2>
<p>Expectation propagation (EP) is a popular technique for approximate Bayesian inference, although it has arguably lost favor in recent years to variational inference. Whereas variational inference was made scalable to massive data sets through stochastic variational inference (Hoffman et al., 2013), and thus attractive to machine learning practitioners, expectation propagation is limited by a memory overhead which scales with the number of data points. This paper proposes an extension of EP which removes this memory requirement and thus enables it to scale to massive data.</p>
<p>Let <script type="math/tex">\mathbf{x}=\{\mathbf{x}_k\}</script> be a data set split into <script type="math/tex">K</script> components, and <script type="math/tex">p(\mathbf{x}\mid\mathbf{\theta})</script> be a model likelihood with prior <script type="math/tex">p(\mathbf{\theta})</script>. We aim to learn an approximating distribution <script type="math/tex">q(\mathbf{\theta})</script> such that</p>
<script type="math/tex; mode=display">p(\mathbf{\theta}\mid \mathbf{x})
\propto
p(\mathbf{\theta})\prod_{k=1}^K p(\mathbf{x}_k\mid\mathbf{\theta})
\approx
q(\mathbf{\theta})
\propto
p(\mathbf{\theta})\prod_{k=1}^K f_k(\mathbf{\theta}).</script>
<p>Following the way Zoubin Ghahramani motivates EP, we would like to minimize <script type="math/tex">\mathrm{KL}(p(\mathbf{\theta}\mid \mathbf{x})\|q(\mathbf{\theta}))</script> (I have no clue who first motivated EP this way—Zoubin’s slides were how I first learned it). This can be interpreted as a reverse analog of the variational objective: whereas the VI objective restricts the approximation to within the support of the posterior and is “mode-seeking” (Minka, 2005), this KL forces the approximation to cover <em>at least</em> the support of the posterior and is “inclusive” (Frey, 2000). Minimizing this KL could proceed by iteratively updating the factors <script type="math/tex">f_k(\mathbf{\theta})</script>. Denoting the leave-one-out posterior <script type="math/tex">p_{-k}(\mathbf{\theta})=p(\mathbf{\theta}\mid \mathbf{x})/p(\mathbf{x}_k\mid\mathbf{\theta})</script>, we iteratively minimize <script type="math/tex">\mathrm{KL}(p(\mathbf{\theta}\mid \mathbf{x})\|p_{-k}(\mathbf{\theta})f_k(\mathbf{\theta}))</script> for each <script type="math/tex">k=1,\ldots,K</script>.</p>
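<p>To see the mode-seeking/inclusive contrast concretely: when q is Gaussian, minimizing the inclusive KL(p||q) amounts to matching p’s first two moments. A minimal numpy check with an illustrative two-component mixture as p (numbers are mine, not from the paper):</p>

```python
import numpy as np

# p: a two-component Gaussian mixture (illustrative numbers).
weights = np.array([0.5, 0.5])
means = np.array([-2.0, 2.0])
variances = np.array([1.0, 1.0])

# The Gaussian q minimizing the inclusive KL(p || q) matches p's moments.
mean_q = weights @ means                              # mixture mean: 0.0
var_q = weights @ (variances + means**2) - mean_q**2  # mixture variance: 5.0

print(mean_q, var_q)
```

<p>The moment-matched fit N(0, 5) spreads its mass over both modes, whereas the exclusive, mode-seeking KL(q||p) would instead lock onto one component with variance near 1.</p>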
<p>However, <script type="math/tex">\mathrm{KL}(p\|q)</script> is intractable (and so is the iterative version) as it requires calculating the posterior density; therefore we minimize an approximation to it. Consider approximating the leave-one-out posterior with the <em>cavity distribution</em> <script type="math/tex">q_{-k}(\mathbf{\theta}) = q(\mathbf{\theta})/f_k(\mathbf{\theta})</script>, which is the approximating distribution without the <script type="math/tex">k^{th}</script> likelihood approximation. Define the <em>tilted distribution</em> <script type="math/tex">q_{\backslash k}(\mathbf{\theta})=q_{-k}(\mathbf{\theta}) p(\mathbf{x}_k\mid\mathbf{\theta})</script>, which is the corresponding cavity distribution with the true likelihood factor plugged in. EP iteratively solves</p>
<script type="math/tex; mode=display">\min_{f_k}\mathrm{KL}(q_{\backslash k}(\mathbf{\theta}) \| q_{-k}(\mathbf{\theta}) f_k(\mathbf{\theta})),\quad k=1,\ldots,K.</script>
<p>This objective is a good proxy to <script type="math/tex">\mathrm{KL}(p\|q)</script> if the true likelihood factor <script type="math/tex">p(\mathbf{x}_k\mid\mathbf{\theta})</script> contributes to the leave-one-out posterior in the same way as it does in the tilted distribution.</p>
<p>Calculating the cavity distribution for each <script type="math/tex">k</script> requires explicit storage of the approximating factors <script type="math/tex">f_k(\mathbf{\theta})</script>. This implies that EP has an <script type="math/tex">\mathcal{O}(K)</script> memory requirement which prevents it from scaling to large data sets. To solve this, the authors consider storage of only a single factor <script type="math/tex">f(\mathbf{\theta})</script>; rather than trying to capture the individual effect of the likelihood factors, the single factor captures only the average effect. Stochastic expectation propagation (SEP) uses the same iterative KL objective to learn the next approximating factor <script type="math/tex">f_k(\mathbf{\theta})</script>, and additionally updates <script type="math/tex">f(\mathbf{\theta})=f(\mathbf{\theta})^{1-1/K}f_k(\mathbf{\theta})^{1/K}</script>. By storing only the average effect of the likelihood factors, SEP approximates the true cavity distributions <script type="math/tex">q_{-k}(\mathbf{\theta})\propto q(\mathbf{\theta})/f_k(\mathbf{\theta})</script> with an averaged cavity distribution <script type="math/tex">q_{-1}(\mathbf{\theta})\propto q(\mathbf{\theta})/f(\mathbf{\theta})</script>.</p>
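<p>As a concrete sketch of the single-factor update, consider 1-D Gaussian factors: a geometric combination of factors is a convex combination of natural parameters (the numbers below are illustrative, not from the paper):</p>

```python
import numpy as np

K = 100  # number of likelihood sites

def to_natural(mu, var):
    """Natural parameters (mu/var, -1/(2 var)) of a 1-D Gaussian factor."""
    return np.array([mu / var, -0.5 / var])

f = to_natural(0.0, 10.0)   # current average-effect factor f(theta)
f_k = to_natural(1.2, 4.0)  # newly refined site approximation f_k(theta)

# SEP update f <- f^{1 - 1/K} * f_k^{1/K}: a geometric combination of
# factors, i.e. a convex combination in natural-parameter space.
f_new = (1 - 1.0 / K) * f + (1.0 / K) * f_k

# Averaged cavity q_{-1}(theta), proportional to q(theta) / f(theta):
# division of Gaussians subtracts natural parameters.
eta_q = to_natural(0.5, 2.0)
eta_cavity = eta_q - f_new
```

<p>For exponential families in general, the geometric damping is exactly this convex combination of natural parameters, which is what makes storing only the single average-effect factor possible.</p>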
<h2 id="discussion">Discussion</h2>
<p>I’m a big fan of the paper. The exposition on EP, assumed density filtering, and message passing is excellent, and it naturally motivates SEP following the technical developments of EP. The use of an averaged cavity distribution is an unfortunate compromise, but it seems necessary in order to scale local approximation techniques such as EP.</p>
<p>The authors make a cool connection to SVI. If the local minimizations in SEP are instead doing the reverse KL, then this is a form of “stochastic” variational message passing. Moreover, if the update for the single factor <script type="math/tex">f(\mathbf{\theta})</script> follows <script type="math/tex">f(\mathbf{\theta})^{1 - \epsilon}f_n(\mathbf{\theta})^{\epsilon}</script> where <script type="math/tex">\epsilon</script> uses a Robbins-Monro schedule, then this is the same as stochastic variational inference with a mean-field approximation. An interesting extension then is to apply the equivalent of an adaptive learning rate in stochastic approximations for the geometric scale factor when updating <script type="math/tex">f(\mathbf{\theta})</script>.</p>
<p>Hogwild (Niu et al., 2011) and asynchronous versions also naturally follow, which is cool. I also think it would have been useful for the paper to fight for the streaming data stance. That is, it’s trivial for stochastic EP to learn on streaming data whereas it is impossible for vanilla EP.</p>
<p>The move to <script type="math/tex">\mathrm{KL}(p\|q)</script> with SEP does require some compromises in contrast to SVI, unfortunately. For instance, unlike the move from VI to SVI, the move from EP to SEP compromises the approximation quality in order to achieve scalability. That is, SVI still performs the same global minimization of <script type="math/tex">\mathrm{KL}(q\|p)</script>, and shares nice properties through the theory of stochastic approximations (Robbins and Monro, 1951), whereas SEP does not preserve the same local minimization as EP. You can imagine scenarios where an averaged likelihood approximation will not work, nor even its “minibatch” M of K extension. Additionally, it’s not immediately clear how to extend SEP to approximations which preserve dependence between the local approximations.</p>
<p>Minor comment: as in much of Tom Minka’s work, they note that the local KL minimizations can more generally use <script type="math/tex">\alpha</script>-divergence measures. I still think this is a little restricting, as in general you just want <em>some</em> approximation; something like a Laplace approximation (Smola et al., 2004) is valid as well, and similarly, nested EP (Riihimaki et al., 2013) fits naturally under this motivation.</p>
<p>I’m excited to hear more about stochastic EP from their poster and spotlight at NIPS! (The same authors have a number of interesting extensions of this paper at a few workshops as well.)</p>
Sun, 06 Dec 2015 00:00:00 -0500At NIPS 2015
http://dustintran.com/blog/at-nips-2015
http://dustintran.com/blog/at-nips-2015<p>I’m quite excited for <a href="https://nips.cc">NIPS</a> this year, which starts this Monday.</p>
<p>It’s always interesting to look at the recent trends from browsing the schedule. Deep learning and reinforcement learning are of course a big highlight of NIPS, with <em>three</em> tutorials, a symposium, and many affiliated workshops. Approximate inference is becoming increasingly more mainstream as complex probabilistic models are fit with scalable inference strategies. Similarly, probabilistic programming—and the general concept of developing generic algorithms which fit broad classes of models—is becoming crucial to model prototyping. Gaussian processes are big just as they were in the last ICML: Bayesian optimization, Bayesian nonparametrics, and kernel methods are in general becoming more feasible and used in practice.</p>
<p>On a personal note, I’ll be presenting a poster on <a href="http://dustintran.com/papers/TranBleiAiroldi2015.pdf">Copula variational inference</a>, which is joint work with Dave Blei and Edo Airoldi, trying to tackle the general problem of learning posterior correlation in variational inference. I’m also helping organize the <a href="http://approximateinference.org">Advances in Approximate Bayesian Inference</a> workshop on Friday. We have an incredible list of talks lined up, and two panels which I’m sure will be insightful. On Saturday, I’ll be giving a talk at the <a href="http://www.blackboxworkshop.org">Black Box Learning and Inference</a> workshop. It covers my most recent research on developing a rich framework for variational inference (abstract below). The <a href="http://mc-stan.org/misc/nips">Stan team</a> also has a webpage on NIPS.</p>
<blockquote>
<p>Title: Variational models</p>
<p>Abstract:</p>
<p>Originally developed in the 1990s, variational inference has enjoyed renewed interest around scalable optimization for large datasets, black box strategies for inferring broad model classes, and neural networks for amortized inference. However, this research has been constrained by the lack of an expressive family of approximating distributions, which can generalize across many models. I will present the framework of variational models, which interpret the approximating family as a model of the posterior latent variables. This naturally leads to Bayesian constructs for developing expressive approximations. Following this, I will present a particular instantiation termed the variational Gaussian process. With applications to auto-encoders and deep recurrent attention models, I show that the variational model’s complexity grows efficiently and towards any distribution.</p>
<p>Joint work with Rajesh Ranganath and Dave Blei.</p>
</blockquote>
Fri, 04 Dec 2015 00:00:00 -0500Denoising Criterion for Variational Auto-Encoding Framework
http://dustintran.com/blog/denoising-criterion-for-variational-auto-encoding-framework
http://dustintran.com/blog/denoising-criterion-for-variational-auto-encoding-framework<ul>
<li>Daniel Jiwoong Im, Sungjin Ahn, Roland Memisevic, and Yoshua Bengio. Denoising Criterion for Variational Auto-Encoding Framework. <a href="http://arxiv.org/abs/1511.06406">arXiv preprint arXiv:1511.06406</a>, 2015.</li>
</ul>
<h2 id="summary">Summary</h2>
<p>Auto-encoders are used to learn representations of observed data <script type="math/tex">x</script>. They do so by minimizing the reconstruction error of a decoder <script type="math/tex">p(x\mid z)</script>, given a characterization of hidden variables <script type="math/tex">z</script> from an encoder <script type="math/tex">q(z\mid x)</script>. Along with a penalty <script type="math/tex">P_\alpha</script> for regularization, this can be written as maximizing the objective</p>
<script type="math/tex; mode=display">\mathcal{L}_{\text{ae}} = \mathbb{E}_{q(z\mid x)}[\log p(x\mid z)] - P_\alpha</script>
<p>over the encoder and decoder’s parameters. However, it’s unclear what a principled choice of the penalty is, as well as the choice of the encoder for a given decoder <script type="math/tex">p(x\mid z)</script>. Variational auto-encoders (Kingma and Welling, 2014; Rezende et al., 2014) reinterpret the evidence lower bound prominent in variational inference for this purpose:</p>
<script type="math/tex; mode=display">\mathcal{L}_{\text{vae}} = \mathbb{E}_{q(z\mid x)}[\log p(x\mid z)] - \mathrm{KL}(q(z\mid x)\| p(z)),</script>
<p>where <script type="math/tex">p(z)</script> is a model prior placed over the hidden variables of the encoder. The penalty term is automatically given as the KL divergence between the encoder and the prior, and the encoder can more generally be specified following variational inference literature. This is now a probabilistic encoder-decoder, with a principled objective given by a lower bound to the model evidence <script type="math/tex">\log p(x)</script>.</p>
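<p>As a concrete sketch of this probabilistic encoder-decoder objective, here is a Monte Carlo estimate of the ELBO for a single data point, assuming a Gaussian encoder, a standard normal prior, and an illustrative linear-Gaussian decoder (these specific choices are mine, not the paper’s):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo(x, mu, log_sigma, log_lik, n_samples=100):
    """Monte Carlo ELBO for a Gaussian encoder q(z|x) = N(mu, sigma^2) and a
    standard normal prior p(z); the KL penalty term is analytic."""
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=(n_samples, mu.size))
    z = mu + sigma * eps  # reparameterization: z = mu + sigma * eps
    recon = np.mean([log_lik(x, z_s) for z_s in z])
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * log_sigma)
    return recon - kl

def log_lik(x, z):
    """Illustrative decoder: log N(x; z, I)."""
    diff = x - z
    return -0.5 * diff @ diff - 0.5 * x.size * np.log(2 * np.pi)

val = elbo(np.array([0.1, -0.2]), np.zeros(2), np.zeros(2), log_lik)
```

<p>With mu = 0 and log_sigma = 0 the KL term vanishes, so the ELBO reduces to the expected reconstruction term; in an actual VAE the encoder outputs and the decoder are both learned.</p>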
<p>In this paper, Im et al. take ideas from denoising auto-encoders (Vincent et al., 2008) and apply it to variational auto-encoders. In particular, they consider a <em>corrupted distribution</em> <script type="math/tex">q(\tilde{x}\mid x)</script>, such that</p>
<script type="math/tex; mode=display">q(z\mid x) = \int q(z\mid \tilde{x})q(\tilde{x}\mid x)\mathrm{d}\tilde{x}.</script>
<p>With a distribution to model corrupted noise, the augmented <script type="math/tex">q(z\mid x)</script> is more expressive than the original <script type="math/tex">q(z\mid x)</script> without corruption: it is able to model additional uncertainty around the data inputs.</p>
<p>For inference, plugging this variational distribution into the ELBO leads to an analytically intractable objective, as the log-density <script type="math/tex">\log q(z\mid x)</script> cannot be computed. As a proxy, they propose maximizing the objective function</p>
<script type="math/tex; mode=display">\mathcal{L}_{\text{dvae}} = \mathbb{E}_{q(\tilde{x}\mid x)}\Big[\mathbb{E}_{q(z\mid x)}\Big[\log\frac{p(x, z)}{q(z\mid\tilde{x})}\Big]\Big],</script>
<p>which they show is in fact an upper bound to the original ELBO. They also claim it is still a lower bound to the model evidence <script type="math/tex">\log p(x)</script>, and thus still a valid variational objective (I’ll comment on this later). Gradients proceed as typically done in the literature, either using the reparameterized gradient for differentiable latent variables, or the score function estimator in the more general case. They have an interesting Q&A format for discussing their experiments that I found kind of cool.</p>
<h2 id="discussion">Discussion</h2>
<p>This paper is eerily similar to a paper Rajesh Ranganath, Dave Blei, and I uploaded just two weeks earlier than theirs, on <a href="http://arxiv.org/abs/1511.02386">Hierarchical Variational Models</a>. And that was uploaded only three weeks ago from today! Man, the field moves fast.</p>
<ul>
<li>The idea is the same: place a prior on inputs to the variational distribution—treating them as <em>variational</em> latent variables—and proceed to marginalize them out. The new inputs are now the inputs to the prior distribution. In their case, the prior is on the data input for inference networks, which output local variational parameters. In our case, the prior is on variational parameters more generally. A prior on data input is the special case when the variational model uses an inference network: following our notation, we replace <script type="math/tex">q(z;\lambda)</script> with the more powerful <script type="math/tex">q(z;\theta)=\int q(z\mid\lambda)q(\lambda;\theta)\mathrm{d}\lambda</script>, and use an inference network <script type="math/tex">\phi</script> to output prior hyperparameters <script type="math/tex">\theta=\phi(x)</script> (see end of Section 5).</li>
<li>The source of inspiration is different: they follow the standard procedure in denoising auto-encoders. We follow the Bayesian framework, viewing the variational distribution as a model of the posterior latent variables. These different sources of inspiration motivate different choices of complexity for the prior: they only consider Bernoulli or Gaussian, whereas from the Bayesian interpretation, it is natural to study more complex priors such as the normalizing flow (Rezende and Mohamed, 2015); we show the latter leads to more expressive approximations and also more scalable computation. (And in our more recent paper on the <a href="http://arxiv.org/abs/1511.06499">Variational Gaussian Process</a>, we use an even cooler prior :^).</li>
<li>The variational objectives are different, and I have concerns that their objective is not in fact a lower bound on <script type="math/tex">\log p(x)</script>.¹ This means the values of the objective function are no longer a proxy for the model evidence. Among a few things, this invalidates the experimental results, as they compare to methods based on reported variational bounds. Further, maximizing their objective can lead to poor approximations, as there is nothing stopping the objective from blowing up to arbitrarily large values at the global maximum. Actually, it surprises me that the maximized variational objective values in the experiments are still close to those in <script type="math/tex">\mathcal{L}_{\text{vae}}</script>. Would be very interested to hear from the authors on these issues.</li>
</ul>
<p>Regardless of whether the theorems and experiments are correct, the denoising variational auto-encoder is a much-needed idea. It provides great intuition for why one should use hierarchical instantiations of variational models. Moreover, it’s always interesting to see the same ideas inspired from a different field. These intuitions especially go nicely with our <a href="http://arxiv.org/abs/1511.06499">VGP paper</a>, in which we make more explicit the auto-encoder connection for learning variational models with latent variables. Maybe we’ll add these details in a future revision.</p>
<p>Minor point: They claim that injecting noise both at the data input level (“denoising auto-encoder”) and the stochastic hidden layer level (“variational auto-encoder”) can be advantageous. However, they’re really one and the same, which I’m sure they’re aware of as they wrote Lemma 1 in this light. That is, the corrupted hidden variable <script type="math/tex">\tilde{x}</script> is just another stochastic hidden layer, which is marginalized out. This makes the abstract and introduction a little confusing.</p>
<p>¹ In the proof of Lemma 1, when they try to show that <script type="math/tex">\log p(x)</script> upper bounds <script type="math/tex">\mathcal{L}_{\text{dvae}}</script>, there are three expectations expanded out in the first line. If the two outer expectations, over <script type="math/tex">q(z'\mid x)</script> and <script type="math/tex">q(z''\mid x)</script>, are written more explicitly, it is clear why the <script type="math/tex">q</script>’s do not necessarily cancel:</p>
<script type="math/tex; mode=display">\mathbb{E}_{q_{\psi}(z''\mid x)}
\mathbb{E}_{q_{\psi}(z'\mid x)}
\Big[
\log
\mathbb{E}_{q_{\varphi}(z\mid z'')}
\Big[
\frac{p_\theta(x,z)}{q_{\varphi}(z\mid z')}
\Big]
\Big]
\ne
\mathbb{E}_{q_{\psi}(z''\mid x)}
\mathbb{E}_{q_{\psi}(z'\mid x)}
[
\log
p_\theta(x)].</script>
<p>When I discussed this with Rajesh Ranganath, he pointed out a simple counterexample where <script type="math/tex">\mathcal{L}_{\text{dvae}}>\log p(x)</script>: consider a one-dimensional latent variable model with <script type="math/tex">p(x,z)=\mathrm{Bernoulli}(z;0.5)</script>. Let</p>
<script type="math/tex; mode=display">q(\tilde{x}\mid x) = \mathrm{Bernoulli}(\tilde{x}; 0.5),~
q(z=1\mid \tilde{x}=1) = 0.1,~
q(z=1\mid \tilde{x}=0) = 0.9.</script>
<p>Then <script type="math/tex">q(z\mid x)</script> is Bernoulli with probability <script type="math/tex">0.5\cdot0.1+0.5\cdot0.9=0.5</script>, and</p>
<script type="math/tex; mode=display">\mathcal{L}_{\text{dvae}} =
\mathbb{E}_{q(\tilde{x}\mid x)}\Big[\mathbb{E}_{q(z\mid x)}\Big[\log\frac{p(x, z)}{q(z\mid\tilde{x})}\Big]\Big]</script>
<script type="math/tex; mode=display">= \mathbb{E}_{q(\tilde{x}\mid x)}\Big[0.5\log\frac{0.5}{q(z=1\mid\tilde{x})}+0.5\log\frac{0.5}{q(z=0\mid\tilde{x})}\Big]</script>
<script type="math/tex; mode=display">= 0.5\Big[0.5\log\frac{0.5}{0.1}+0.5\log\frac{0.5}{0.9}\Big]
+
0.5\Big[0.5\log\frac{0.5}{0.9}+0.5\log\frac{0.5}{0.1}\Big]</script>
<script type="math/tex; mode=display">=0.510826...</script>
<p>But <script type="math/tex">\log p(x)=0</script>, implying <script type="math/tex">\mathcal{L}_{\text{dvae}}>\log p(x)</script>. In fact, taking 0.9 in the variational parameter to be arbitrarily close to 1 makes <script type="math/tex">\mathcal{L}_{\text{dvae}}</script> explode.</p>
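The counterexample is easy to check numerically; plugging the numbers above into <script type="math/tex">\mathcal{L}_{\text{dvae}}</script>:

```python
import numpy as np

# q(z=1 | x~=1) = 0.1, q(z=1 | x~=0) = 0.9;
# q(x~ | x) and the marginal q(z | x) are both Bernoulli(0.5).
q1 = {1: 0.1, 0: 0.9}  # q(z=1 | x~)
log = np.log

# L_dvae = E_{q(x~|x)}[ 0.5 log(0.5/q(z=1|x~)) + 0.5 log(0.5/q(z=0|x~)) ]
L_dvae = sum(
    0.5 * (0.5 * log(0.5 / q1[xt]) + 0.5 * log(0.5 / (1 - q1[xt])))
    for xt in (0, 1)
)
print(L_dvae)  # ~0.5108, exceeding log p(x) = 0
```

So the objective exceeds the model evidence, confirming it is not a valid lower bound here.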
<p>Tue, 01 Dec 2015 00:00:00 -0500</p>
<h1>Recurrent Gaussian Processes</h1>
<p><a href="http://dustintran.com/blog/recurrent-gaussian-processes">http://dustintran.com/blog/recurrent-gaussian-processes</a></p>
<ul>
<li>César Lincoln C. Mattos, Zhenwen Dai, Andreas Damianou, Jeremy Forth, Guilherme A. Barreto, and Neil D. Lawrence. Recurrent Gaussian Processes. <a href="http://arxiv.org/abs/1511.06644">arXiv preprint arXiv:1511.06644</a>, 2015.</li>
</ul>
<h2 id="summary">Summary</h2>
<p>This is another entry in the successful consolidation of Bayesian deep learning. In the recent Gaussian process literature, it has been something of a contentious issue whether the practical success of parametric neural networks can be matched by their infinite-dimensional counterpart (Neal, 1996). Gaussian processes have nice Bayesian nonparametric properties over parametric functions: they encode a distribution over predicted values (as well as latent representations), which models uncertainty and implicitly regularizes through Bayesian inference. As such, there is no need for many of the heuristics proposed to avoid overfitting, such as dropout (which is an approximate Bayesian technique anyway). Further, these deep architectures can learn salient representations of even the smallest data sets.</p>
<p>The paper is very much a spiritual successor to the feedforward architecture proposed in Damianou and Lawrence (2013) for composing deep Gaussian processes. Here, Mattos et al. propose a recurrent architecture for composing recurrent (deep) Gaussian processes. It’s a simple modelling change: replace the compositions of functions used in RNN’s with compositions of Gaussian processes. Not surprisingly, GPs for time series and state space models are related (Frigola et al., 2014; Damianou and Lawrence, 2015); the only difference is that the composition of Gaussian processes in the recurrent architecture leads to a latent autoregressive structure with far more expressive power. I view this difference similarly to the difference between GPLVMs (Lawrence, 2005) and deep GPs (Damianou and Lawrence, 2013).</p>
<p>Inference is non-trivial, and it requires somewhat involved calculations that apply tricks from variational inference on Bayesian GPLVMs (Titsias and Lawrence, 2010) over and over. They are able to set optimal variational distributions for certain latent variables such as the inducing inputs, and generally require a mean-field approximation. They also apply an inference network using an RNN, to avoid learning the massive number of local variational parameters: each layer-wise function of the RNN is the output of the corresponding layer in the composition of variational parameters.</p>
<h2 id="discussion">Discussion</h2>
<p>This is definitely one of my favorite ICLR submissions this year. I see the main contribution of this work as solving two issues: 1. maintaining internal memory representations of the data using a recurrent architecture for GPs, and 2. increasing the scalability of deep GPs.</p>
<p>Any foray into sequence modelling explains why point 1 is really cool. Point 2 is a problem in the original deep GP by Damianou and Lawrence (2013), as the number of variational parameters grows linearly with the size of the data. This makes deep GPs impractical on massive data sets in which neural networks are particularly advantageous. The authors solve this issue here using an RNN as the inference network; moreover, the particular layer-by-layer application of the RNN to output each set of local variational parameters is a pretty unique setup.</p>
<p>On modelling, I didn’t follow the intuition behind what the shared weights in RNNs correspond to in recurrent GPs (shared kernel hyperparameters?).</p>
<p>The model sizes studied in the experiments are rather small, using 1 or 2 hidden layers with at most 100 inducing points. Similar to the deep GP paper, it would be insightful to study varying layers of the recurrent GP and how this impacts performance. How well the recurrent GP maintains internal memory should also be explored. For instance, one can evaluate a standard benchmark on character-to-character sequences in text.</p>
<p>It would be very interesting to explore GRU and LSTM versions, as well as applications in which the recurrent GP employs attention. Perhaps the latter may also avoid the need for development on convolutional GPs.</p>
<p>Sun, 29 Nov 2015 21:55:43 -0500</p>
<h1>Adversarial autoencoders</h1>
<p><a href="http://dustintran.com/blog/adversarial-autoencoders">http://dustintran.com/blog/adversarial-autoencoders</a></p>
<ul>
<li>Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow. Adversarial Autoencoders. <a href="http://arxiv.org/abs/1511.05644">arXiv preprint arXiv:1511.05644</a>, 2015.</li>
</ul>
<h2 id="summary">Summary</h2>
<p>Following the trend in auto-encoders for generative modelling, Makhzani et al. propose an adversarial version using the framework of generative adversarial networks (Goodfellow et al., 2014). I have a bit of trouble following the terminology in these lines of work, so in my summary I will use different notation from the papers, following statistical conventions which I personally find more elucidating. And I’ll motivate it differently, coming from various divergence measure arguments.</p>
<p>I’ll first review generative adversarial networks (GANs). Suppose we observe data <script type="math/tex">x</script>. GANs employ a pair of models <script type="math/tex">(G,D)</script>, where <script type="math/tex">G</script> is the generative model and <script type="math/tex">D</script> is a discriminative model to help perform inference on <script type="math/tex">G</script>.</p>
<ul>
<li>The discriminative model <script type="math/tex">D</script> aims to output high probability for samples <script type="math/tex">x</script> which come from the empirical data distribution <script type="math/tex">p_{\text{data}}(x)</script>, and output low probability for samples <script type="math/tex">x</script> which come from the generative model <script type="math/tex">p_{\text{g}}(x)</script>.</li>
<li>The generative model <script type="math/tex">G</script> aims to fool the discriminative model, i.e., it consists of a prior <script type="math/tex">p(z)</script> and likelihood <script type="math/tex">p(x\mid z)</script>, and it aims to generate data most closely matching the observed data.</li>
</ul>
<p>The objective follows a minimax game</p>
<script type="math/tex; mode=display">\min_G
\max_D\,
\mathbb{E}_{p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{p_{\text{g}}(x)}[\log (1- D(x))],</script>
<p>which alternates between learning the discriminator <script type="math/tex">D</script> and learning the generator <script type="math/tex">G</script>. The intuition is that finding the optimal <script type="math/tex">p_{\text{g}}(x)</script> among a class of generative models <script type="math/tex">G</script> corresponds to minimizing the Jensen-Shannon divergence</p>
<script type="math/tex; mode=display">JSD(p_{\text{data}}(x) \| p_{\text{g}}(x)) = \frac{1}{2}\mathrm{KL}\Big(p_{\text{data}} \Big\| \frac{p_{\text{data}} + p_{\text{g}}}{2}\Big) +
\frac{1}{2}\mathrm{KL}\Big(p_{\text{g}} \Big\| \frac{p_{\text{data}} + p_{\text{g}}}{2}\Big).</script>
<p>Minimizing <script type="math/tex">JSD</script> directly is not possible; this is where the objective above, using an additional discriminative model, comes in handy. Recall also that maximum likelihood estimation minimizes <script type="math/tex">\mathrm{KL}(p_{\text{data}} \| p_{\text{g}})</script> (White, 1982). Using JSD leads to a more balanced approximation, neither underfitting nor overfitting modes. It’s also worth noting that unlike approximate posterior inference, these are divergence measures directly on data distributions for <script type="math/tex">x</script>, not on latent variable distributions for <script type="math/tex">z</script>. For example, variational inference minimizes <script type="math/tex">\mathrm{KL}(q(z) \| p(z\mid x))</script>.</p>
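To make the JSD connection concrete, here's a small numpy check (with an arbitrary pair of discrete distributions of my own choosing): with the optimal discriminator <script type="math/tex">D^*(x) = p_{\text{data}}(x)/(p_{\text{data}}(x)+p_{\text{g}}(x))</script> from Goodfellow et al. (2014), the inner value of the minimax game equals <script type="math/tex">2\,JSD(p_{\text{data}}\|p_{\text{g}}) - \log 4</script>:

```python
import numpy as np

p_data = np.array([0.5, 0.3, 0.2])  # arbitrary discrete distributions
p_g = np.array([0.2, 0.2, 0.6])

def kl(p, q):
    # KL divergence between discrete distributions on the same support.
    return np.sum(p * np.log(p / q))

m = 0.5 * (p_data + p_g)
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

# Optimal discriminator and the resulting value of the GAN objective.
d_star = p_data / (p_data + p_g)
value = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

print(value, 2 * jsd - np.log(4))  # the two quantities coincide
```

So minimizing over the generator, with the discriminator held at its optimum, indeed minimizes the Jensen-Shannon divergence up to the constant <script type="math/tex">-\log 4</script>.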
<p>In adversarial auto-encoders, we go back to minimizing divergence measures on the latent variables. Recall that the variational auto-encoder’s objective (Kingma and Welling, 2014) is equivalently</p>
<script type="math/tex; mode=display">\min_q\, \mathbb{E}_{q(z\mid x)}[-\log p(x\mid z)] + \mathrm{KL}(q(z\mid x)\|p(z)),</script>
<p>where the first term is the reconstruction error—with the decoder <script type="math/tex">p(x\mid z)</script> evaluating codes from the encoder <script type="math/tex">q(z\mid x)</script>—and the second term is the regularizer. The adversarial auto-encoder’s objective simply changes the regularizer to <script type="math/tex">JSD(p(z)\|q(z))</script>, where we define the <em>aggregated posterior</em></p>
<script type="math/tex; mode=display">q(z) = \int q(z\mid x) p_{\text{data}}(x)\mathrm{d}x.</script>
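A quick numerical illustration of the aggregated posterior (a toy setup of my own, not from the paper): with a two-point data distribution and a narrow Gaussian encoder, samples from <script type="math/tex">q(z)</script> mix the per-datapoint posteriors, so the aggregate is much broader than any single <script type="math/tex">q(z\mid x)</script>:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100000
enc_scale = 0.1

# p_data: x = -1 or +1 with equal probability; encoder q(z|x) = N(x, 0.1^2).
x = rng.choice([-1.0, 1.0], size=n)
z = x + enc_scale * rng.normal(size=n)  # z ~ q(z) = ∫ q(z|x) p_data(x) dx

print(z.mean(), z.var())  # near 0 and 1.01; each q(z|x) has variance 0.01
```

It is this aggregate, not any individual encoder distribution, that the adversarial regularizer pushes toward the prior <script type="math/tex">p(z)</script>.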
<p><script type="math/tex">JSD(p(z)\|q(z))</script> is just as intractable as minimizing <script type="math/tex">JSD</script> in the data space, so it requires an adversarial network: <script type="math/tex">p(z)</script> replaces the original <script type="math/tex">p_{\text{data}}(x)</script> and <script type="math/tex">q(z\mid x)</script> replaces the original <script type="math/tex">p_{\text{g}}(x)</script>:</p>
<script type="math/tex; mode=display">\min_G
\max_D\,
\mathbb{E}_{p(z)}[\log D(z)] + \mathbb{E}_{q(z\mid x)}[\log (1- D(z))].</script>
<p>To train adversarial autoencoders, one alternates between minimizing the reconstruction error of the auto-encoder, i.e., <script type="math/tex">\mathbb{E}_{q(z\mid x)}[-\log p(x\mid z)]</script>, and training the adversarial network. The rest of the paper goes into standard experiments for semi-supervised learning on MNIST and SVHN. They also comment on an extension of generative moment matching networks (Li et al., 2015) to auto-encoders, also following these ideas.</p>
<h2 id="discussion">Discussion</h2>
<p>I see generative adversarial networks as fascinating inference algorithms because they’re a rare case where we can minimize a more general divergence measure between the generative model and the data distribution. It makes sense that one should avoid MLE’s KL objective, which can lead to overdispersed distributions, in favor of this approach. And it’s great to see attempts at trying to port these same ideas to the variational inference/auto-encoder setting.</p>
<p>The experiments seem to show that adversarial auto-encoders lead to higher visual fidelity than variational auto-encoders. It’s not clear to me why this is the case, however. The fact that there’s even a KL term as a regularizer in the VAE objective is a simple byproduct of the variational lower bound. In the end, it is still minimizing <script type="math/tex">\mathrm{KL}(q(z\mid x)\|p(z\mid x))</script>. However, replacing the KL regularizer with <script type="math/tex">JSD</script> makes it unclear what the underlying divergence measure is any longer. I buy that <script type="math/tex">JSD</script> ensures fewer “holes” as <script type="math/tex">q(z)</script> approximates the prior <script type="math/tex">p(z)</script>, but there’s a weird conflict as <script type="math/tex">q</script> simultaneously gets trained when minimizing the reconstruction error. The underlying divergence measure, if it exists, is most likely not as simple as <script type="math/tex">JSD(p(z\mid x)\|q(z\mid x))</script>—although if it were then that would be fantastic(?). Does this alternating procedure also necessarily converge?</p>
<p>Obligatory comment in relation to hierarchical variational models (Ranganath et al., 2015): Their “universal approximator posterior”, as a possible choice for the encoder <script type="math/tex">q(z\mid x)</script>, is a hierarchical variational model. To handle the intractability of the density, they naively apply Monte Carlo estimates. We’ve found this not to work well in practice due to very high variance, forcing a more complicated inference procedure which we outline in our paper.</p>
<p><em>edit</em> (12/1/15): A previous version of this post wrote <script type="math/tex">\mathrm{JSD}(p(z)\|q(z\mid x))</script> instead of <script type="math/tex">\mathrm{JSD}(p(z)\|q(z))</script>. Thanks to Alireza Makhzani for the correction.</p>
<p>Sun, 29 Nov 2015 16:19:10 -0500</p>
<h1>Infinite Dimensional Word Embeddings</h1>
<p><a href="http://dustintran.com/blog/infinite-dimensional-word-embeddings">http://dustintran.com/blog/infinite-dimensional-word-embeddings</a></p>
<ul>
<li>Eric Nalisnick and Sachin Ravi. Infinite Dimensional Word Embeddings. <a href="http://arxiv.org/abs/1511.05392">arXiv preprint arXiv:1511.05392</a>, 2015.</li>
</ul>
<h2 id="summary">Summary</h2>
<p>Word embeddings have been huge for the NLP community ever since Tomas Mikolov’s 2013 paper (it’s gotten over 1000 citations in 2 years!). The basic idea is to learn a <script type="math/tex">z</script>-dimensional parameter vector <script type="math/tex">w_i\in\mathbb{R}^z</script> associated to each word in a vocabulary. This is important as part of a machine learning pipeline which extracts features for language models, text classification, etc. Not unlike advertisements of discrete Bayesian nonparametric models, Nalisnick and Ravi propose the Infinite Skip-Gram model, which allows the dimension <script type="math/tex">z</script> of the word vectors to grow arbitrarily.</p>
<p>Letting word vector <script type="math/tex">w_i\in\mathbb{R}^\infty</script>, context vector <script type="math/tex">c_k\in\mathbb{R}^\infty</script>, and random positive integer <script type="math/tex">z\in\mathbb{Z}^+</script>, they consider multinomial logistic regression with (Eq.3)</p>
<script type="math/tex; mode=display">p(w_i,c_k,z) = \frac{1}{Z} e^{-E(w_i,c_k,z)},</script>
<p>where <script type="math/tex">Z</script> is the partition function summing over all <script type="math/tex">w_i,c_k,z</script>. This is the same as the original model for word embeddings, except now there is an auxiliary variable <script type="math/tex">z</script> and the vectors are infinite-dimensional. This seems intractable given some energy function <script type="math/tex">E(w_i,c_k,z)</script>, as the partition function requires a summation over countably infinite values of <script type="math/tex">z</script>. However, similar to Côté and Larochelle (2015), they show that the following penalized energy function enables finite computation (Eq.4):</p>
<script type="math/tex; mode=display">E(w_i,c_k,z) = -\sum_{j=1}^z w_{i,j} c_{k,j} + \Big[z\log a + \sum_{j=1}^z (w_{i,j}^2 + c_{k,j}^2)\Big].</script>
<p>The first term is the original (unpenalized) energy function, which is the inner product of <script type="math/tex">z</script>-dimensional vectors. The second term, the penalty, consists of two expressions: the <script type="math/tex">L_2</script> penalty forces the word and context vectors to be zero after sufficiently large index, and the <script type="math/tex">z\log a</script> term forces a convergent geometric series for the denominator of the conditional distribution <script type="math/tex">p(z\mid w, c)</script> (Eq.5). Tractability of the conditional distribution is required during optimization.</p>
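Here's a small sketch of why the <script type="math/tex">z\log a</script> term yields a convergent normalizer (toy numbers of my own; assuming the <script type="math/tex">L_2</script> penalty enters the energy with a positive sign, so that it actually penalizes large coefficients): once the vectors' entries are zero beyond some index, each increment of <script type="math/tex">z</script> multiplies <script type="math/tex">e^{-E}</script> by exactly <script type="math/tex">1/a</script>, a geometric tail.

```python
import numpy as np

a = 1.5
w = np.array([0.5, -0.3, 0.2] + [0.0] * 7)  # zero beyond index 3
c = np.array([0.4, 0.1, -0.2] + [0.0] * 7)

def energy(z):
    # E(w, c, z) = -sum_{j<=z} w_j c_j + z*log(a) + sum_{j<=z} (w_j^2 + c_j^2)
    return (-np.sum(w[:z] * c[:z]) + z * np.log(a)
            + np.sum(w[:z] ** 2 + c[:z] ** 2))

mass = np.array([np.exp(-energy(z)) for z in range(1, 11)])
ratios = mass[1:] / mass[:-1]
print(ratios)  # beyond the vectors' support, each ratio is exactly 1/a
```

The unnormalized masses therefore sum as a geometric series in the tail, which is what makes the conditional <script type="math/tex">p(z\mid w,c)</script> computable.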
<p>For inference, they treat <script type="math/tex">z</script> as a nuisance parameter and perform approximate maximum likelihood estimation. Specifically, they minimize an upper bound to the negative log-likelihood <script type="math/tex">\mathcal{L}=-\log \sum_{z} p(c,z\mid w)</script> over the word vectors <script type="math/tex">w</script>. Similar to how word embeddings require negative sampling for computational efficiency, Nalisnick and Ravi make additional approximations, evaluating Monte Carlo estimates of the gradient of (an upper bound to) this objective.</p>
<h2 id="discussion">Discussion</h2>
<p>Very cool idea! I’m always a fan of infinite-dimensional models, not just because they’re mathematically interesting but because there’s no reason that word vectors should be restricted to a fixed dimensionality. In a streaming setting, for example, the word representations should grow increasingly more complex as more information in the data is available.</p>
<p>Although they don’t explicitly state this, what they’re really doing is applying a variational lower bound for inference, invoking Jensen’s inequality and setting the variational distribution as the prior <script type="math/tex">q(z)= p(z\mid w_i)</script> (Eq.7). If readers are familiar with the <a href="http://blog.shakirm.com/2015/11/machine-learning-trick-of-the-day-5-log-derivative-trick/">score function estimator</a> used for variational inference (Paisley et al., 2012), the Monte Carlo gradient in Eq.10 applies the same trick. Following this interpretation, it surprises me then that the algorithm works so well. Using model priors for variational approximations makes for quite a loose lower bound of the log-likelihood.</p>
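For readers unfamiliar with it, the score function estimator writes <script type="math/tex">\nabla_\theta \mathbb{E}_{q(z;\theta)}[f(z)] = \mathbb{E}_{q(z;\theta)}[f(z)\,\nabla_\theta \log q(z;\theta)]</script>, so it needs only samples from <script type="math/tex">q</script> and the gradient of its log-density. A minimal sketch with a Bernoulli variational distribution and a toy <script type="math/tex">f</script> of my own (not the paper's objective):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 0.4                   # Bernoulli parameter of q(z; theta)
f = lambda z: 2.0 * z + 1.0   # toy objective; E[f] = 2*theta + 1

n = 200000
z = (rng.random(n) < theta).astype(float)

# Score: d/dtheta log q(z; theta) = (z - theta) / (theta * (1 - theta)).
score = (z - theta) / (theta * (1 - theta))
grad_est = np.mean(f(z) * score)

print(grad_est)  # close to the analytic gradient d/dtheta E[f] = 2
```

The estimator is unbiased but can be high-variance, which is exactly the issue that makes naive Monte Carlo estimates delicate in practice.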
<p>I highly recommend reading this paper. It’s very well-written and easy to read even if you don’t know anything about word embeddings. The related work section places it very appropriately in context. They mention Vilnis and McCallum (2015) a lot, and I’d be very interested to see if they can extend their work to inferring infinite-dimensional distributions of words rather than infinite-dimensional point-estimates of words.</p>
<p>Sat, 28 Nov 2015 17:30:15 -0500</p>
<h1>Neural Variational Inference for Text Processing</h1>
<p><a href="http://dustintran.com/blog/neural-variational-inference-for-text-processing">http://dustintran.com/blog/neural-variational-inference-for-text-processing</a></p>
<ul>
<li>Yishu Miao, Lei Yu, Phil Blunsom. Neural Variational Inference for Text Processing. <a href="http://arxiv.org/abs/1511.06038">arXiv preprint arXiv:1511.06038</a>, 2015.</li>
</ul>
<h2 id="summary">Summary</h2>
<p>Following recent work on variational auto-encoders and their advances in computer vision, the authors propose deep generative models and related inference algorithms for text. The Neural Variational Document Model (NVDM) is an instantiation of the variational auto-encoder for deep latent Gaussian models, where the data input is the bag-of-words representation of the document. There’s a cool interpretation of this, as the Gaussian latent variables can be thought of as the embedding of the discrete vocabulary in a real-valued latent space. In the scenario of multiple hidden layers, the deep latent Gaussian model can be seen as sequentially embedding each corresponding value within nested latent spaces.</p>
<p>As a more complex model, they also propose the Neural Answer Selection Model (NASM) for answer sentence selection, in which the task is to identify the correct answers to a question from a set of possible answers. The NASM employs a latent attention model similar to the deep recurrent attentive writer (Gregor et al., 2015) for modelling images. Given a question <script type="math/tex">q</script>, it attends to a set of answer sentences <script type="math/tex">\{a_1,\ldots,a_n\}</script> associated with <script type="math/tex">q</script>; this in turn determines the context vector, which captures the words in the answer sentences that are most relevant for matching answers to the current question. This enables the model to learn subtleties inherent in the questions.</p>
<h2 id="discussion">Discussion</h2>
<p>It’s a simple idea coming from the variational auto-encoder literature and their recent state-of-the-art advances for unsupervised learning in computer vision. This is well worth studying. Interestingly, there doesn’t seem to be any complications when transferring these same ideas to analyzing text.</p>
<p>Confusingly, despite the title, it doesn’t apply neural variational inference from Mnih and Gregor (2014), as the generative model is continuous. Instead, it uses the reparameterization tricks standard in variational auto-encoders.</p>
<p>It would have been interesting to study a generative model of documents which also employs a latent attention architecture. Most immediate in my mind would be avoiding the bag-of-words representation and explicitly modelling the word-to-word sequence of the documents.</p>
<p>Sat, 28 Nov 2015 04:48:57 -0500</p>