http://dustintran.com/blog/
Fri, 24 Mar 2023 02:48:00 -0700Major AI advances this month
http://dustintran.com/blog/ai-advances
http://dustintran.com/blog/ai-advances<style>
table th:first-of-type {
width: 10%;
}
table th:nth-of-type(2) {
width: 60%;
}
</style>
<table>
<thead>
<tr>
<th>Date</th>
<th>Announcement</th>
</tr>
</thead>
<tbody>
<tr>
<td>3/1</td>
<td><a href="https://openai.com/blog/introducing-chatgpt-and-whisper-apis">OpenAI: ChatGPT and Whisper API</a></td>
</tr>
<tr>
<td>3/6</td>
<td><a href="https://ai.googleblog.com/2023/03/universal-speech-model-usm-state-of-art.html">Google: Universal Speech Model</a></td>
</tr>
<tr>
<td>3/10</td>
<td><a href="https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal-language.html">Google: PaLM-E</a></td>
</tr>
<tr>
<td>3/14</td>
<td><a href="https://www.anthropic.com/index/introducing-claude">Anthropic: Claude</a></td>
</tr>
<tr>
<td>3/14</td>
<td><a href="https://blog.google/technology/ai/ai-developers-google-cloud-workspace">Google: PaLM API & Workspace</a></td>
</tr>
<tr>
<td>3/14</td>
<td><a href="https://openai.com/research/gpt-4">OpenAI: GPT-4</a></td>
</tr>
<tr>
<td>3/15</td>
<td><a href="https://www.youtube.com/watch?v=ukvEUI3x0vI">Baidu: ERNIE Bot</a></td>
</tr>
<tr>
<td>3/15</td>
<td><a href="https://twitter.com/midjourney/status/1636130389365497857">Midjourney: Midjourney V5</a></td>
</tr>
<tr>
<td>3/16</td>
<td><a href="https://blogs.microsoft.com/blog/2023/03/16/introducing-microsoft-365-copilot-your-copilot-for-work">Microsoft: Microsoft 365 Copilot</a></td>
</tr>
<tr>
<td>3/21</td>
<td><a href="https://blog.google/technology/ai/try-bard">Google: Bard</a></td>
</tr>
<tr>
<td>3/21</td>
<td><a href="https://blogs.microsoft.com/blog/2023/03/21/create-images-with-your-words-bing-image-creator-comes-to-the-new-bing">Microsoft: Bing Image Creator</a></td>
</tr>
<tr>
<td>3/22</td>
<td><a href="https://github.blog/2023-03-22-github-copilot-x-the-ai-powered-developer-experience">GitHub: Copilot X</a></td>
</tr>
<tr>
<td>3/23</td>
<td><a href="https://openai.com/blog/chatgpt-plugins">OpenAI: ChatGPT Plugins</a></td>
</tr>
</tbody>
</table>
<p>And for bookkeeping, a relevant collection for February.</p>
<table>
<thead>
<tr>
<th>Date</th>
<th>Announcement</th>
</tr>
</thead>
<tbody>
<tr>
<td>2/6</td>
<td><a href="https://blog.google/technology/ai/bard-google-ai-search-updates">Google: Bard announcement</a></td>
</tr>
<tr>
<td>2/7</td>
<td><a href="https://blogs.microsoft.com/blog/2023/02/07/reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web/">Microsoft: Bing Chat</a></td>
</tr>
<tr>
<td>2/9</td>
<td><a href="https://arxiv.org/abs/2302.04761">Meta: Toolformer</a></td>
</tr>
<tr>
<td>2/13</td>
<td><a href="https://twitter.com/m__dehghani/status/1625186144001396737">Google: Vision Transformer 22B</a></td>
</tr>
<tr>
<td>2/22</td>
<td><a href="https://blogs.microsoft.com/blog/2023/02/22/the-new-bing-preview-experience-arrives-on-bing-and-edge-mobile-apps-introducing-bing-now-in-skype">Microsoft: Bing announcement on mobile and Skype</a></td>
</tr>
<tr>
<td>2/24</td>
<td><a href="https://ai.facebook.com/blog/large-language-model-llama-meta-ai">Meta: LLaMA</a></td>
</tr>
</tbody>
</table>
Thu, 23 Mar 2023 00:00:00 -0700I joined Google
http://dustintran.com/blog/i-joined-google
http://dustintran.com/blog/i-joined-google<p>As a personal news update, starting today, I am at
<a href="https://research.google.com">Google</a> full-time as a Research Scientist. I’ll
be based in the San Francisco and Mountain View offices.</p>
<p>My job search was painless and quick. I applied selectively, in the hope that
it wouldn’t detract from working on ICML papers and Edward. Well, it did. But
fortunately I’ve still had time to work on them.</p>
<p>Now to finish that pesky Ph.D…</p>
Mon, 05 Feb 2018 00:00:00 -0800At NIPS 2017
http://dustintran.com/blog/at-nips-2017
http://dustintran.com/blog/at-nips-2017<p>I’m at <a href="https://nips.cc/Conferences/2017/">NIPS 2017</a>.
Please catch me or e-mail me if you’d like to chat about research!
<!-- Especially if it's about probabilistic programming, variational -->
<!-- inference, or recent work. -->
I can also describe my experience on
the Bayesflow team at Google (TensorFlow meets Bayes), which is led by <a href="https://research.google.com/pubs/105197.html">Rif
Saurous</a> and in part by
<a href="https://research.google.com/pubs/KevinMurphy.html">Kevin Murphy</a>, as
well as in <a href="http://www.cs.columbia.edu/~blei">David Blei’s group</a> at
Columbia. Both are always looking for excellent researchers as
postdocs, research scientists, and interns.</p>
<p>As an advertisement, we’re fortunate to have two posters at the main conference:</p>
<ul>
<li>Adji B. Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, David
M. Blei (2017).
<a href="http://papers.nips.cc/paper/6866-variational-inference-via-chi-upper-bound-minimization">Variational inference via <script type="math/tex">\chi</script>-upper bound
minimization.</a>
<em>Monday, 06:30 – 10:30 PM @ Pacific Ballroom #186</em></li>
<li>Dustin Tran, Rajesh Ranganath, David M. Blei (2017).
<a href="http://papers.nips.cc/paper/7136-hierarchical-implicit-models-and-likelihood-free-variational-inference">Hierarchical implicit models and likelihood-free variational
inference.</a>
<em>Wednesday, 06:30 – 10:30 PM @ Pacific Ballroom #179</em></li>
</ul>
<p>At workshops, I’ll also be presenting three talks and two posters:</p>
<ul>
<li>Title: <em>Implicit causal models for genome-wide association studies</em><br />
Friday, 11:40–12:00 @ Room 104 C. <a href="https://dl4physicalsciences.github.io">NIPS Workshop: Deep Learning for Physical Sciences</a>.</li>
<li>Title: <em>Why Aren’t You Using Probabilistic Programming?</em><br />
Saturday, 8:05–8:30 @ Hall C. <a href="http://bayesiandeeplearning.org">NIPS Workshop: Bayesian Deep Learning</a>.</li>
<li>Title: <em>Lessons learned from designing Edward</em><br />
Saturday, 8:05–8:30 @ Room 202. <a href="https://mltrain.cc/events/nips-highlights-learn-how-to-code-a-paper-with-state-of-the-art-frameworks/">NIPS Workshop: NIPS Highlights, Learn How to Code a Paper</a>.</li>
<li>Poster: <em>Implicit causal models for genome-wide association studies</em> [<a href="https://dl4physicalsciences.github.io/files/nips_dlps_2017_14.pdf">pdf</a>] [<a href="https://arxiv.org/abs/1710.10742">arxiv</a>] <br />
Dustin Tran, David M. Blei (2017).<br />
Friday @ Room 104 C. <a href="https://dl4physicalsciences.github.io">NIPS Workshop: Deep Learning for Physical Sciences</a>.</li>
<li>Poster: <em>Feature-matching auto-encoders</em> [<a href="http://bayesiandeeplearning.org/2017/papers/58.pdf">pdf</a>]<br />
Dustin Tran, Yura Burda, Ilya Sutskever (2017).<br />
Saturday @ Hall C. <a href="http://bayesiandeeplearning.org">NIPS Workshop: Bayesian Deep Learning</a>.</li>
</ul>
<p>The <a href="http://approximateinference.org">NIPS approximate inference workshop</a> will be quite a bit of fun this Friday @ Seaside Ballroom. Join us!</p>
<p>Finally, if you’d like to talk about current research, I’m <em>very</em>
excited to talk about the workshop papers and our recent probabilistic
programming work: <a href="https://arxiv.org/abs/1711.10604">TensorFlow Distributions</a>, and a coming book chapter on deep probabilistic programming with Vikash Mansinghka.</p>
<!-- + Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, Rif A. Saurous (2017). -->
<!-- _TensorFlow Distributions_. -->
<!-- \[[arxiv](https://arxiv.org/abs/1711.10604)\] -->
<!-- Vikash Mansinghka and I have been writing a book chapter on -->
<!-- probabilistic programming for deep generative models. Expect it -->
<!-- out within coming weeks. -->
<p>See you at the conference.</p>
Mon, 04 Dec 2017 00:00:00 -0800On Pyro - Deep Probabilistic Programming on PyTorch
http://dustintran.com/blog/on-pyro-deep-probabilistic-programming-on-pytorch
http://dustintran.com/blog/on-pyro-deep-probabilistic-programming-on-pytorch<p>Pyro, a “deep universal probabilistic programming language” on
PyTorch, was announced today (see <a href="https://eng.uber.com/pyro/">blog post</a>; <a href="http://pyro.ai">homepage</a>).
People were curious about my thoughts. I shared a short note on
<a href="https://www.reddit.com/r/MachineLearning/comments/7ak6x9/n_uber_ai_labs_open_sources_pyro_a_deep/">reddit</a>,
copied and pasted below:</p>
<blockquote>
<p>This is great work coming from the Uber AI labs, especially by Eli Bingham and Noah Goodman for leading this effort among an excellent group. I’ve met with them in-person on numerous occasions to discuss the overall design and implementation details. Pyro touches on interesting aspects in PPL research: dynamic computational graphs, deep generative models, and programmable inference.</p>
<p>It’s yet to see where Pyro will come to fruition. Personally, inheriting from my advisors David Blei and Andrew Gelman, I like to think from a bottom-up view where applications ground design principles; and they end up determining the direction and success of a PPL. For Stan, it’s hierarchical GLMs fueled with HMC across a variety of social and political sciences. For Edward, it’s deep latent variable models fueled with black box VI across text, images, and spatial data. I’d like to see where Pyro not only makes dynamic probabilistic programming easier, but (1) what applications it enables that was not possible before; and (2) what new PPL innovations come out. Attend, Infer, Repeat (<a href="https://github.com/uber/pyro/blob/dev/tutorial/source/air.ipynb">Pyro notebook</a>) is a great example in this direction.</p>
<p>On speed: Pyro might be faster than Edward on CPUs depending on the intensity of graph-building in PyTorch vs TensorFlow. I’m confident Edward will dominate on GPUs (certainly TPUs) when data or model parallelism is the bottleneck. It warrants benchmarks, including Pyro vs native PyTorch. Edward benefits from speed being just as fast as native TF because the underlying computational graph is the same. Dynamic PPLs trade off that benefit.</p>
</blockquote>
Fri, 03 Nov 2017 00:00:00 -0700NIPS 2017 Workshop on Approximate Inference
http://dustintran.com/blog/nips-2017-workshop-on-approximate-inference
http://dustintran.com/blog/nips-2017-workshop-on-approximate-inference<p>This year we’re organizing the <a href="http://approximateinference.org">third NIPS workshop on approximate inference</a>. I’m organizing it together with Francisco Ruiz, Stephan Mandt, Cheng Zhang, and James McInerney, alongside our amazing committee of Tamara Broderick, Michalis Titsias, David Blei, and Max Welling.</p>
<p>Call for papers below.</p>
<p><strong>Note</strong>: We have a <em>lot</em> of funding for awards this year. We’ve
decided not only to allocate some funding for Ph.D. students and early
postdocs, but also to feature a best paper award. Submit your papers!</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call For Papers
NIPS Workshop on Advances in Approximate Bayesian Inference
Friday, 8th December 2017, Long Beach, California
URL: http://approximateinference.org
Submission deadline: Nov 01, 2017
Please direct questions to: aabiworkshop2017@gmail.com
## Call for Participation
We invite researchers to submit their recent work on the development, analysis, or application of approximate Bayesian inference.
A submission should take the form of an extended abstract of 2-4 pages in PDF format using the NIPS style. Author names do not need to be anonymized and references may extend as far as needed beyond the 4 page upper limit. If authors' research has previously appeared in a journal, workshop, or conference (including NIPS 2017), their workshop submission should extend that previous work. Submissions may include a supplement/appendix, but reviewers are not responsible for reading any supplementary material.
This year, the workshop offers multiple best paper awards. They are open to all researchers, and a few awards are restricted to junior researchers. Submitting by the deadline automatically entitles you for consideration for all of the following:
+ Roughly $3000 in total, to be allocated across winners
+ Four NIPS 2017 workshop registration fee waivers
## Abstract
Approximate inference is key to modern probabilistic modeling. Thanks to the availability of big data, significant computational power, and sophisticated models, machine learning has achieved many breakthroughs in multiple application domains. At the same time, approximate inference becomes critical since exact inference is intractable for most models of interest. Within the field of approximate Bayesian inference, variational and Monte Carlo methods are currently the mainstay techniques. For both methods, there has been considerable progress both on the efficiency and performance.
In this workshop, we encourage submissions advancing approximate inference methods. We are open to a broad scope of methods within the field of Bayesian inference. In addition, we also encourage applications of approximate inference in many domains, such as computational biology, recommender systems, differential privacy, and industry applications.
## Key Dates
Nov 01, 2017: Submission Deadline
Nov 15, 2017: Notification of Acceptance
Nov 24, 2017: Submission Reviews & Award Notifications
## Organizers
Francisco Ruiz, Stephan Mandt, Cheng Zhang, James McInerney, Dustin Tran
## Advisory Committee
Tamara Broderick, Michalis Titsias, David Blei, Max Welling
</code></pre></div></div>
Mon, 25 Sep 2017 00:00:00 -0700How much compute do we need to train generative models?
http://dustintran.com/blog/how-much-compute-do-we-need-to-train-generative-models
http://dustintran.com/blog/how-much-compute-do-we-need-to-train-generative-models<p><em>Update (09/01/17): The post is written to be somewhat silly and numbers are not meant to be accurate. For example, there is a simplifying assumption that training time scales linearly with the # of bits to encode the output; and 5000 is chosen arbitrarily given only that the output’s range has 65K*3 dimensions and each takes one of 256 integers.</em></p>
<p>Discriminative models can take weeks to train. It was not until a
breakthrough two months ago by Facebook <a class="citation" href="#goyal2017accurate">(Goyal et al., 2017)</a> that we could successfully train a neural net
exceeding human accuracy (ResNet-50) on ImageNet in one hour. And this
was with 256 GPUs and a monstrous batch size of 8192.
<!-- Unfortunately, most of us mortals have maybe 8 GPUs at most—or for the -->
<!-- very fortunate, at most 8 GPUs per experiment—and do not have help -->
<!-- from the first authors of ResNets and Caffe. This means in 2017, each -->
<!-- ImageNet classifier can still take days to a week. --></p>
<p>Contrast this with generative models. We’ve made progress in
stability and sample diversity with generative adversarial networks,
where, say, Wasserstein GANs with gradient penalty
<a class="citation" href="#gulrajani2017improved">(Gulrajani, Ahmed, Arjovsky, Dumoulin, & Courville, 2017)</a> and
Cramer GANs
<a class="citation" href="#bellemare2017cramer">(Bellemare et al., 2017)</a>
can get good results for generating LSUN bedrooms.
But in communication with
Ishaan Gulrajani, this took 3 days to train with 4 GPUs and 900,000
total iterations; moreover, LSUN
has a resolution of 64x64 and is
significantly less diverse than the 256x256 sized ImageNet.
<!-- : this is especially the case as we do 5 discriminator -->
<!-- updates per generator update, which is already a 5x slowdown compared -->
<!-- to vanilla GANs per-generator iteration. -->
Let’s also not kid ourselves
that we have perfected density estimation to learn the true distribution of
LSUN bedrooms yet.</p>
<p>Generative models for text are no different. The best results so far for the 1 billion word
language modeling benchmark are an LSTM with 151 million parameters
(excluding embedding and softmax layers)
which took 3 weeks to train with 32 GPUs
<a class="citation" href="#jozefowicz2016exploring">(Jozefowicz, Vinyals, Schuster, Shazeer, & Wu, 2016)</a>
and a mixture of experts LSTM with 4.3 billion parameters
<a class="citation" href="#shazeer2017outrageously">(Shazeer et al., 2017)</a>.
<!-- downsampled ImageNet. --></p>
<p>This raises the question: how much compute <em>should</em> we expect in order
to learn a generative model?</p>
<p>Suppose we restrict ourselves to 256x256 ImageNet as a proxy for
natural images.
A simple property in information theory says that the entropy of
the conditional
<script type="math/tex">p(\text{class label}\mid \text{natural image})</script>
is upper bounded by <script type="math/tex">\log K</script> bits for <script type="math/tex">K</script> classes.
Compare this to the entropy of the unconditional
<script type="math/tex">p(\text{natural image})</script>, whose
number of bits is a function of <script type="math/tex">256\times 256=65,536</script>
pixels, each of which takes 3 values from <script type="math/tex">[0, 255]</script>.
A very modest guess would be that <script type="math/tex">p(\text{natural image})</script> has
5000 times more bits. We also need to
account for the difference in training methods. Let’s say that the
method for generative models is only 6x slower than that of
discriminative models (5 discriminator updates per generator update;
we’ll forget the fact that GAN and MMD objectives are actually more expensive
than maximum likelihood due to multiple forward and backward passes).</p>
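<p>As a rough sanity check on the “5000 times” guess, the sketch below compares the two bit counts directly. It assumes K = 1000 classes (the usual ImageNet classification setup, which is not stated above) and uses the naive 8-bits-per-value encoding as a crude upper bound on the entropy of an image. The function names are made up for illustration.</p>

```python
import math


def label_entropy_bound_bits(num_classes: int) -> float:
    """Upper bound on H(class label | image): log2(K) bits."""
    return math.log2(num_classes)


def raw_image_bits(height: int, width: int, channels: int = 3, levels: int = 256) -> float:
    """Bits needed to store an image naively, one log2(levels) code per value.

    This is only an upper bound on the entropy of p(natural image);
    real images are far more compressible.
    """
    return height * width * channels * math.log2(levels)


# Assumption: K = 1000 classes, as in the standard ImageNet benchmark.
label_bits = label_entropy_bound_bits(1000)
image_bits = raw_image_bits(256, 256)

print(f"label bound: {label_bits:.2f} bits")
print(f"raw image:   {image_bits:.0f} bits")
print(f"ratio:       {image_bits / label_bits:.0f}x")
```

<p>Under these assumptions the naive ratio is on the order of 10^5, so a factor of 5000 sits well below the raw upper bound.</p>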
<p>Finally, let’s take Facebook’s result as a baseline for learning
<script type="math/tex">p(\text{class label}\mid \text{natural image})</script> in 1 hour with 256 GPUs
and a batch size of 8192. <strong>Then the distribution <script type="math/tex">p(\text{natural image})</script>
would require
1 hour <script type="math/tex">\cdot</script> 5000 <script type="math/tex">\cdot</script> 6 <script type="math/tex">=</script> 30,000 hours <script type="math/tex">\approx</script> 3.4 years to train.</strong>
And this is assuming we have the right objective, architecture, and
hyperparameters to set it and forget it: until then, let’s hope for
better hardware.</p>
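<p>The back-of-the-envelope arithmetic can be written out as a small sketch. It assumes, as the post does, that training time scales linearly with both factors, and it takes a year to be 365 days. The function and parameter names are illustrative only.</p>

```python
HOURS_PER_YEAR = 24 * 365  # assumption: a non-leap year


def generative_training_hours(baseline_hours: float,
                              bits_ratio: float,
                              method_slowdown: float) -> float:
    """Scale a discriminative training baseline linearly by both factors."""
    return baseline_hours * bits_ratio * method_slowdown


# 1 hour baseline, 5000x more bits, 6x slower training method.
hours = generative_training_hours(baseline_hours=1, bits_ratio=5000, method_slowdown=6)
years = hours / HOURS_PER_YEAR
print(f"{hours:,.0f} hours, about {years:.1f} years")
# prints: 30,000 hours, about 3.4 years
```

<p>Changing either factor shows how sensitive the estimate is: it is a product of guesses, not a measurement.</p>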
<p><em>This short post is extracted from a fun conversation with Alec Radford today.</em></p>
<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="bellemare2017cramer">Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., & Munos, R. (2017). The Cramer Distance as a Solution to Biased Wasserstein Gradients. <i>ArXiv Preprint ArXiv:1705.10743</i>.</span></li>
<li><span id="goyal2017accurate">Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., … He, K. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. <i>ArXiv Preprint ArXiv:1706.02677</i>.</span></li>
<li><span id="gulrajani2017improved">Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved Training of Wasserstein GANs. <i>ArXiv Preprint ArXiv:1704.00028</i>.</span></li>
<li><span id="jozefowicz2016exploring">Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016). Exploring the Limits of Language Modeling. <i>ArXiv Preprint ArXiv:1602.02410</i>.</span></li>
<li><span id="shazeer2017outrageously">Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In <i>International Conference on Learning Representations</i>.</span></li></ol>
Thu, 31 Aug 2017 00:00:00 -0700My Qualifying Exam (Oral)
http://dustintran.com/blog/my-qualifying-exam-oral
http://dustintran.com/blog/my-qualifying-exam-oral<p>I’m taking my qualifying exam this Tuesday. It may surprise some of
you that I haven’t already done it! This is mostly due to logistical
kerfuffles as I transferred Ph.D. programs, and I also tend to avoid coursework
like the plague.</p>
<p>Each university has its own culture around an oral or qualifying exam.
Columbia’s Computer Science department involves the following:</p>
<blockquote>
<p>The committee, after consideration of the student’s input, selects a syllabus of the 20-30 most significant documents that encompass the state of the art in the area. […] The oral exam begins with the student’s 30 minute critical evaluation of the syllabus materials, and is followed by no more than 90 minutes of questioning by the committee on any subject matter related to their contents. The student is judged primarily on the oral evidence, but the content and style of the presentation can account for part of the decision.
<a href="http://www.cs.columbia.edu/education/phd/requirements/candidacy/">[url]</a></p>
</blockquote>
<p>My syllabus concerns <em>Bayesian deep learning</em>, which is the
synthesis of modern Bayesian analysis with deep learning.
The syllabus includes 29 papers published in 2014 or later,
representing “the most significant documents that encompass the
state of the art in the area.”
I got multiple requests from friends to share the list, so I decided
to just share it publicly.</p>
<p><strong>Probabilistic programming & AI systems</strong></p>
<ol>
<li><a class="citation" href="#mansinghka2014venture">Mansinghka, Selsam, & Perov (2014)</a></li>
<li><a class="citation" href="#tristan2014augur">Tristan et al. (2014)</a></li>
<li><a class="citation" href="#schulman2015gradient">Schulman, Heess, Weber, & Abbeel (2015)</a></li>
<li><a class="citation" href="#narayanan2016probabilistic">Narayanan, Carette, Romano, Shan, & Zinkov (2016)</a></li>
<li><a class="citation" href="#abadi2016tensorflow">Abadi et al. (2016)</a></li>
<li><a class="citation" href="#carpenter2016stan">Carpenter et al. (2016)</a></li>
<li><a class="citation" href="#tran2016edward">Tran et al. (2016)</a></li>
<li><a class="citation" href="#kucukelbir2017automatic">Kucukelbir, Tran, Ranganath, Gelman, & Blei (2017)</a></li>
<li><a class="citation" href="#tran2017deep">Tran et al. (2017)</a></li>
<li><a class="citation" href="#neubig2017dynet">Neubig et al. (2017)</a></li>
</ol>
<p><strong>Variational inference</strong></p>
<ol>
<li><a class="citation" href="#kingma2014autoencoding">Kingma & Welling (2014)</a></li>
<li><a class="citation" href="#ranganath2014black">Ranganath, Gerrish, & Blei (2014)</a></li>
<li><a class="citation" href="#rezende2014stochastic">Rezende, Mohamed, & Wierstra (2014)</a></li>
<li><a class="citation" href="#mnih2014neural">Mnih & Gregor (2014)</a></li>
<li><a class="citation" href="#rezende2015variational">Rezende & Mohamed (2015)</a></li>
<li><a class="citation" href="#salimans2015markov">Salimans, Kingma, & Welling (2015)</a></li>
<li><a class="citation" href="#tran2016variational">Tran, Ranganath, & Blei (2016)</a></li>
<li><a class="citation" href="#ranganath2016hierarchical">Ranganath, Tran, & Blei (2016)</a></li>
<li><a class="citation" href="#maaloe2016auxiliary">Maaløe, Sønderby, Sønderby, & Winther (2016)</a></li>
<li><a class="citation" href="#johnson2016composing">Johnson, Duvenaud, Wiltschko, Datta, & Adams (2016)</a></li>
<li><a class="citation" href="#ranganath2016operator">Ranganath, Altosaar, Tran, & Blei (2016)</a></li>
<li><a class="citation" href="#gelman2017expectation">Gelman et al. (2017)</a></li>
</ol>
<p><strong>Implicit probabilistic models & adversarial training</strong></p>
<ol>
<li><a class="citation" href="#goodfellow2014generative">Goodfellow et al. (2014)</a></li>
<li><a class="citation" href="#dziugaite2015training">Dziugaite, Roy, & Ghahramani (2015)</a></li>
<li><a class="citation" href="#li2015generative">Li, Swersky, & Zemel (2015)</a></li>
<li><a class="citation" href="#radford2016unsupervised">Radford, Metz, & Chintala (2016)</a></li>
<li><a class="citation" href="#mohamed2016learning">Mohamed & Lakshminarayanan (2016)</a></li>
<li><a class="citation" href="#arjovsky2017wasserstein">Arjovsky, Chintala, & Bottou (2017)</a></li>
<li><a class="citation" href="#tran2017deepand">Tran, Ranganath, & Blei (2017)</a></li>
</ol>
<p>Committee: David Blei, Andrew Gelman, Daniel Hsu.</p>
<p>Disclaimer: I favored papers which have
been shown to be or are most likely to be long-lasting in influence (this
means fewer papers from 2017); papers on methodology rather than
applications (only to narrow the scope); original papers over surveys;
and my own papers (because it’s my oral). If I did not cite you or if
you have strong opinions about a missing paper, recall <a href="https://en.wikipedia.org/wiki/Hanlon%27s_razor">Hanlon’s
razor</a>. E-mail me your
suggestions.</p>
<p>Update (08/08/2017): I passed the oral. :-)</p>
<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="abadi2016tensorflow">Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., … Zhang, X. (2016). TensorFlow: A system for large-scale machine learning, <i>cs.DC</i>, 1–18.</span></li>
<li><span id="arjovsky2017wasserstein">Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. In <i>International Conference on Machine Learning</i>.</span></li>
<li><span id="carpenter2016stan">Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., … Riddell, A. (2016). Stan: a probabilistic programming language. <i>Journal of Statistical Software</i>.</span></li>
<li><span id="dziugaite2015training">Dziugaite, G. K., Roy, D. M., & Ghahramani, Z. (2015). Training generative neural networks via Maximum Mean Discrepancy optimization. In <i>Uncertainty in Artificial Intelligence</i>.</span></li>
<li><span id="gelman2017expectation">Gelman, A., Vehtari, A., Jylänki, P., Sivula, T., Tran, D., Sahai, S., … Robert, C. (2017). Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data. <i>ArXiv.org</i>.</span></li>
<li><span id="goodfellow2014generative">Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative Adversarial Nets. In <i>Neural Information Processing Systems</i>.</span></li>
<li><span id="johnson2016composing">Johnson, M. J., Duvenaud, D., Wiltschko, A. B., Datta, S. R., & Adams, R. P. (2016). Composing graphical models with neural networks for structured representations and fast inference. In <i>Neural Information Processing Systems</i>.</span></li>
<li><span id="kingma2014autoencoding">Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. In <i>International Conference on Learning Representations</i>.</span></li>
<li><span id="kucukelbir2017automatic">Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic Differentiation Variational Inference. <i>Journal of Machine Learning Research</i>, <i>18</i>, 1–45.</span></li>
<li><span id="li2015generative">Li, Y., Swersky, K., & Zemel, R. (2015). Generative Moment Matching Networks. In <i>International Conference on Machine Learning</i>.</span></li>
<li><span id="maaloe2016auxiliary">Maaløe, L., Sønderby, C. K., Sønderby, S. K., & Winther, O. (2016). Auxiliary Deep Generative Models. In <i>International Conference on Machine Learning</i>.</span></li>
<li><span id="mansinghka2014venture">Mansinghka, V., Selsam, D., & Perov, Y. (2014). Venture: a higher-order probabilistic programming platform with programmable inference. <i>ArXiv.org</i>.</span></li>
<li><span id="mnih2014neural">Mnih, A., & Gregor, K. (2014). Neural Variational Inference and Learning in Belief Networks. In <i>International Conference on Machine Learning</i>.</span></li>
<li><span id="mohamed2016learning">Mohamed, S., & Lakshminarayanan, B. (2016). Learning in Implicit Generative Models. <i>ArXiv.org</i>.</span></li>
<li><span id="narayanan2016probabilistic">Narayanan, P., Carette, J., Romano, W., Shan, C.-c., & Zinkov, R. (2016). Probabilistic Inference by Program Transformation in Hakaru (System Description). In <i>International Symposium on Functional and Logic Programming</i>. Springer, Cham.</span></li>
<li><span id="neubig2017dynet">Neubig, G., Dyer, C., Goldberg, Y., Matthews, A., Ammar, W., Anastasopoulos, A., … Yin, P. (2017). DyNet: The Dynamic Neural Network Toolkit. <i>ArXiv.org</i>.</span></li>
<li><span id="radford2016unsupervised">Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In <i>International Conference on Learning Representations</i>.</span></li>
<li><span id="ranganath2016operator">Ranganath, R., Altosaar, J., Tran, D., & Blei, D. M. (2016). Operator Variational Inference. In <i>Neural Information Processing Systems</i>.</span></li>
<li><span id="ranganath2014black">Ranganath, R., Gerrish, S., & Blei, D. M. (2014). Black Box Variational Inference. In <i>Artificial Intelligence and Statistics</i>.</span></li>
<li><span id="ranganath2016hierarchical">Ranganath, R., Tran, D., & Blei, D. M. (2016). Hierarchical Variational Models. In <i>International Conference on Machine Learning</i>.</span></li>
<li><span id="rezende2015variational">Rezende, D. J., & Mohamed, S. (2015). Variational Inference with Normalizing Flows. In <i>International Conference on Machine Learning</i>.</span></li>
<li><span id="rezende2014stochastic">Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In <i>International Conference on Machine Learning</i>.</span></li>
<li><span id="salimans2015markov">Salimans, T., Kingma, D. P., & Welling, M. (2015). Markov Chain Monte Carlo and Variational Inference: Bridging the Gap. In <i>International Conference on Machine Learning</i>.</span></li>
<li><span id="schulman2015gradient">Schulman, J., Heess, N., Weber, T., & Abbeel, P. (2015). Gradient Estimation Using Stochastic Computation Graphs. In <i>Neural Information Processing Systems</i>.</span></li>
<li><span id="tran2017deep">Tran, D., Hoffman, M. D., Saurous, R. A., Brevdo, E., Murphy, K., & Blei, D. M. (2017). Deep Probabilistic Programming. In <i>International Conference on Learning Representations</i>.</span></li>
<li><span id="tran2016edward">Tran, D., Kucukelbir, A., Dieng, A. B., Rudolph, M., Liang, D., & Blei, D. M. (2016). Edward: A library for probabilistic modeling, inference, and criticism. <i>ArXiv.org</i>.</span></li>
<li><span id="tran2017deepand">Tran, D., Ranganath, R., & Blei, D. M. (2017). Deep and Hierarchical Implicit Models. <i>ArXiv.org</i>.</span></li>
<li><span id="tran2016variational">Tran, D., Ranganath, R., & Blei, D. M. (2016). The Variational Gaussian Process. In <i>International Conference on Learning Representations</i>.</span></li>
<li><span id="tristan2014augur">Tristan, J.-B., Huang, D., Tassarotti, J., Pocock, A. C., Green, S., & Steele, G. L. (2014). Augur: Data-Parallel Probabilistic Modeling. In <i>Neural Information Processing Systems</i> (pp. 2600–2608).</span></li></ol>
Mon, 07 Aug 2017 00:00:00 -0700A Research to Engineering Workflow
http://dustintran.com/blog/a-research-to-engineering-workflow
http://dustintran.com/blog/a-research-to-engineering-workflow<p>Going from a research idea to experiments is fundamental. But this
step is typically glossed over with little explicit advice. In
academia, the graduate student is often left toiling away—fragmented
code, various notes and LaTeX write-ups scattered around.
New projects often result in entirely new code bases, and when they do
rely on past code, that code is difficult to extend properly to the new project.</p>
<p>Motivated by this, I thought it’d be useful to outline the steps I
personally take in going from research idea to experimentation, and
how that then improves my research understanding so I can revise the
idea. This process is crucial: given an initial idea, all my time is
spent on this process;
<!--(with the majority of my time specifically spent on the experiments)-->
and for me at least, the experiments are key to
learning about and solving problems that I couldn’t predict otherwise.<a href="#references"><sup>1</sup></a>
<!--More generally, I think without having a good -->
<!--handle on this process, you can sometimes lose touch with -->
<!--reality and have a hard time -->
<!--[>figuring out what open problems and/or solutions are important.<] -->
<!--recalling why the problems you're working on are important. --></p>
<!--Much of what I'll describe is what other researchers, collaborators,-->
<!--and friends I know already do. I'm hoping to make these steps -->
<!--transparent, we can review the workflow, compare it to alternatives,-->
<!--and see how it might be optimized. -->
<!--## Coming up with an Idea-->
<h2 id="finding-the-right-problem">Finding the Right Problem</h2>
<!--## A Master List of Research Ideas-->
<!--+ reading papers -->
<!--+ talking to people at conferences, workshops, etc. to see what find -->
<!-- important -->
<!--+ personal experiences (in both research/understanding and in your own-->
<!-- experiments) -->
<!--+ frequent communication with people near you, if you have the benefit-->
<!-- of like minded(or even not like minded) people as neighbors to -->
<!-- bounce ideas off of -->
<!--This is the modt open ended and often in my opinion one of the most-->
<!--challenging. It must be interesting to you, ideally ambitious with -->
<!--clear and amazing end goals, important to many people in the -->
<!--community, and with short and long term visions. -->
<!--Research is an organic process. Repositories file that research in-->
<!--discrete units. Before making a repository, it's necessary to -->
<!--decide how initial ideas might jumpstart into more official and -->
<!--formalized. -->
<!--This particular step varies widely. -->
<p>Before working on a project, it’s necessary to decide how
ideas might grow into something more official. Sometimes it’s as
simple as having a mentor suggest a project to work on; or tackling a
specific data set or applied problem; or having a conversation with a
frequent collaborator and then settling on a useful problem
to work on together. More often, I find that research is
a result of a long chain of ideas which were continually
iterated upon—through frequent conversations, recent
work, longer term readings of subjects I’m unfamiliar with
(e.g., <a class="citation" href="#pearl2000causality">Pearl (2000)</a>),
and
favorite papers I like to revisit (e.g.,
<a class="citation" href="#wainwright2008graphical">Wainwright & Jordan (2008)</a>,
<a class="citation" href="#neal1994bayesian">Neal (1994)</a>).</p>
<p><img src="/blog/assets/2017-06-03-fig0.png" alt="" />
<em><center>A master document of all my unexplored research ideas.</center></em></p>
<p>One technique I’ve found immensely helpful is to maintain a single
master document.<a href="#references"><sup>2</sup></a>
It does a few things.</p>
<p>First, it has a bulleted list of all ideas, problems, and topics that
I’d like to think more carefully about (Section 1.3 in the figure).
Sometimes they’re as high-level as “Bayesian/generative approaches to
reinforcement learning” or “addressing fairness in machine learning”;
or they’re as specific as “Inference networks to handle memory
complexity in EP” or “analysis of size-biased vs symmetric Dirichlet
priors.” I try to keep the list succinct: subsequent sections go in
depth on a particular entry (Section 2+ in the figure).</p>
<p>Second, the list of ideas is sorted according to what I’d like to work on
next. This guides me to understand the general direction of my
research beyond present work. I can continually revise my
priorities according to whether I think the direction aligns
with my broader research vision, and whether I think the direction is
impactful for the community at large.
<!-- -->
Importantly, the list isn’t just about the next publishable idea to
work on, but generally what things I’d like to learn about next. This
contributes long-term in finding important problems and arriving at
simple or novel solutions.</p>
<p>Every so often, I revisit the list, re-sorting things, adding things,
deleting things. Eventually I might elaborate upon an idea enough that
it becomes a formal paper. In general, I’ve found that this process
of iterating upon ideas within one location (and one format) makes
the transition to formal paper-writing and experiments a fluid experience.</p>
<h2 id="managing-papers">Managing Papers</h2>
<p><img src="/blog/assets/2017-06-03-fig5.png" alt="" /></p>
<p>Good research requires reading <em>a lot</em> of papers. Without a good way
of organizing your readings, you can easily get overwhelmed by the
field’s hurried pace. (These past
weeks have been especially notorious in trying to catch up on the slew
of NIPS submissions posted to arXiv.)</p>
<p>I’ve experimented with a lot of approaches to this, and ultimately
I’ve arrived at the <a href="http://papersapp.com">Papers app</a> which I highly
recommend.<sup>3</sup></p>
<p>The most fundamental utility in a good management system is a
centralized repository which can be referenced back to. The advantage
of having one location for this cannot be overstated, whether it
be 8-page conference papers, journal papers, surveys, or even textbooks.
Moreover, Papers is a nice tool for actually reading PDFs, and it
conveniently syncs across devices as I read and star things on my
tablet or laptop. As I cite papers when I write, I can go back to
Papers and get the corresponding BibTeX file and citekey.</p>
<p>I personally enjoy taking painstaking effort in organizing papers. In
the screenshot above, I have a sprawling list of topics as paper tags.
These include <code class="highlighter-rouge">applications</code>, <code class="highlighter-rouge">models</code>, and <code class="highlighter-rouge">inference</code> (each with
subtags), as well as miscellaneous topics such as
<code class="highlighter-rouge">information-theory</code> and <code class="highlighter-rouge">experimental-design</code>. An important
collection not seen in the screenshot is a tag called <code class="highlighter-rouge">research</code>,
into which I bin all papers relevant to a particular research topic.
For example, <a href="https://arxiv.org/abs/1706.00531">the PixelGAN paper</a>
presently highlighted is tagged into two topics I’ve currently been
thinking a lot about—these are sorted into <code class="highlighter-rouge">research→alignment-semi</code>
and <code class="highlighter-rouge">research→generative-images</code>.</p>
<h2 id="managing-a-project">Managing a Project</h2>
<p><img src="/blog/assets/2017-06-03-fig1.png" alt="" />
<em><center>The repository we used for a recent
<a href="https://arxiv.org/abs/1610.09037">arXiv preprint</a>.</center></em></p>
<p>I like to maintain one research project in one GitHub repository.
<!--Whatever one "unit" of research is varies. I define it as something -->
<!--relatively self-contained; for example, it might be tied to a specific-->
<!--paper, an applied data analysis, or a particular topic at hand. -->
<!-- -->
Repositories are useful not only for tracking code but also
for tracking general research progress, paper writing, and tying others
in for collaboration. How GitHub repositories are organized is a frequent pain point.
I like the following structure,
based originally on <a href="http://www.cs.columbia.edu/~blei/seminar/2016_discrete_data/notes/week_01.pdf">Dave Blei’s preferred one</a>:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- doc/
  -- 2017-nips/
    -- preamble/
    -- img/
    -- main.pdf
    -- main.tex
    -- introduction.tex
-- etc/
  -- 2017-03-25-whiteboard.jpg
  -- 2017-04-03-whiteboard.jpg
  -- 2017-04-06-dustin-comments.md
  -- 2017-04-08-dave-comments.pdf
-- src/
  -- checkpoints/
  -- codebase/
  -- log/
  -- out/
  -- script1.py
  -- script2.py
-- README.md
</code></pre></div></div>
<p><code class="highlighter-rouge">README.md</code> maintains a list of todo’s, both for myself and
collaborators. This makes it transparent how to keep moving forward
and what’s blocking the work.</p>
<p><code class="highlighter-rouge">doc/</code> contains all write-ups. Each subdirectory corresponds to a
particular conference or journal submission, with <code class="highlighter-rouge">main.tex</code> being the
primary document and individual sections written in separate files
such as <code class="highlighter-rouge">introduction.tex</code>. Keeping one section per file makes
it easy for multiple people to work on separate sections
simultaneously and avoid merge conflicts. Some people prefer to write
the full paper after major experiments are complete. I personally like to
write a paper more as a summary of the current ideas and, as with the
idea itself, it is continually revised as experiments proceed.</p>
<p><code class="highlighter-rouge">etc/</code> is a dump of everything not relevant to other directories. I
typically use it to store pictures of whiteboards during conversations
about the project. Or sometimes as I’m just going about my day-to-day,
I’m struck with a bunch of ideas and so I dump them into a Markdown
document. It’s also a convenient location to handle various
commentaries about the work, such as general feedback or paper
markups from collaborators.</p>
<p><code class="highlighter-rouge">src/</code> is where all code is written. Runnable scripts are written
directly in <code class="highlighter-rouge">src/</code>, and classes and utilities are written in
<code class="highlighter-rouge">codebase/</code>. I’ll elaborate on these next. (The other three
directories hold output from scripts, which I’ll also elaborate upon.)</p>
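<p>As a minimal sketch, the layout above can be scaffolded with a few lines of Python. The helper name <code class="highlighter-rouge">scaffold</code> and the choice of placeholder files are assumptions made for illustration; only the directory names come from the structure described here.</p>

```python
import tempfile
from pathlib import Path

# Directory and file names follow the layout described in the post.
# The helper itself is hypothetical, not part of any library.
DIRECTORIES = [
    "doc/2017-nips/preamble",
    "doc/2017-nips/img",
    "etc",
    "src/checkpoints",
    "src/codebase",
    "src/log",
    "src/out",
]
FILES = [
    "doc/2017-nips/main.tex",
    "doc/2017-nips/introduction.tex",
    "src/script1.py",
    "README.md",
]


def scaffold(root):
    """Create the project skeleton under `root`; return the created paths."""
    root = Path(root)
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for filename in FILES:
        (root / filename).touch(exist_ok=True)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


if __name__ == "__main__":
    # Build the skeleton in a throwaway directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold(tmp):
            print(path)
```

<p>Starting every project from the same skeleton keeps the conventions for write-ups, notes, and code identical across repositories.</p>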
<h2 id="writing-code">Writing Code</h2>
<p><img src="/blog/assets/2017-06-03-fig2.png" alt="" />
<!--_<center>A master document of all my unexplored research ideas.</center>_--></p>
<p>Any code I write now uses <a href="http://edwardlib.org">Edward</a>.
<!--which uses -->
<!--[Python](https://www.python.org) and [TensorFlow](https://www.tensorflow.org).-->
I find it to be the best framework for
quickly experimenting with modern probabilistic models and algorithms.
<!--[<sup>3</sup>](#references)-->
<!--developing modern probabilistic models and inference algorithms.-->
<!--, and with plug-and-play with built-in methods and pre-existing examples.-->
<!-- -->
<!--Previously I had to resort to working in fragmented code bases, where -->
<!--one language had one idea, a pre-existing code base was hacked upon to-->
<!--support certain other features, additional interface layers were -->
<!--written to get them all to communicate together... it was not good. -->
<!--Maintaining these dependencies and duplicate implemented ideas is a -->
<!--nightmare. And more importantly the code just constrains the sorts of-->
<!--ideas/experiments you'd like to do. -->
<!--With Edward, everything is just there™. --></p>
<p>On a conceptual level, Edward’s
appealing because the language explicitly follows the math: the
model’s generative process translates to specific lines of Edward
code; then the proposed algorithm translates to the next lines; etc. This
clean translation
<!--makes it easy to understand the mapping between math-->
<!--and code. And it -->
avoids future abstraction headaches when trying to extend the
code with natural research questions: for example, what if I used a different
prior, or tweaked the gradient estimator, or tried a different
neural net architecture, or applied the method on larger scale data sets?
<!-- which makes it easy to translate engineering -->
<!--ideas about sharing various components to the math, and analogously -->
<!--how to easily take tweaked math ideas and replace the corresponding -->
<!--code. --></p>
<p>On a practical level, I most benefit from Edward by building off
pre-existing model examples
(in <a href="https://github.com/blei-lab/edward/tree/master/examples"><code class="highlighter-rouge">edward/examples/</code></a> or <a href="https://github.com/blei-lab/edward/tree/master/notebooks"><code class="highlighter-rouge">edward/notebooks/</code></a>),
and then adapting it to my problem.
If I am also implementing a new
algorithm, I take a pre-existing algorithm’s source
code (in <a href="https://github.com/blei-lab/edward/tree/master/edward/inferences"><code class="highlighter-rouge">edward/inferences/</code></a>),
paste it as a new file in my research project’s <code class="highlighter-rouge">codebase/</code> directory,
and then I tweak it. This process makes it really easy to start
afresh—beginning from templates and avoiding low-level details.</p>
<p>When writing code, I always follow PEP8 (I particularly like the
<a href="https://pypi.python.org/pypi/pep8"><code class="highlighter-rouge">pep8</code></a> package), and I try
to separate individual scripts from the class and function definitions shared
across scripts; the latter is placed inside <code class="highlighter-rouge">codebase/</code> and then imported.
Maintaining code quality from the beginning is always a good
investment, and I find this process scales well as the code grows
more complicated and is worked on with others.</p>
<p><strong>On Jupyter notebooks.</strong>
Many people use <a href="http://jupyter.org">Jupyter notebooks</a>
as a method for interactive code development, and as an easy way to
embed visualizations and LaTeX. I personally haven’t found
success in integrating it into my workflow. I like to just write all
my code down in a Python script and then run the script. But I can see why
others like the interactivity.</p>
<h2 id="managing-experiments">Managing Experiments</h2>
<p><img src="/blog/assets/2017-06-03-fig3.png" alt="" /></p>
<p>Investing in a good workstation or cloud service is a must.
Features such as GPUs should basically
be a given with <a href="http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/">their wide
availability</a>,
and one should be able to run many jobs in parallel.</p>
<p>After I finish writing a script on my local computer, my typical workflow is:</p>
<ol>
<li>Run <code class="highlighter-rouge">rsync</code> to synchronize my local computer’s GitHub repository
(which includes uncommitted files) with a directory on the server.</li>
<li><code class="highlighter-rouge">ssh</code> into the server.</li>
<li>Start <code class="highlighter-rouge">tmux</code> and run the script. Among many things, <code class="highlighter-rouge">tmux</code> lets you
detach the session so you don’t have to wait for the job to finish
before interacting with the server again.</li>
</ol>
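<p>The three steps above can be sketched as a small Python helper that only builds the shell commands, without running them. The function name, host, and paths are hypothetical, and the sketch folds the <code class="highlighter-rouge">ssh</code> and <code class="highlighter-rouge">tmux</code> steps into one non-interactive call, which is one possible variant rather than the exact interactive workflow described.</p>

```python
import shlex


def workflow_commands(local_repo, host, remote_repo, script, session="exp"):
    """Build (but do not run) shell commands for the sync-and-launch steps."""
    # Step 1: mirror the working tree, uncommitted files included.
    # The trailing slash makes rsync copy the directory's contents.
    rsync = ["rsync", "-az", local_repo.rstrip("/") + "/",
             "{}:{}/".format(host, remote_repo.rstrip("/"))]

    # Steps 2 and 3 folded together: ssh in and start a detached tmux
    # session that runs the script from src/.
    inner = "cd {} && python {}".format(
        shlex.quote(remote_repo.rstrip("/") + "/src"), shlex.quote(script))
    on_server = "tmux new-session -d -s {} {}".format(
        shlex.quote(session), shlex.quote(inner))
    ssh = ["ssh", host, on_server]
    return [rsync, ssh]


if __name__ == "__main__":
    # Hypothetical paths and host name, for illustration only.
    for cmd in workflow_commands("/home/me/my-project", "user@server",
                                 "my-project", "script1.py"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

<p>Each returned list can be passed to <code class="highlighter-rouge">subprocess.run</code> once the printed commands look right.</p>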
<p>When the script is sensible, I start diving into experiments with
multiple hyperparameter configurations.
A useful tool for this is
<a href="https://docs.python.org/3/library/argparse.html"><code class="highlighter-rouge">argparse</code></a>.
It augments a Python script with command-line arguments; you
add something like the following to your script:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">argparse</span>

<span class="c1"># Expose each hyperparameter as a command-line flag with a default value.</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'--batch_size'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">int</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="mi">128</span><span class="p">,</span>
                    <span class="n">help</span><span class="o">=</span><span class="s">'Minibatch size during training'</span><span class="p">)</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'--lr'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">float</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="mf">1e-5</span><span class="p">,</span>
                    <span class="n">help</span><span class="o">=</span><span class="s">'Learning rate step-size'</span><span class="p">)</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">batch_size</span>
<span class="n">lr</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">lr</span>
</code></pre></div></div>
<p>Then you can run terminal commands such as</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python script1.py <span class="nt">--batch_size</span><span class="o">=</span>256 <span class="nt">--lr</span><span class="o">=</span>1e-4
</code></pre></div></div>
<p>This makes it easy to submit server jobs which vary these hyperparameters.</p>
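<p>As a rough sketch of varying hyperparameters, the snippet below expands a grid of values into one command line per configuration. The helper name <code class="highlighter-rouge">grid_commands</code> and the particular grid values are assumptions for illustration; how the resulting commands are submitted depends on your server or scheduler.</p>

```python
import itertools


def grid_commands(script, grid):
    """Return one command string per combination of hyperparameter values."""
    names = list(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        flags = " ".join(
            "--{}={}".format(name, value) for name, value in zip(names, values))
        commands.append("python {} {}".format(script, flags))
    return commands


if __name__ == "__main__":
    # Hypothetical grid; values are kept as strings so they print as written.
    grid = {"batch_size": ["128", "256"], "lr": ["1e-4", "1e-5"]}
    for command in grid_commands("script1.py", grid):
        print(command)
```

<p>With two values per flag, this prints four commands, each of which can be launched in its own <code class="highlighter-rouge">tmux</code> window or handed to a job scheduler.</p>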
<p>Finally, let’s talk about managing the output of experiments.
Recall the <code class="highlighter-rouge">src/</code> directory structure above:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- src/
  -- checkpoints/
  -- codebase/
  -- log/
  -- out/
  -- script1.py
  -- script2.py
</code></pre></div></div>
<p>We described the individual scripts and <code class="highlighter-rouge">codebase/</code>.
The other three directories are for organizing experiment output:</p>
<ul>
<li><code class="highlighter-rouge">checkpoints/</code> records saved model parameters during training.
Use <code class="highlighter-rouge">tf.train.Saver</code> to save parameters every
fixed number of iterations as the algorithm runs. This helps with running long experiments, where you
might want to cut the experiment short and later restore the
parameters. Each experiment outputs a subdirectory in <code class="highlighter-rouge">checkpoints/</code>
named with the convention
<code class="highlighter-rouge">20170524_192314_batch_size_25_lr_1e-4/</code>. The first
number is the date (<code class="highlighter-rouge">YYYYMMDD</code>); the second is the time (<code class="highlighter-rouge">HHMMSS</code>);
and the rest is the hyperparameters.</li>
<li><code class="highlighter-rouge">log/</code> records logs for visualizing learning.
Each experiment belongs in a subdirectory with the same
convention as <code class="highlighter-rouge">checkpoints/</code>.
One benefit of Edward is that for logging, you can simply pass an
argument as <code class="highlighter-rouge">inference.initialize(logdir='log/' + subdir)</code>.
Default TensorFlow summaries are tracked which can then be
visualized using TensorBoard (more on this next).</li>
<li><code class="highlighter-rouge">out/</code> records exploratory output after training finishes; for example,
generated images or matplotlib plots.
Each experiment belongs in a subdirectory with the same convention
as <code class="highlighter-rouge">checkpoints/</code>.</li>
</ul>
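<p>The subdirectory naming convention can be generated rather than typed by hand. The following is a minimal sketch; the function name <code class="highlighter-rouge">experiment_subdir</code> is hypothetical, and hyperparameter values are formatted with <code class="highlighter-rouge">str</code>, so pass a string such as <code class="highlighter-rouge">'1e-4'</code> if you want that exact spelling in the name.</p>

```python
from datetime import datetime


def experiment_subdir(hyperparameters, now=None):
    """Return a name like 20170524_192314_batch_size_25_lr_1e-4."""
    now = now or datetime.now()
    # Date (YYYYMMDD) and time (HHMMSS), separated by an underscore.
    stamp = now.strftime("%Y%m%d_%H%M%S")
    # Append each hyperparameter as name_value, in insertion order.
    parts = ["{}_{}".format(name, value)
             for name, value in hyperparameters.items()]
    return "_".join([stamp] + parts)


if __name__ == "__main__":
    subdir = experiment_subdir({"batch_size": 25, "lr": "1e-4"},
                               now=datetime(2017, 5, 24, 19, 23, 14))
    # The same name is reused under all three output directories.
    for parent in ("checkpoints", "log", "out"):
        print("{}/{}/".format(parent, subdir))
```

<p>Computing the name once at the top of a script and reusing it for all three directories keeps checkpoints, logs, and exploratory output for the same run easy to match up.</p>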
<p><strong>On data sets.</strong>
Data sets are used across many research projects. I prefer storing them
in the home directory <code class="highlighter-rouge">~/data</code>.</p>
<p><strong>On software containers.</strong>
<a href="http://python-guide-pt-br.readthedocs.io/en/latest/dev/virtualenvs/">virtualenv</a>
is a must for managing Python dependencies and avoiding difficulties
with system-wide Python installs. It’s particularly nice if you like
to write Python 2/3-agnostic code.
<a href="https://www.docker.com">Docker containers</a> are an even more powerful
tool if you require more from your setup.</p>
<h2 id="exploration-debugging--diagnostics">Exploration, Debugging, & Diagnostics</h2>
<p><img src="/blog/assets/2017-06-03-fig4.png" alt="" />
<!--_<center>Picture thanks to -->
<!--<a href="https://github.com/blei-lab/edward/pull/653#issuecomment-304728311">-->
<!--Sean Kruzel</a>.</center>_ --></p>
<p><a href="https://www.tensorflow.org/get_started/summaries_and_tensorboard">TensorBoard</a>
is an excellent tool for visualizing and exploring your model
training. With TensorBoard’s interactivity, I find it
particularly convenient in that I don’t have to configure a bunch of
matplotlib functions to understand training. One only needs to percolate a
bunch of <code class="highlighter-rouge">tf.summary</code>s on tensors in the code.</p>
<p>Edward logs a bunch of summaries by default in order to visualize how
loss function values, gradients, and parameters change across
training iterations.
TensorBoard also includes wall time comparisons, and
a sufficiently decorated TensorFlow code base provides a nice
computational graph you can stare at.
For nuanced issues I can’t diagnose with TensorBoard specifically, I
just output things in the <code class="highlighter-rouge">out/</code> directory and inspect those results.</p>
<p><strong>Debugging error messages.</strong>
My debugging workflow is terrible. I
percolate print statements across my code and
find errors by
process of
elimination. This is primitive. Although I haven’t tried it, I
hear good things about
<a href="https://www.tensorflow.org/programmers_guide/debugger">TensorFlow’s debugger</a>.</p>
<h2 id="improving-research-understanding">Improving Research Understanding</h2>
<p>Interrogating your model, algorithm, and generally the learning
process lets you better understand your work’s success and failure
modes. This lets you go back to the drawing board, thinking deeply
about the method and how it might be further improved.
As the method indicates success, one can go
from tackling simple toy configurations to increasingly large
scale and high-dimensional problems.</p>
<p>From a higher level, this workflow is really about implementing the
scientific method in the real world. No major ideas are necessarily
discarded at each iteration of the experimental process, but rather,
as in the ideal of science, you start with fundamentals and
iteratively expand upon them as you have a stronger grasp of reality.</p>
<p>Experiments aren’t alone in this process either. Collaboration,
communicating with experts from other fields, reading papers, working
on both short and longer term ideas, and attending talks and
conferences help broaden your perspective in finding the right
problems and solving them.</p>
<h2 id="footnotes--references">Footnotes & References</h2>
<p><sup>1</sup> This workflow is specifically for empirical research.
Theory is a whole other can of worms, but some of these ideas
still generalize.</p>
<p><sup>2</sup>
The template for the master document is available
<a href="https://github.com/dustinvtran/latex-templates"><code class="highlighter-rouge">here</code></a>.</p>
<p><sup>3</sup>
There’s one caveat to Papers. I use it for everything: there are at
least 2,000 papers stored in my account, along with quite a few dense
textbooks. The application sifts through at least half a dozen
gigabytes, and so it suffers from a few hiccups when
reading/referencing back across many papers. I’m not sure if this is a
bug or just inherent to me exploiting Papers almost <em>too</em> much.</p>
<!--<sup>3</sup> -->
<!--Disclaimer: I wrote most of Edward. -->
<!--I personally benefit from the fact that if -->
<!--something is missing in Edward I can easily add it. -->
<!--[But of course you can (and should) add things too.](http://edwardlib.org/contributing)-->
<ol class="bibliography"><li><span id="neal1994bayesian">Neal, R. M. (1994). <i>Bayesian Learning for Neural Networks</i> (PhD thesis). University of Toronto.</span></li>
<li><span id="pearl2000causality">Pearl, J. (2000). <i>Causality</i>. Cambridge University Press.</span></li>
<li><span id="wainwright2008graphical">Wainwright, M. J., & Jordan, M. I. (2008). Graphical Models, Exponential Families, and Variational Inference. <i>Foundations and Trends in Machine Learning</i>, <i>1</i>(1–2), 1–305.</span></li></ol>
<!--__Failed ideas.__ -->
<!--They go back into master document. Or if there's a large collection of-->
<!--perpheral stuff, they remain as Github repos, but I personally store -->
<!--them in an `archives/` folder. They're put on hold, and I might-->
<!--revisit them over the years as I spark up new ideas. -->
<!--__Deployment to Larger Scales.__ -->
<!--your mileage will vary, given how you specifically deploy things.-->
<!--different clusters or machines. analysis of how to start placing -->
<!--device configurations in the code. -->
<!--+ device configurations -->
<!--+ pretrained models -->
<!--+ xla -->
<!--+ file readers -->
<!--+ distributed tensorflow stuff and data management systems -->
<!--+ redditors and new ml researchers would love it -->
<!--+ engineers and non ml experts could read it to understand where we -->
<!-- come from and how we work, so maybe -->
<!--other things i might mention -->
<!--+ note how write-up and formalism of idea can come before or after -->
<!-- first iteration of the workflow, depending on whether or kot a forst-->
<!-- iteration is needed to pass a dummy test of if the idea makes sense -->
<!--+ the different ways that you might do research -->
<!--+ my personal day to day on how much time i spend reading (relevant -->
<!-- papers to current research, other papers for breadth of knowledge, -->
<!-- long term understanding of new subjects), research of -->
<!-- thinking/writing, coding, meetingd. and also personal management of -->
<!-- ongoing work, such as maitaining and developing edward, and how many-->
<!-- independent projects to tackle at once -->
<!--+ code expansion, management of quality, as it grows bigger and bigger-->
Sat, 03 Jun 2017 00:00:00 -0700ICML 2017 Workshop on Implicit Models
http://dustintran.com/blog/implicit-models-workshop
<p><a href="http://www.cs.columbia.edu/~blei/">David Blei</a>,
<a href="http://www.iangoodfellow.com">Ian Goodfellow</a>,
<a href="http://www.gatsby.ucl.ac.uk/~balaji/">Balaji Lakshminarayanan</a>,
<a href="http://shakirm.com">Shakir Mohamed</a>,
<a href="https://www.cs.princeton.edu/~rajeshr/">Rajesh Ranganath</a>,
and I are organizing a workshop at ICML this year, titled
“Implicit Models”.</p>
<p>Workshop URL: <a href="https://sites.google.com/view/implicitmodels/">https://sites.google.com/view/implicitmodels/</a></p>
<p>Leveraging this recent and highly impactful topic, I’m personally
excited to see how we might foster discussion across communities. (See
the <a href="https://sites.google.com/view/implicitmodels/bibliography">bibliography
page</a>
for detailed references.)</p>
<p>The deadline for paper submissions (including travel awards) is June 30, 2017.</p>
<h2 id="call-for-papers">Call For Papers</h2>
<p>Probabilistic models are an important tool in machine learning. They
form the basis for models that generate realistic data, uncover hidden
structure, and make predictions. Traditionally, probabilistic models
in machine learning have focused on prescribed models. Prescribed
models specify a joint density over observed and hidden variables that
can be easily evaluated. The requirement of a tractable density
simplifies their learning but limits their flexibility — several
real world phenomena are better described by simulators that do not
admit a tractable density. Probabilistic models defined only via the
simulations they produce are called implicit models.</p>
<p>Arguably starting with generative adversarial networks, research on
implicit models in machine learning has exploded in recent years. This
workshop’s aim is to foster a discussion around the recent
developments, commonalities among applications, and future directions
of implicit models.</p>
<p>We invite submission of papers for poster and short oral
presentations. Topics of interest include but are not limited to:</p>
<ul>
<li>implicit models,</li>
<li>generative adversarial networks,</li>
<li>adversarial training,</li>
<li>variational inference with implicit approximations,</li>
<li>approximate Bayesian computation,</li>
<li>likelihood free inference,</li>
<li>two sample testing and density ratio estimation,</li>
<li>theory,</li>
<li>evaluation and</li>
<li>applications of implicit models.</li>
</ul>
<p><strong>Key Dates:</strong></p>
<ul>
<li>June 30, 2017: Submission and Travel Award Deadline</li>
<li>July 14, 2017: Acceptance and Travel Award Notification</li>
<li>Aug 1, 2017: Final papers due</li>
<li>Aug 10, 2017: Workshop date</li>
</ul>
<p><strong>Submission Instructions:</strong>
Researchers interested in contributing should upload a short paper of
4 pages in PDF format by June 30, 2017, 11:59pm (time zone of your
choice) to the submission web site
https://sites.google.com/view/implicitmodels/submissions. References
and supplementary material can exceed 4 pages.</p>
<p>Authors should use the ICML style file. Submissions don’t need to be
anonymized. The workshop allows submissions of papers that are under
review or have been recently published in a conference or a journal.
Authors should state any overlapping published work at time of
submission.</p>
<p>All submissions will be reviewed and will be evaluated on the basis of
their technical content. Accepted papers will be selected for either a
short oral presentation or spotlight presentation, in addition to
poster presentation.</p>
<p>If you have any questions, please contact us at implicitmodels2017@gmail.com.</p>
<p><strong>Confirmed Speakers:</strong></p>
<ul>
<li>Sanjeev Arora (Princeton)</li>
<li>Stefano Ermon (Stanford)</li>
<li>Qiang Liu (Dartmouth)</li>
<li>Kerrie Mengersen (Queensland University of Technology)</li>
<li>Dougal Sutherland (UCL)</li>
</ul>
Fri, 02 Jun 2017 00:00:00 -0700Deep and Hierarchical Implicit Models
http://dustintran.com/blog/deep-and-hierarchical-implicit-models
<p>I’m excited to announce a paper that Rajesh Ranganath, Dave Blei, and
I released today on arXiv, titled
<a href="https://arxiv.org/abs/1702.08896">Deep and Hierarchical Implicit Models</a>.</p>
<p>Implicit probabilistic models are all about sampling as a primitive:
they define a process to simulate data and do not require tractable
densities
(<a class="citation" href="#diggle1984monte">Diggle & Gratton (1984)</a>,
<a class="citation" href="#hartig2011statistical">Hartig, Calabrese, Reineking, Wiegand, & Huth (2011)</a>).
We leverage this fundamental idea to develop new classes of
models: they encompass simulators in the scientific communities,
generative adversarial networks
<a class="citation" href="#goodfellow2014generative">(Goodfellow et al., 2014)</a>,
and deep generative models such as sigmoid
belief nets
<a class="citation" href="#neal1990learning">(Neal, 1990)</a>
and deep latent Gaussian models
(<a class="citation" href="#rezende2014stochastic">Rezende, Mohamed, & Wierstra (2014)</a>,
<a class="citation" href="#kingma2014autoencoding">Kingma & Welling (2014)</a>).
These modeling developments could not really be done without
inference, and we develop a variational inference algorithm that
underpins them all.</p>
<p>Biased as I am, I think this is quite a dense paper—chock full of
simple ideas that are rife with deep implications. There are many
nuggets of wisdom that I could ramble on about, and I just might in
separate blog posts.</p>
<p>As a practical example, we show how you can take any standard neural
network and turn it into a deep implicit model: simply inject noise
into the hidden layers. The hidden units in these layers are now
interpreted as latent variables. Further, the induced latent variables
are astonishingly flexible, going beyond Gaussians (or exponential
families
<a class="citation" href="#ranganath2015deep">(Ranganath, Tang, Charlin, & Blei, 2015)</a>)
to arbitrary probability distributions. Deep generative modeling could
not be any simpler!</p>
<p>Here’s a 2-layer deep implicit model in <a href="http://edwardlib.org">Edward</a>.
It defines the generative process,</p>
<script type="math/tex; mode=display">\begin{aligned}
\mathbf{z}_{n,2} = g_2(\mathbf{\epsilon}_{n,2}),\qquad
\mathbf{\epsilon}_{n, 2} \sim \text{Normal}(0, 1), \\
\mathbf{z}_{n,1} = g_1(\mathbf{\epsilon}_{n,1}\mid\mathbf{z}_{n,2}),\qquad
\mathbf{\epsilon}_{n, 1} \sim \text{Normal}(0, 1), \\
\mathbf{x}_{n} = g_0(\mathbf{\epsilon}_{n,0}\mid\mathbf{z}_{n,1}),\qquad
\mathbf{\epsilon}_{n, 0} \sim \text{Normal}(0, 1).
\end{aligned}</script>
<p>This generates layers of latent variables <script type="math/tex">\mathbf{z}_{n,1}</script>, <script type="math/tex">\mathbf{z}_{n,2}</script> and data <script type="math/tex">\mathbf{x}_{n}</script> via functions of noise <script type="math/tex">\mathbf{\epsilon}</script>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">from</span> <span class="nn">edward.models</span> <span class="kn">import</span> <span class="n">Normal</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span>
<span class="n">N</span> <span class="o">=</span> <span class="mi">55000</span> <span class="c1"># number of data points
</span><span class="n">d</span> <span class="o">=</span> <span class="mi">100</span> <span class="c1"># noise dimensionality
</span>
<span class="c1"># random noise is Normal(0, 1)
</span><span class="n">eps2</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">zeros</span><span class="p">([</span><span class="n">N</span><span class="p">,</span> <span class="n">d</span><span class="p">]),</span> <span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">([</span><span class="n">N</span><span class="p">,</span> <span class="n">d</span><span class="p">]))</span>
<span class="n">eps1</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">zeros</span><span class="p">([</span><span class="n">N</span><span class="p">,</span> <span class="n">d</span><span class="p">]),</span> <span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">([</span><span class="n">N</span><span class="p">,</span> <span class="n">d</span><span class="p">]))</span>
<span class="n">eps0</span> <span class="o">=</span> <span class="n">Normal</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">zeros</span><span class="p">([</span><span class="n">N</span><span class="p">,</span> <span class="n">d</span><span class="p">]),</span> <span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">([</span><span class="n">N</span><span class="p">,</span> <span class="n">d</span><span class="p">]))</span>
<span class="c1"># alternate latent layers z with hidden layers h
</span><span class="n">z2</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)(</span><span class="n">eps2</span><span class="p">)</span>
<span class="n">h2</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)(</span><span class="n">z2</span><span class="p">)</span>
<span class="n">z1</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)(</span><span class="n">tf</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">eps1</span><span class="p">,</span> <span class="n">h2</span><span class="p">],</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">h1</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)(</span><span class="n">z1</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="bp">None</span><span class="p">)(</span><span class="n">tf</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">eps0</span><span class="p">,</span> <span class="n">h1</span><span class="p">],</span> <span class="mi">1</span><span class="p">))</span>
</code></pre></div></div>
<p>The model uses Keras, where <code class="highlighter-rouge">Dense(128)(x)</code> denotes a fully connected
layer with <script type="math/tex">128</script> hidden units applied to input <code class="highlighter-rouge">x</code>. To define a
stochastic layer, we concatenate noise with the previous layer. The
model alternates between stochastic and deterministic layers to
generate data points <script type="math/tex">\mathbf{x}_n\in\mathbb{R}^{10}</script>.</p>
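<p>To make the generative process concrete without Edward or Keras, here is a minimal NumPy sketch of forward sampling from the same architecture. It is an illustration under assumptions, not the paper's code: the helper names <code class="highlighter-rouge">dense</code> and <code class="highlighter-rouge">sample_implicit_model</code> are hypothetical, and the weights are random and untrained, so the samples only show how noise flows through the layers.</p>

```python
import numpy as np


def dense(inputs, units, rng, activation=None):
    """Fully connected layer with random, untrained weights (illustration only)."""
    in_dim = inputs.shape[1]
    w = rng.normal(scale=1.0 / np.sqrt(in_dim), size=(in_dim, units))
    out = inputs @ w
    return np.maximum(out, 0.0) if activation == "relu" else out


def sample_implicit_model(n, d=100, hidden=128, x_dim=10, seed=0):
    """Simulate n data points by pushing Normal(0, 1) noise through the layers."""
    rng = np.random.default_rng(seed)
    # One independent noise source per stochastic layer.
    eps2, eps1, eps0 = (rng.standard_normal((n, d)) for _ in range(3))
    z2 = dense(eps2, hidden, rng, "relu")
    h2 = dense(z2, hidden, rng, "relu")
    # A stochastic layer concatenates fresh noise with the previous layer.
    z1 = dense(np.concatenate([eps1, h2], axis=1), hidden, rng, "relu")
    h1 = dense(z1, hidden, rng, "relu")
    x = dense(np.concatenate([eps0, h1], axis=1), x_dim, rng)
    return z2, z1, x


z2, z1, x = sample_implicit_model(n=5)
print(z2.shape, z1.shape, x.shape)
```

<p>Note that nothing here evaluates a density: the model is specified entirely by its sampling procedure, which is what makes it implicit.</p>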
<p>Check out the paper for how you can work with, or even interpret, such a model.</p>
<p>EDIT (2017/03/02): The algorithm is now <a href="https://github.com/blei-lab/edward/pull/491">merged into Edward</a>.</p>
<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="diggle1984monte">Diggle, P. J., & Gratton, R. J. (1984). Monte Carlo methods of inference for implicit statistical models. <i>Journal of the Royal Statistical Society Series B</i>.</span></li>
<li><span id="goodfellow2014generative">Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative Adversarial Nets. In <i>Neural Information Processing Systems</i>.</span></li>
<li><span id="hartig2011statistical">Hartig, F., Calabrese, J. M., Reineking, B., Wiegand, T., & Huth, A. (2011). Statistical inference for stochastic simulation models - theory and application. <i>Ecology Letters</i>, <i>14</i>(8), 816–827.</span></li>
<li><span id="kingma2014autoencoding">Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. In <i>International Conference on Learning Representations</i>.</span></li>
<li><span id="neal1990learning">Neal, R. M. (1990). <i>Learning Stochastic Feedforward Networks</i>.</span></li>
<li><span id="ranganath2015deep">Ranganath, R., Tang, L., Charlin, L., & Blei, D. M. (2015). Deep Exponential Families. In <i>Artificial Intelligence and Statistics</i>.</span></li>
<li><span id="rezende2014stochastic">Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In <i>International Conference on Machine Learning</i>.</span></li></ol>
Tue, 28 Feb 2017 00:00:00 -0800Video resources for machine learning (an update)
http://dustintran.com/blog/video-resources-for-machine-learning-update
http://dustintran.com/blog/video-resources-for-machine-learning-update<p>Last year I shared my collection of <a href="/blog/blog/video-resources-for-machine-learning">video resources for
machine learning</a>. (I attribute a significant portion of my education to these videos.)</p>
<p>Taking individual e-mail requests and updating a
time-stamped blog post became unwieldy. So now it’s a GitHub repo. <a href="https://github.com/dustinvtran/ml-videos">Enjoy.</a></p>
Tue, 24 Jan 2017 00:00:00 -0800On Model Mismatch and Bayesian Analysis
http://dustintran.com/blog/on-model-mismatch-and-bayesian-analysis
http://dustintran.com/blog/on-model-mismatch-and-bayesian-analysis<p>One aspect I always enjoy about machine learning is that questions
often go back to the basics. The field essentially goes into an
existential crisis every dozen years—rethinking our tools and asking foundational questions
such as “why neural networks” or “why generative models”.<sup>1</sup></p>
<p>This was a theme in my conversations during
<a href="https://nips.cc/Conferences/2016">NIPS 2016</a> last week, where a
frequent topic was
the advantages of a Bayesian perspective on machine learning.
Not surprisingly, this appeared as a big discussion point during the
panel at the <a href="http://bayesiandeeplearning.org">Bayesian deep learning
workshop</a>, where many
panelists were conciliatory toward the use of non-Bayesian approaches.
(Granted, much of it was Neil trolling them to admit when non-Bayesian
approaches worked better in practice.)</p>
<p>One argument against Bayesian analysis went as follows:</p>
<blockquote>
<p>While Bayesian inference can capture uncertainty about parameters,
it relies on the model being correctly specified. However, in
practice, all models are wrong. And in fact, this model mismatch can
often be large enough that we should be more concerned with
calibrating our inferences to correct for the mismatch than with
producing uncertainty estimates from incorrect assumptions.</p>
</blockquote>
<p>A related complaint concerned the separation of model and
inference, a philosophical point commonly associated with Bayesians:</p>
<blockquote>
<p>While in principle it is nice that we can build models separate from
our choice of inference, we often need to combine the two in practice. (The
naming of the popular model-inference classes “variational
auto-encoders” <a class="citation" href="#kingma2014autoencoding">(Kingma & Welling, 2014)</a>
and “generative adversarial networks” <a class="citation" href="#goodfellow2014generative">(Goodfellow et al., 2014)</a> is one
example.) That is, we often choose our model based on what we know
enables fast inferences, or we select hyperparameters in our model
from data. This goes against the Bayesian paradigm.</p>
</blockquote>
<p>First, I’d like to say immediately that I think interpreting Bayesian
analysis as a two-step procedure of setting up a probability model,
then performing posterior inference, is outdated. Certainly this was the
prevailing perspective back in the 1980s and 1990s, when Markov chain Monte Carlo
was first popularized, and when statisticians started to take Bayesian analysis
more seriously <a class="citation" href="#robert2011short">(Robert & Casella, 2011)</a>.</p>
<p>Quoting <a class="citation" href="#gelman2012philosophy">Gelman & Shalizi (2012)</a>
who summarize this perspective,
“The expression <script type="math/tex">p(\theta\mid y)</script> says it all, and the central goal of Bayesian inference is computing the posterior probabilities of hypotheses. Anything not contained in the posterior distribution <script type="math/tex">p(\theta\mid y)</script> is simply irrelevant, and it would be irrational (or incoherent) to attempt falsification, unless that somehow shows up in the posterior.”</p>
<p><strong>Like many statisticians before me</strong>
(e.g., <a class="citation" href="#box1980sampling">Box (1980)</a>,
<a class="citation" href="#good1983good">Good (1983)</a>,
<a class="citation" href="#rubin1984bayesianly">Rubin (1984)</a>,
<a class="citation" href="#jaynes2003probability">Jaynes (2003)</a>),
<strong>I believe this perspective is wrong. Bayesian analysis is no
different in its testing and falsification of models than any other
inferential paradigm</strong>
(<a class="citation" href="#fisher1925statistical">Fisher (1925)</a>,
<a class="citation" href="#neyman1933on">Neyman & Pearson (1933)</a>).</p>
<p>An important third step to all empirical analyses is <em>model criticism</em>
(<a class="citation" href="#box1980sampling">Box (1980)</a>,
<a class="citation" href="#ohagan2011hsss">O’Hagan (2001)</a>),
also known as model
validation, or model
checking and diagnostics
(<a class="citation" href="#rubin1984bayesianly">Rubin (1984)</a>,
<a class="citation" href="#meng1994posterior">Meng (1994)</a>,
<a class="citation" href="#gelman1996posterior">Gelman, Meng, & Stern (1996)</a>).
In criticizing our models after inference, we can either justify
use of the model or find directions in which we can revise the model.
By revising the model, we go back to the modeling step, thus forming
a loop, called <em>Box’s loop</em>
(<a class="citation" href="#box1976science">Box (1976)</a>,
<a class="citation" href="#blei2014build">Blei (2014)</a>,
<a class="citation" href="#gelman2013bayesian">Gelman et al. (2013)</a>).
<sup>2</sup></p>
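<p>As a concrete illustration of the criticism step, here is a minimal sketch of a posterior predictive check in NumPy. Everything specific is an assumption made for the example: the model (a Normal likelihood with known noise scale and a Normal prior on the mean), the function name <code class="highlighter-rouge">posterior_predictive_pvalue</code>, and the choice of discrepancy. The idea is to simulate replicated data sets from the fitted model and ask how often they look as extreme as the observed data.</p>

```python
import numpy as np


def posterior_predictive_pvalue(y, discrepancy, prior_sd=10.0, noise_sd=1.0,
                                n_rep=2000, seed=0):
    """Posterior predictive p-value for an assumed conjugate Normal model.

    Assumed model: mu ~ Normal(0, prior_sd), y_i ~ Normal(mu, noise_sd).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    # Inference: closed-form posterior over mu (conjugate update).
    post_prec = 1.0 / prior_sd**2 + n / noise_sd**2
    post_mean = (np.sum(y) / noise_sd**2) / post_prec
    post_sd = np.sqrt(1.0 / post_prec)
    # Criticism: replicate data sets from the posterior predictive.
    mu = rng.normal(post_mean, post_sd, size=n_rep)
    y_rep = rng.normal(mu[:, None], noise_sd, size=(n_rep, n))
    t_rep = np.array([discrepancy(row) for row in y_rep])
    # Fraction of replications at least as extreme as the observed data.
    return float(np.mean(t_rep >= discrepancy(y)))


def max_abs(values):
    """Discrepancy that is sensitive to outliers and heavy tails."""
    return np.max(np.abs(values))


# Data with one gross outlier, which the Normal model cannot explain.
y_outlier = np.concatenate([np.zeros(99), [50.0]])
print(posterior_predictive_pvalue(y_outlier, max_abs))
```

<p>A p-value near 0 or 1 flags a direction in which the model fails, which is the cue to go back to the modeling step, for example by swapping in a heavier-tailed likelihood.</p>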
<p>From my perspective, this solves the perceived problem of conflating
model and inference, whether it be to address model mismatch or to
build the model from previous inferences or data.
That is, while
posterior inference is simply a mechanical step of calculating a
conditional distribution, the
step of model criticism is about the relevance of the model to future
data—to put it in statistical terms, the relevance of the model with
respect to a population distribution
<a class="citation" href="#wasserman2006frequentist">(Wasserman, 2006)</a>.
As with data, the model is
just a source of information, and posterior inference simply aggregates these
two sources of information. Thus it
makes sense that as we better understand properties of the data, we
can revise our information to better formulate a model of it
<a class="citation" href="#tukey1977exploratory">(Tukey, 1977)</a>.</p>
<p>This might sound like an awkward way to shoehorn Bayesian analysis into
mimicking frequentist properties, or no different from combining model
and inference from the get-go.
However, this loop is fundamental because it still emphasizes the
importance of separating the two. We can continue to form
hypothetico-deductive analyses—namely, a falsificationist view of
the world where components of model, inference, and criticism
interact—while still incorporating posterior probabilities.</p>
<p>For more details, I highly recommend
<a class="citation" href="#gelman2012philosophy">Gelman & Shalizi (2012)</a>
and of course the classic,
<a class="citation" href="#rubin1984bayesianly">Rubin (1984)</a>.</p>
<p><sup>1</sup>
I take an optimistic view of the trend of cycling among tools for
machine learning. The trend is based on what works best empirically,
and I think that’s important.</p>
<p><sup>2</sup>
As a plug, I should also mention that this is what <a href="http://edwardlib.org">Edward</a> is all about.</p>
<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="blei2014build">Blei, D. M. (2014). Build, compute, critique, repeat: Data analysis with latent variable models. <i>Annual Review of Statistics and Its Application</i>.</span></li>
<li><span id="box1980sampling">Box, G. E. P. (1980). Sampling and Bayes’ inference in scientific modelling and robustness. <i>Journal of the Royal Statistical Society. Series A. General</i>, <i>143</i>(4), 383–430.</span></li>
<li><span id="box1976science">Box, G. E. P. (1976). Science and statistics. <i>Journal of the American Statistical Association</i>, <i>71</i>(356), 791–799.</span></li>
<li><span id="fisher1925statistical">Fisher, R. A. (1925). <i>Statistical Methods for Research Workers</i>. Genesis Publishing Pvt Ltd.</span></li>
<li><span id="gelman2013bayesian">Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). <i>Bayesian data analysis</i> (Third). CRC Press, Boca Raton, FL.</span></li>
<li><span id="gelman1996posterior">Gelman, A., Meng, X.-L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. <i>Statistica Sinica</i>.</span></li>
<li><span id="gelman2012philosophy">Gelman, A., & Shalizi, C. R. (2012). Philosophy and the practice of Bayesian statistics. <i>British Journal of Mathematical and Statistical Psychology</i>, <i>66</i>(1), 8–38.</span></li>
<li><span id="good1983good">Good, I. J. (1983). <i>Good thinking: The foundations of probability and its applications</i>. U of Minnesota Press.</span></li>
<li><span id="goodfellow2014generative">Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative Adversarial Nets. In <i>Neural Information Processing Systems</i>.</span></li>
<li><span id="jaynes2003probability">Jaynes, E. T. (2003). Probability theory: The logic of science. Washington University St. Louis, MO.</span></li>
<li><span id="kingma2014autoencoding">Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. In <i>International Conference on Learning Representations</i>.</span></li>
<li><span id="meng1994posterior">Meng, X.-L. (1994). Posterior predictive p-values. <i>The Annals of Statistics</i>.</span></li>
<li><span id="neyman1933on">Neyman, J., & Pearson, E. S. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. <i>Philosophical Transactions of the Royal Society A Mathematical, Physical and Engineering Sciences</i>, <i>231</i>, 289–337.</span></li>
<li><span id="ohagan2011hsss">O’Hagan, A. (2001). <i>HSSS model criticism</i>. University of Sheffield, Department of Probability and Statistics.</span></li>
<li><span id="robert2011short">Robert, C., & Casella, G. (2011). A short history of Markov Chain Monte Carlo: subjective recollections from incomplete data. <i>Statistical Science</i>.</span></li>
<li><span id="rubin1984bayesianly">Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. <i>The Annals of Statistics</i>, <i>12</i>(4), 1151–1172.</span></li>
<li><span id="tukey1977exploratory">Tukey, J. W. (1977). Exploratory data analysis.</span></li>
<li><span id="wasserman2006frequentist">Wasserman, L. (2006). Frequentist Bayes is objective (comment on articles by Berger and by Goldstein). <i>Bayesian Analysis</i>, <i>1</i>(3), 451–456.</span></li></ol>
Tue, 13 Dec 2016 00:00:00 -0800Two papers released on arXiv, "Operator Variational Inference" and "Model Criticism for Bayesian Causal Inference"
http://dustintran.com/blog/two-papers-released-on-arxiv
http://dustintran.com/blog/two-papers-released-on-arxiv<p>Two papers of mine were released today on arXiv.</p>
<ul>
<li><a href="https://arxiv.org/abs/1610.09033">Operator variational inference</a>, in collaboration with Rajesh Ranganath, Jaan Altosaar, and David Blei.</li>
<li><a href="https://arxiv.org/abs/1610.09037">Model criticism for Bayesian causal inference</a>, in collaboration with Francisco Ruiz, Susan Athey, and David Blei.</li>
</ul>
<p>Last week, I gave a talk at OpenAI on operator variational
inference and Edward. I can now release those
<a href="http://dustintran.com/talks/Tran_Operator_Edward.pdf">slides online</a>.</p>
<h2 id="operator-variational-inference">Operator variational inference</h2>
<p>Operator VI is a paper I’m really excited about. It is at
<a href="https://nips.cc">NIPS</a> this year. Most directly, it’s a continuation of work that
Rajesh and I have been developing toward more expressive
approximations for variational inference. We’ve seen this with the
variational Gaussian process <a class="citation" href="#tran2016variational">(Tran, Ranganath, & Blei, 2016)</a> and hierarchical variational models <a class="citation" href="#ranganath2016hierarchical">(Ranganath, Tran, & Blei, 2016)</a> (and if you’ve
read my older work, copula variational inference
<a class="citation" href="#tran2015copula">(Tran, Blei, & Airoldi, 2015)</a>).</p>
<p>More generally, in variational inference, we always make tradeoffs
between the statistical efficiency of the approximation and the
computational complexity of the algorithm. (This is partly what Andrew
Gelman calls the “efficiency frontier”.) However, we don’t quite have
a knob for controlling this tradeoff, nor do we have a way of even
formalizing these notions.</p>
<p>Operator VI is a proposed solution to this problem. It formalizes
these tradeoffs, and it analyzes how we can characterize different
approaches to variational inference in order to achieve specific
aims. As one example, we show how to develop the most expressive
posterior approximations, which we call “variational programs”.
Variational programs do not require a tractable density, and they bring
variational inference closer to powerful inferential techniques as in
generative adversarial networks
<a class="citation" href="#goodfellow2014generative">(Goodfellow et al., 2014)</a>.</p>
<h2 id="model-criticism-for-bayesian-causal-inference">Model criticism for Bayesian causal inference</h2>
<p>To me, causal inference is one of the most interesting fields in statistics
and machine learning, and the one with the greatest potential for long-term impact.
It can significantly speed up progress towards something like
artificial general intelligence (and is arguably necessary to achieve it). And most immediately, it enables richer
data analyses to capture scientific phenomena. In order for our models
to truly infer generative processes, they must understand and learn
causal notions of the world.</p>
<p>Much of the work in the causal inference community has focused on
nonparametric models, which make few modeling assumptions. They
satisfy theoretical notions such as asymptotics and can perform well on
small-to-medium size data sets (a typical setting in applied
causal inference). However, in higher-dimensional and massive data
settings, we require more complex generative models,
as we’ve seen in probabilistic machine learning.</p>
<p>There’s a caveat to this. Before being able to build rich, complex (and possibly deep)
causal models, we first need a way of evaluating them. This arXiv
paper addresses that issue. It is a foundational question
more generally in the area of model criticism, also known as model
checking and diagnostics. We ask the question, “To what extent
is my model falsified by the empirical data?” By answering it, we can
probe different assumptions in our model and possibly revise them,
thus better capturing causal mechanisms.</p>
<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="goodfellow2014generative">Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. In <i>Neural Information Processing Systems</i>.</span></li>
<li><span id="ranganath2016hierarchical">Ranganath, R., Tran, D., & Blei, D. M. (2016). Hierarchical variational models. In <i>International Conference on Machine Learning</i>.</span></li>
<li><span id="tran2015copula">Tran, D., Blei, D. M., & Airoldi, E. M. (2015). Copula variational inference. In <i>Neural Information Processing Systems</i>.</span></li>
<li><span id="tran2016variational">Tran, D., Ranganath, R., & Blei, D. M. (2016). The variational Gaussian process. In <i>International Conference on Learning Representations</i>.</span></li></ol>
Sun, 30 Oct 2016 00:00:00 -0700NIPS 2016 Workshop on Approximate Inference
http://dustintran.com/blog/nips-2016-workshop-on-approximate-inference
http://dustintran.com/blog/nips-2016-workshop-on-approximate-inference<p>We’re organizing a NIPS workshop on approximate inference. It is together with Tamara Broderick, Stephan Mandt, and James McInerney—and alongside an incredible cast of seminal researchers: David Blei, Andrew Gelman, Mike Jordan, and Kevin Murphy. [<a href="http://approximateinference.org">Workshop homepage</a>]</p>
<p>This year, we set a theme based on what we believe are some of the most important challenges. In particular, there’s an emphasis on the practice of approximate inference, whether it be challenges which arise in applications or in software. Advances in both methodology and theory are of course crucial to achieve this end-goal; we also highly encourage such work.</p>
<p><strong>Note</strong>: We have (quite a few!) travel awards. If you’re interested in applying, the travel
award and early application deadline is October 7.</p>
<p>Call for papers below.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>We invite researchers in machine learning and statistics to participate in the:
NIPS 2016 Workshop on Advances in Approximate Bayesian Inference
Friday 9 December 2016, Barcelona, Spain
www.approximateinference.org
Submission deadline: 1 November 2016
1. Call for Participation
We invite researchers to submit their recent work on the development, analysis, or application of approximate Bayesian inference. A submission should take the form of an extended abstract of 2-4 pages in PDF format using the NIPS style. Author names do not need to be anonymized and references may extend as far as needed beyond the 4 page upper limit. If authors' research has previously appeared in a journal, workshop, or conference (including the NIPS 2016 conference), their workshop submission should extend that previous work. Submissions may include a supplement/appendix, but reviewers are not responsible for reading any supplementary material.
Submissions will be accepted either as contributed talks or poster presentations. Extended abstracts should be submitted by 1 November; see website for submission details. Final versions of the extended abstract are due by 5 December, and will be posted on the workshop website.
2. Workshop Overview
Bayesian analysis has seen a resurgence in machine learning, expanding its scope beyond traditional applications. Increasingly complex models have been trained with large and streaming data sets, and they have been applied to a diverse range of domains. Key to this resurgence has been advances in approximate Bayesian inference. Variational and Monte Carlo methods are currently the mainstay techniques, where recent insights have improved their approximation quality, provided black box strategies for fitting many models, and enabled scalable computation.
In this year's workshop, we would like to continue the theme of approximate Bayesian inference with additional emphases. In particular, we encourage submissions not only advancing approximate inference but also regarding (1) unconventional inference techniques, with the aim to bring together diverse communities; (2) software tools for both the applied and methodological researcher; and (3) challenges in applications, both in non-traditional domains and when applying these techniques to advance current domains.
This workshop is a continuation of past years:
+ NIPS 2015 Workshop: Advances in Approximate Bayesian Inference
+ NIPS 2014 Workshop: Advances in Variational Inference
This workshop has been endorsed by the International Society for Bayesian Analysis (ISBA) and is supported by Disney Research.
3. Confirmed Speakers and Panelists
Invited speakers:
Barbara Engelhardt (Princeton University)
Surya Ganguli (Stanford University)
Jonathan Huggins (MIT)
Jeffrey Regier (UC Berkeley)
Matthew Johnson (Harvard University)
Panel: Software
TBA (Stan)
Noah Goodman (WebPPL; Stanford University)
Dustin Tran (Edward; Columbia University)
TBA (TensorFlow, BayesFlow; Google)
Michael Hughes (BNPy; Harvard University)
Panel: On the Foundations and Future of Approximate Inference
Ryan Adams (Harvard University, Twitter Cortex)
Barbara Engelhardt (Princeton University)
Philip Hennig (Max Planck Institute for Intelligent Systems)
Richard Turner (University of Cambridge)
Neil Lawrence (University of Sheffield)
4. Key Dates
Travel award application deadline: 7 October 2016
Early acceptance notification: 7 October 2016
Paper submission: 1 November 2016
Acceptance notification: 16 November 2016
Travel award notification: 16 November 2016
Final paper submission: 5 December 2016
Workshop organizers:
Tamara Broderick (MIT)
Stephan Mandt (Disney Research)
James McInerney (Columbia University)
Dustin Tran (Columbia University)
Advisory committee:
David Blei (Columbia University)
Andrew Gelman (Columbia University)
Michael Jordan (UC Berkeley)
Kevin Murphy (Google)
</code></pre></div></div>
Fri, 30 Sep 2016 00:00:00 -0700Discussion of "Fast Approximate Inference for Arbitrarily Large Semiparametric Regression Models via Message Passing"
http://dustintran.com/blog/discussion-of-fast-approximate-inference
http://dustintran.com/blog/discussion-of-fast-approximate-inference<p><em>This article is written with much help from David Blei. It is extracted from a discussion paper on “Fast Approximate Inference for Arbitrarily Large Semiparametric Regression Models via Message Passing”.</em> <a href="http://arxiv.org/abs/1609.05615">[link]</a></p>
<p>We commend <a class="citation" href="#wand2016fast">Wand (2016)</a> for an excellent description of
message passing (<span style="font-variant:small-caps;">mp</span>) and for developing it to infer large semiparametric
regression models. We agree with the author in fully embracing the
modular nature of message passing, where one can define “fragments”
that enable us to compose localized algorithms. We believe this
perspective can aid in the development of new algorithms for automated
inference.</p>
<p><strong>Automated inference.</strong> The promise of automated algorithms is
that modeling and inference can be separated. A user can construct
large, complicated models in accordance with the assumptions they
are willing to make about their data. Then the user can use generic
inference algorithms as a computational backend in a “probabilistic
programming language,” i.e., a language for specifying generative
probability models.</p>
<p>With probabilistic programming, the user no longer has to write their
own algorithms, which may require tedious model-specific derivations
and implementations. In the same spirit, the user no longer has to
bottleneck their modeling choices in order to fit the requirements of
an existing model-specific algorithm. Automated inference enables
probabilistic programming systems, such as
Stan <a class="citation" href="#carpenter2016stan">(Carpenter et al., 2016)</a>, through methods like
automatic differentiation variational inference (<span style="font-variant:small-caps;">advi</span>) <a class="citation" href="#kucukelbir2016automatic">(Kucukelbir, Tran, Ranganath, Gelman, & Blei, 2016)</a> and
the no-U-turn sampler (<span style="font-variant:small-caps;">nuts</span>) <a class="citation" href="#hoffman2014nuts">(Hoffman & Gelman, 2014)</a>.</p>
<p>Though they aim to apply to a large class of models, automated
inference algorithms typically need to incorporate modeling structure
in order to remain practical. For example, Stan assumes that one can
at least take gradients of a model’s joint density. (Contrast this
with other languages which assume one can only sample from the model.)
However, more structure is often necessary: <span style="font-variant:small-caps;">advi</span> and <span style="font-variant:small-caps;">nuts</span>
are not fast enough by themselves to infer very large models, such as
hierarchical models with many groups.</p>
<p>We believe <span style="font-variant:small-caps;">mp</span> and Wand’s work could offer fruitful avenues for
expanding the frontiers of automated inference. From our perspective,
a core principle underlying <span style="font-variant:small-caps;">mp</span> is to leverage structure when it
is available—in particular, statistical properties in the model—which provides useful computational properties. In <span style="font-variant:small-caps;">mp</span>, two
examples are conditional independence and conditional conjugacy.</p>
<p><strong>From conditional independence to distributed computation.</strong>
As <a class="citation" href="#wand2016fast">Wand (2016)</a> indicates, a crucial advantage of message
passing is that it modularizes inference; the computation can be
performed separately over conditionally independent posterior
factors. By definition, conditional independence separates a posterior
factor from the rest of the model, which enables <span style="font-variant:small-caps;">mp</span> to define a
series of iterative updates. These updates can be run asynchronously
and in a distributed environment.</p>
<center>
<img src="/blog/assets/2016-09-19-figure.png" style="width:200px;" />
</center>
<p><em>Figure 1.
A hierarchical model, with latent variables <script type="math/tex">\alpha_k</script> defined locally
per group and latent variables <script type="math/tex">\phi</script> defined globally to be shared across groups.</em></p>
<p>We are motivated by hierarchical models, which substantially benefit
from this property. Formally, let <script type="math/tex">y_{nk}</script> be the <script type="math/tex">n^{th}</script> data
point in group <script type="math/tex">k</script>, with a total of <script type="math/tex">N_k</script> data points in group <script type="math/tex">k</script> and
<script type="math/tex">K</script> many groups. We model the data using local latent variables
<script type="math/tex">\alpha_k</script> associated with group <script type="math/tex">k</script>, and using global latent
variables <script type="math/tex">\phi</script> which are shared across groups. The model is depicted
in Figure 1.</p>
<p>The posterior distribution of local variables <script type="math/tex">\alpha_k</script> and global
variables <script type="math/tex">\phi</script> is</p>
<script type="math/tex; mode=display">p(\alpha,\phi\mid\mathbf{y}) \propto
p(\phi) \prod_{k=1}^K
\Big[ p(\alpha_k\mid \phi) \prod_{n=1}^{N_k} p(y_{nk}\mid\alpha_k,\phi) \Big].</script>
<p>The benefit of distributed updates over the independent factors is
immediate. For example, suppose the data consists of 1,000 data points
per group (with 5,000 groups); we model it with 2 latent variables per
group and 20 global latent variables. Passing messages, or
inferential updates, in parallel provides an attractive approach to
handling all 10,020 latent dimensions. (In contrast, consider a
sequential algorithm that requires taking 10,019 steps for all other
variables before repeating an update of the first.)</p>
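<p>As a minimal sketch of this idea, the Python snippet below computes the exact conditional for every local variable independently, given the current global variable. It assumes a toy Gaussian model that is not from the original discussion: <code>y_nk ~ Normal(alpha_k, 1)</code> and <code>alpha_k ~ Normal(phi, 1)</code>. The function names <code>local_update</code> and <code>parallel_local_updates</code> are hypothetical.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def local_update(y_k, phi):
    """Exact conditional of alpha_k given its group's data and phi.

    Assumed toy model (an illustration, not the model in the text):
    y_nk ~ Normal(alpha_k, 1) and alpha_k ~ Normal(phi, 1).
    Returns the (mean, variance) of alpha_k | y_k, phi.
    """
    precision = 1.0 + len(y_k)  # prior precision plus one unit per data point
    mean = (phi + sum(y_k)) / precision
    return mean, 1.0 / precision


def parallel_local_updates(groups, phi, max_workers=4):
    """Update every group's local factor independently, given phi."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each group only needs its own data and phi, so the map needs no
        # communication between groups.
        return list(pool.map(lambda y_k: local_update(y_k, phi), groups))


groups = [[1.0, 1.0, 1.0], [2.0, 4.0], [0.5]]
messages = parallel_local_updates(groups, phi=0.0)
```

<p>The thread pool here only illustrates that the local updates do not depend on one another; in practice one would likely spread the groups across processes or machines.</p>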
<p>While this approach to leveraging conditional independence is
straightforward from the message passing perspective, it is not
necessarily immediate from other perspectives. For example, the
statistics literature has only recently come to similar ideas,
motivated by scaling up Markov chain Monte Carlo using divide and
conquer strategies <a class="citation" href="#huang2005sampling">(Huang & Gelman, 2005; Wang & Dunson, 2013)</a>.
These first analyze data locally over a partition of the joint
density, and second aggregate the local inferences. In our work in
<a class="citation" href="#gelman2014expectation">Gelman et al. (2014)</a>, we arrive at a continuation of this
idea. Like message passing, the process is iterated, so that local
information propagates to global information and global information
propagates to local information. In doing so, we obtain a scalable
approach to Monte Carlo inference, both from a top-down view which
deals with fitting statistical models to large data sets and from a
bottom-up view which deals with combining information across local
sources of data and models.</p>
<p><strong>From conditional conjugacy to exact iterative updates.</strong>
Another important element of message passing algorithms is conditional
conjugacy, which lets us easily calculate the exact distribution for a
posterior factor conditional on other latent variables. This enables
analytically tractable messages (cf. Equations (7)-(8) of
<a class="citation" href="#wand2016fast">Wand (2016)</a>).</p>
<p>Consider the same hierarchical model discussed above, and set</p>
<script type="math/tex; mode=display">p(y_k,\alpha_k\mid \phi)
= h(y_k, \alpha_k) \exp\{\phi^\top t(y_k, \alpha_k) - a(\phi)\},</script>
<script type="math/tex; mode=display">p(\phi)
= h(\phi) \exp\{\eta^{(0) \top} t(\phi) - a(\eta^{(0)})\}
.</script>
<p>The local factor <script type="math/tex">p(y_k,\alpha_k\mid\phi)</script> has sufficient statistics
<script type="math/tex">t(y_k,\alpha_k)</script> and natural parameters given by the global latent
variable <script type="math/tex">\phi</script>. The global factor <script type="math/tex">p(\phi)</script> has sufficient
statistics <script type="math/tex">t(\phi) = (\phi, -a(\phi))</script> and fixed
hyperparameters <script type="math/tex">\eta^{(0)}</script> with two components, <script type="math/tex">\eta^{(0)} =
(\eta^{(0)}_1,\eta^{(0)}_2)</script>.</p>
<p>This exponential family structure implies that, conditionally, the
posterior factors are also in the same exponential families
as the prior factors <a class="citation" href="#diaconis1979conjugate">(Diaconis & Ylvisaker, 1979)</a>,</p>
<script type="math/tex; mode=display">p(\phi\mid\mathbf{y},\alpha)
= h(\phi) \exp\{\eta(\mathbf{y},\alpha)^\top t(\phi) - a(\mathbf{y},\alpha)\},</script>
<script type="math/tex; mode=display">p(\alpha_k\mid y_k, \phi)
= h(\alpha_k) \exp\{\eta(y_k, \phi)^\top t(\alpha_k) - a(y_k, \phi)\}
.</script>
<p>The global factor’s natural parameter is <script type="math/tex">\eta(\mathbf{y},\alpha) =
(\eta^{(0)}_1 + \sum_{k=1}^K t(y_k, \alpha_k), \eta^{(0)}_2 + \sum_{k=1}^K N_k)</script>.</p>
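<p>To make the form of this update concrete, here is a small Python sketch for a deliberately simplified case, assumed for illustration only: a single Bernoulli likelihood in natural-parameter form with no local latent variables, so each observation contributes its sufficient statistic to the first component and one count to the second. Under this assumption the prior with natural parameter <code>(eta1, eta2)</code> corresponds to a <code>Beta(eta1, eta2 - eta1)</code> distribution on the success probability. The helper names are hypothetical.</p>

```python
def conjugate_update(eta_prior, suff_stats):
    """Add sufficient statistics and a factor count to the prior's natural parameter."""
    eta1, eta2 = eta_prior
    return eta1 + sum(suff_stats), eta2 + len(suff_stats)


def to_beta(eta):
    """Map the natural parameter (eta1, eta2) to Beta(a, b) on the success probability.

    Valid only for the assumed Bernoulli likelihood in natural-parameter form.
    """
    eta1, eta2 = eta
    return eta1, eta2 - eta1


eta0 = (1.0, 2.0)        # corresponds to a uniform Beta(1, 1) prior
x = [1, 0, 1, 1]         # for a Bernoulli, t(x) = x
eta_post = conjugate_update(eta0, x)
beta_post = to_beta(eta_post)
```

<p>The posterior stays in the prior's family, and the whole update is two additions, which is what makes the conditional messages analytically tractable.</p>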
<p>With this statistical property at play—namely that conjugacy gives
rise to tractable conditional posterior factors—we can derive
algorithms at a conditional level with exact iterative updates. This
is assumed for most of the message passing of semiparametric models in
<a class="citation" href="#wand2016fast">Wand (2016)</a>. Importantly, this is not necessarily a
limitation of the algorithm. It is a testament to leveraging model
structure: without access to tractable conditional posteriors,
additional approximations must be made. <a class="citation" href="#wand2016fast">Wand (2016)</a> provides
an elegant way to separate out these nonconjugate pieces from the
conjugate pieces.</p>
<p>In statistics, the most well-known example which leverages
conditionally conjugate factors is the Gibbs sampling algorithm. In
our own work, we apply the idea to access fast natural
gradients in variational inference, which account for the information
geometry of the parameter space <a class="citation" href="#hoffman2013stochastic">(Hoffman, Blei, Wang, & Paisley, 2013)</a>. In
other work, we demonstrate a collection of methods for gradient-based
marginal optimization <a class="citation" href="#tran2016gradient">(Tran, Gelman, & Vehtari, 2016)</a>. Assuming forms of
conjugacy in the model class, we arrive at the classic ideas of
iteratively reweighted least squares and the EM algorithm. Such
structure in the model provides efficient algorithms—both
statistically and computationally—for their automated inference.</p>
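<p>As an illustration of how conditional conjugacy yields exact iterative updates, the following is a minimal Gibbs sampler sketch in Python. It assumes a toy Gaussian hierarchy that is not taken from the discussion above: <code>y_nk ~ Normal(alpha_k, 1)</code>, <code>alpha_k ~ Normal(phi, 1)</code>, and <code>phi ~ Normal(0, prior_var)</code>. The function name <code>gibbs</code>, the synthetic data, and all settings are hypothetical choices.</p>

```python
import random
from statistics import fmean


def gibbs(groups, num_iters=2000, prior_var=100.0, seed=0):
    """Alternate exact conditional draws for local alpha_k and global phi.

    Assumed toy model: y_nk ~ Normal(alpha_k, 1), alpha_k ~ Normal(phi, 1),
    phi ~ Normal(0, prior_var). Returns the list of phi draws.
    """
    rng = random.Random(seed)
    num_groups = len(groups)
    sums = [sum(y_k) for y_k in groups]
    sizes = [len(y_k) for y_k in groups]
    phi = 0.0
    phi_draws = []
    for _ in range(num_iters):
        # Local step: each alpha_k | y_k, phi is Gaussian and depends only
        # on its own group's data and the current phi.
        alpha = []
        for s, n in zip(sums, sizes):
            precision = 1.0 + n
            alpha.append(rng.gauss((phi + s) / precision, precision ** -0.5))
        # Global step: phi | alpha is Gaussian by conjugacy.
        precision = num_groups + 1.0 / prior_var
        phi = rng.gauss(sum(alpha) / precision, precision ** -0.5)
        phi_draws.append(phi)
    return phi_draws


# Synthetic data: 20 groups of 50 points each, generated with phi = 2.
data_rng = random.Random(1)
groups = []
for _ in range(20):
    alpha_k = data_rng.gauss(2.0, 1.0)
    groups.append([data_rng.gauss(alpha_k, 1.0) for _ in range(50)])

draws = gibbs(groups)
phi_hat = fmean(draws[500:])  # posterior mean estimate after burn-in
```

<p>Both steps are closed-form draws, so no tuning or accept-reject step is required; without conjugacy, each step would need an additional approximation.</p>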
<p><strong>Open Challenges and Future Directions.</strong> Message passing is a
classic algorithm in the computer science literature, which is rich
with interesting ideas for statistical inference. In particular,
<span style="font-variant:small-caps;">mp</span> enables new advancements in the realm of automated inference,
where one can take advantage of statistical structure in the model.
<a class="citation" href="#wand2016fast">Wand (2016)</a> takes great steps in this direction.</p>
<p>With that said, important open challenges still exist in order to
realize this fusion.</p>
<p>The first concerns the design and implementation of probabilistic
programming languages. In order to implement <a class="citation" href="#wand2016fast">Wand (2016)</a>’s
message passing, the language must provide ways of identifying local
structure in a probabilistic program. While that is enough to let
practitioners use <span style="font-variant:small-caps;">mp</span>, a much larger challenge is to
then automate the process of detecting local structure.</p>
<p>The second concerns the design and implementation of inference engines.
The engine must be extensible, so that users can not only employ
the algorithm in <a class="citation" href="#wand2016fast">Wand (2016)</a> but easily build on top of it.
Further, its infrastructure must be able to encompass a variety of
algorithms, so that users can incorporate <span style="font-variant:small-caps;">mp</span> as one of many
tools in their toolbox.</p>
<p>Third, we think there are innovations to be made by taking the stance
of modularity to a further extreme. In principle, one can compose not
only localized message passing updates but also localized inference
algorithms of any choice—whether it be exact inference, Monte Carlo,
or variational methods. This modularity will enable new
experimentation with inference hybrids and can bridge the gap among
inference methods.</p>
<p>Finally, while we discuss <span style="font-variant:small-caps;">mp</span> in the context of automation,
fully automatic algorithms are not possible. Associated with all
inference are statistical and computational
tradeoffs <a class="citation" href="#jordan2013statistics">(Jordan, 2013)</a>. Thus we need algorithms along
the frontier, where a user can explicitly define a computational
budget and employ an algorithm achieving the best statistical
properties within that budget; or conversely, define desired
statistical properties and employ the fastest algorithm to achieve
them. We think ideas in <span style="font-variant:small-caps;">mp</span> will also help in developing some of
these algorithms.</p>
<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="carpenter2016stan">Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., … Riddell, A. (2016). Stan: A probabilistic programming language. <i>Journal of Statistical Software</i>.</span></li>
<li><span id="diaconis1979conjugate">Diaconis, P., & Ylvisaker, D. (1979). Conjugate Priors for Exponential Families. <i>The Annals of Statistics</i>, <i>7</i>(2), 269–281.</span></li>
<li><span id="gelman2014expectation">Gelman, A., Vehtari, A., Jylänki, P., Robert, C., Chopin, N., & Cunningham, J. P. (2014). Expectation propagation as a way of life. <i>ArXiv Preprint ArXiv:1412.4869</i>.</span></li>
<li><span id="hoffman2013stochastic">Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic Variational Inference. <i>Journal of Machine Learning Research</i>, <i>14</i>, 1303–1347.</span></li>
<li><span id="hoffman2014nuts">Hoffman, M. D., & Gelman, A. (2014). The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. <i>Journal of Machine Learning Research</i>, <i>15</i>, 1593–1623.</span></li>
<li><span id="huang2005sampling">Huang, Z., & Gelman, A. (2005). Sampling for Bayesian Computation with Large Datasets. <i>SSRN Electronic Journal</i>.</span></li>
<li><span id="jordan2013statistics">Jordan, M. I. (2013). On statistics, computation and scalability. <i>Bernoulli</i>, <i>19</i>(4), 1378–1390.</span></li>
<li><span id="kucukelbir2016automatic">Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2016). Automatic Differentiation Variational Inference. <i>ArXiv Preprint ArXiv:1603.00788</i>.</span></li>
<li><span id="tran2016gradient">Tran, D., Gelman, A., & Vehtari, A. (2016). Gradient-based marginal optimization. <i>Technical Report</i>.</span></li>
<li><span id="wand2016fast">Wand, M. P. (2016). Fast Approximate Inference for Arbitrarily Large Semiparametric Regression Models via Message Passing. <i>ArXiv Preprint ArXiv:1602.07412</i>.</span></li>
<li><span id="wang2013parallelizing">Wang, X., & Dunson, D. B. (2013). Parallelizing MCMC via Weierstrass sampler. <i>ArXiv Preprint ArXiv:1312.4605</i>.</span></li></ol>
Mon, 19 Sep 2016 00:00:00 -0700