Automatic Differentiation Variational Inference

Authors: Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, David M. Blei

JMLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study advi across ten modern probabilistic models and apply it to a dataset with millions of observations. ... Section 3 studies the properties of advi. We explore its accuracy, its stochastic nature, and its sensitivity to transformations. Section 4 applies advi to an array of probability models. We compare its speed to mcmc sampling techniques and present a case study using a dataset with millions of observations.
Researcher Affiliation | Academia | Alp Kucukelbir EMAIL Data Science Institute, Department of Computer Science, Columbia University, New York, NY 10027, USA; Dustin Tran EMAIL Department of Computer Science, Columbia University, New York, NY 10027, USA; Rajesh Ranganath EMAIL Department of Computer Science, Princeton University, Princeton, NJ 08540, USA; Andrew Gelman EMAIL Data Science Institute, Departments of Political Science and Statistics, Columbia University, New York, NY 10027, USA; David M. Blei EMAIL Data Science Institute, Departments of Computer Science and Statistics, Columbia University, New York, NY 10027, USA
Pseudocode | Yes | Algorithm 1: Automatic differentiation variational inference (advi)
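Algorithm 1 itself is not reproduced in this summary. As a rough illustration of the reparameterized, single-sample (M = 1) gradient step it describes, here is a toy one-dimensional sketch; the target density, hand-coded gradients, initialization, and step-size schedule are our own choices for illustration, not the paper's:

```python
import numpy as np

def advi_toy(num_iters=20000, seed=0):
    """Minimal one-dimensional sketch of an ADVI-style gradient step.

    Target: log p(theta) = log N(theta; 0, 1).  Variational family:
    q(theta) = N(mu, exp(omega)^2), so omega lives on the real line and
    no constraining transformation is needed.  The ELBO gradient is
    estimated with a single reparameterized sample (M = 1), as in the
    paper; everything else here is a toy assumption.
    """
    rng = np.random.default_rng(seed)
    mu, omega = 1.0, 0.5                      # deliberately off target
    for i in range(1, num_iters + 1):
        z = rng.standard_normal()
        theta = mu + np.exp(omega) * z        # reparameterization trick
        grad_logp = -theta                    # d/dtheta log N(theta; 0, 1)
        grad_mu = grad_logp                   # chain rule: dtheta/dmu = 1
        grad_omega = grad_logp * z * np.exp(omega) + 1.0  # + entropy grad
        rho = 0.5 / (20.0 + i)                # satisfies Robbins-Monro
        mu += rho * grad_mu
        omega += rho * grad_omega
    return mu, np.exp(omega)

mu_hat, sigma_hat = advi_toy()
```

For this toy the exact posterior is N(0, 1), so the fitted mean and standard deviation can be checked directly against 0 and 1.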
Open Source Code | Yes | We implement and deploy advi as part of Stan, a probabilistic programming system (Stan Development Team, 2016). ... Appendix E. Running advi in Stan. Visit http://mc-stan.org/ to download the latest version of Stan. Follow instructions on how to install Stan.
Open Datasets | Yes | A dataset of trajectories is publicly available: it contains all 1.7 million taxi rides taken during the year 2014 (European Conference of Machine Learning, 2015). ... We use the Frey Faces dataset, which contains 1956 frames (28 × 20 pixels) of facial expressions extracted from a video sequence. ... We explore the imageclef dataset, which has 250 000 images (Villegas et al., 2013). ... a polling dataset from the United States 1988 presidential election (Gelman and Hill, 2006).
Dataset Splits | Yes | Linear regression with ard ... We use 10 000 data points for training and withhold 1000 for evaluation. ... Logistic regression with a spatial hierarchical prior ... We use 10 000 data points for training and withhold 1536 for evaluation. ... Gaussian Mixture Model ... We withhold 10 000 images for evaluation.
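The splits above are plain held-out evaluation sets. A hypothetical helper mirroring the linear-regression split (10 000 training points, 1000 withheld) might look like the following; the paper does not say how held-out indices were chosen, so the random permutation is an assumption:

```python
import numpy as np

def train_eval_split(n_total, n_eval, seed=0):
    """Hypothetical split helper for the paper's held-out evaluations
    (e.g. 10 000 training points with 1000 withheld for linear
    regression with ard).  The random permutation is our assumption,
    not a documented detail of the paper.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)          # shuffle all indices once
    return idx[n_eval:], idx[:n_eval]       # (train, evaluation)

train_idx, eval_idx = train_eval_split(11_000, 1_000)
```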
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. It mentions the general speed of the methods but not the underlying hardware.
Software Dependencies | Yes | We implement and deploy advi as part of Stan, a probabilistic programming system (Stan Development Team, 2016). ... The first is in PyMC3 (Salvatier et al., 2016), a probabilistic programming package, which implements advi in Python using Theano. The second is in Edward (Tran et al., 2016a), a Python library for probabilistic modeling, inference, and criticism, which implements advi in Python using TensorFlow.
Experiment Setup | Yes | A single sample suffices. (We set M = 1 from here on.) ... The results in Figure 10a use a0 = b0 = c0 = d0 = 1 as hyper-parameters for the Gamma priors. ... The regression coefficient β has a Normal(0, 10) prior and all standard deviation latent variables have half-Normal(0, 10) priors. ... We set K = 10 and all the Gamma hyper-parameters to 1 in our experiments. ... We set K = 10, α0 = 1000 for each component, and λ0 = 0.1. ... With a minibatch size of 500 or larger, advi reaches high predictive accuracy. ... We set ϵ = 10^-16, a small value that guarantees that the step-size sequence satisfies the Robbins and Monro (1951) conditions. The weighting factor α ∈ (0, 1) defines a compromise of old and new gradient information, which we set to 0.1. ... we set τ = 1. ... We adaptively tune η by searching over η ∈ {0.01, 0.1, 1, 10, 100} using a subset of the data and selecting the value that leads to the fastest convergence (Bottou, 2012).
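The quoted constants (ϵ = 10^-16, α = 0.1, τ = 1, and the η grid) parameterize the paper's adaptive step-size sequence. A sketch of that schedule, under our reading of the paper's description, is below; the constants are quoted, but the exact functional form here should be treated as an assumption:

```python
import numpy as np

def advi_step_sizes(grads, eta=0.1, alpha=0.1, tau=1.0, eps=1e-16):
    """Sketch of the adaptive step-size sequence the quotes describe.

    s is an exponentially weighted average of squared gradients with
    weighting factor alpha = 0.1; the i**(-0.5 + eps) decay with
    eps = 1e-16 keeps the sequence within the Robbins-Monro conditions;
    tau = 1 damps the earliest iterations; eta is the scale chosen by
    grid search over {0.01, 0.1, 1, 10, 100}.  The functional form is
    our reading of the paper, not a verbatim transcription.
    """
    s, rhos = None, []
    for i, g in enumerate(grads, start=1):
        g2 = float(g) ** 2
        s = g2 if s is None else alpha * g2 + (1.0 - alpha) * s
        rhos.append(eta * i ** (-0.5 + eps) / (tau + np.sqrt(s)))
    return rhos

rhos = advi_step_sizes([1.0, 1.0, 1.0, 1.0])
# with a constant unit gradient the sequence decays like (eta / 2) * i**-0.5
```

Because ϵ is tiny, the decay is effectively i^(-1/2), while the moving average adapts the scale per coordinate in the full algorithm.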