U-Statistics for Importance-Weighted Variational Inference

Authors: Javier Burroni, Kenta Takatsu, Justin Domke, Daniel Sheldon

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental We find empirically that U-statistic variance reduction can lead to modest to significant improvements in inference performance on a range of models, with little computational cost. We demonstrate on a diverse set of inference problems that U-statistic-based variance reduction for the IW-ELBO either does not change, or leads to modest to significant gains in, black-box VI performance, with no substantive downsides. We empirically show that U-statistic-based estimators also reduce variance during IWAE training and lead to models with higher training objective values when used with either the standard gradient estimator or the doubly-reparameterized gradient (DReG) estimator (Tucker et al., 2018). For black-box IWVI, we experiment with two kinds of models: Bayesian logistic regression with 5 different UCI datasets (Dua & Graff, 2017) using both diagonal and full covariance Gaussian variational distributions, and a suite of 12 statistical models from the Stan example models (Stan Development Team, 2021; Carpenter et al., 2017).
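To make the U-statistic idea quoted above concrete: given n log-importance-weights, the standard IW-ELBO estimator averages log-mean-exp over one disjoint partition into groups of size m, while the complete U-statistic averages over all size-m subsets. A minimal NumPy sketch (function names are illustrative, not from the paper's code):

```python
import numpy as np
from itertools import combinations

def iw_elbo_standard(logw, m):
    """Standard estimator: split the n log-weights into n//m
    disjoint groups and average log-mean of weights per group."""
    groups = np.asarray(logw, dtype=float).reshape(-1, m)
    return float(np.mean(np.log(np.mean(np.exp(groups), axis=1))))

def iw_elbo_complete_u(logw, m):
    """Complete U-statistic: average over all C(n, m) size-m
    subsets (tractable only for small n; the paper uses cheaper
    incomplete variants for this reason)."""
    vals = [np.log(np.mean(np.exp(np.array(s))))
            for s in combinations(logw, m)]
    return float(np.mean(vals))
```

Both estimators are unbiased for the same IW-ELBO with m samples; because the U-statistic averages over every size-m subset rather than one partition, its variance is never larger.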
Researcher Affiliation Academia Javier Burroni (EMAIL), University of Massachusetts Amherst; Kenta Takatsu (EMAIL), Carnegie Mellon University; Justin Domke (EMAIL), University of Massachusetts Amherst; Daniel Sheldon (EMAIL), University of Massachusetts Amherst
Pseudocode No The paper defines estimators (Estimator 1, Estimator 2, etc.) and theoretical propositions, but it does not include any clearly labeled pseudocode blocks or algorithms in a structured, code-like format.
Open Source Code No The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a direct link to a code repository. It mentions using PyTorch and Pyro's DReG implementation but does not offer its own code.
Open Datasets Yes For black-box IWVI, we experiment with two kinds of models: Bayesian logistic regression with 5 different UCI datasets (Dua & Graff, 2017) using both diagonal and full covariance Gaussian variational distributions, and a suite of 12 statistical models from the Stan example models (Stan Development Team, 2021; Carpenter et al., 2017). To evaluate the performance of the proposed methods on IWAEs, we trained IWAEs on 4 different datasets: MNIST, KMNIST, FMNIST, and Omniglot.
Dataset Splits Yes
  Dataset   Dim  Train + Test   Source
  MNIST     784  60000 + 10000  LeCun et al. (2010)
  FMNIST    784  60000 + 10000  Fashion-MNIST, Xiao et al. (2017)
  KMNIST    784  60000 + 10000  Kuzushiji-MNIST, Clanuwat et al. (2018)
  Omniglot  784  24345 + 8070   Lake et al. (2015), from Burda et al. (2016)
Hardware Specification No To get consistent wall-clock time measurements, we trained only using CPU on dedicated servers, with disabled hyper-threading and a single task per core. This statement describes the general environment but lacks specific CPU models, clock speeds, or other detailed hardware specifications.
Software Dependencies No We used SGD with 15 different learning rates... We used the reparameterization gradient estimator as the base gradient estimator, and also provide in Appendix D and G (very similar) results for the doubly-reparameterized (DReG) gradient estimator. For a randomly-sampled Dirichlet distribution with 50 parameters, we approximate it using a (50 − 1)-dimensional Gaussian distribution parameterized with a full rank covariance matrix, with its domain constrained to the simplex using PyTorch's distributions (Paszke et al., 2019). We trained each combination of dataset, method, and value of m using five different random seeds, and the optimization was run for 100 epochs using Adam (Kingma & Ba, 2015). Our implementation of DReG is based on Pyro's (Bingham et al., 2018) not-yet-integrated implementation.
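For context on the "domain constrained to the simplex" detail: PyTorch's distributions library provides a stick-breaking bijection from R^(K−1) to the interior of the K-simplex. A NumPy sketch of one common form of this transform, offered as an illustration rather than PyTorch's exact implementation:

```python
import numpy as np

def stick_breaking(y):
    """Map an unconstrained vector y in R^(K-1) to the interior of
    the K-simplex by peeling off a fraction of the remaining 'stick'
    at each coordinate (illustrative sketch, not PyTorch's code)."""
    y = np.asarray(y, dtype=float)
    K = y.size + 1
    x = np.empty(K)
    remaining = 1.0
    for k in range(K - 1):
        # the log(K - k - 1) offset makes y = 0 map to the uniform
        # point (1/K, ..., 1/K)
        frac = 1.0 / (1.0 + np.exp(-(y[k] - np.log(K - k - 1))))
        x[k] = remaining * frac
        remaining -= x[k]
    x[-1] = remaining
    return x
```

Composing a 49-dimensional full-covariance Gaussian with such a bijection gives a variational distribution supported on the 50-simplex, matching the Dirichlet setup quoted above.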
Experiment Setup Yes For each model, the variational parameters were optimized using stochastic gradient descent with fixed learning rate for 15 different logarithmically spaced learning rates. We used n = 16 samples per iteration except for the running time analysis, and experimented with m in {2, 4, 8}. To evaluate the performance of the proposed methods on IWAEs, we trained IWAEs on 4 different datasets: MNIST, KMNIST, FMNIST, and Omniglot. We compare the standard IW-ELBO estimator and DReG estimators to their permuted versions, i.e., the permuted and permuted-DReG estimators. We also evaluate the second-order approximation to the complete-U-statistic estimator. We trained each combination of dataset, method, and value of m using five different random seeds, and the optimization was run for 100 epochs using Adam (Kingma & Ba, 2015). In all cases, we used a batch size of 500, and a latent variable of dimension 50, while taking n = 50 samples. Datasets were taken from PyTorch, except for Omniglot, for which we used the construction provided by Burda et al. (2016). We evaluated using the standard IW-ELBO estimator, regardless of the estimator used for the optimization. ... with a learning rate of 10^-4.
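The "permuted" estimator compared in the setup above can be read as an incomplete U-statistic: average the disjoint-block IW-ELBO estimator over a few random permutations of the n samples. A hedged NumPy sketch, with the function name and the n_perm parameter chosen for illustration:

```python
import numpy as np

def iw_elbo_permuted(logw, m, n_perm=4, rng=None):
    """Incomplete U-statistic (sketch): shuffle the n log-weights,
    split into n//m disjoint groups, average log-mean of weights
    per group; repeat for n_perm permutations and average."""
    rng = np.random.default_rng() if rng is None else rng
    logw = np.asarray(logw, dtype=float)
    vals = []
    for _ in range(n_perm):
        groups = rng.permutation(logw).reshape(-1, m)
        vals.append(np.mean(np.log(np.mean(np.exp(groups), axis=1))))
    return float(np.mean(vals))
```

Each permutation reuses the same n weights, so the extra cost over the standard estimator is only the cheap log-mean-exp reductions, consistent with the paper's claim of little computational overhead.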