U-Statistics for Importance-Weighted Variational Inference
Authors: Javier Burroni, Kenta Takatsu, Justin Domke, Daniel Sheldon
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find empirically that U-statistic variance reduction can lead to modest to significant improvements in inference performance on a range of models, with little computational cost. We demonstrate on a diverse set of inference problems that U-statistic-based variance reduction for the IW-ELBO either does not change, or leads to modest to significant gains in, black-box VI performance, with no substantive downsides. We empirically show that U-statistic-based estimators also reduce variance during IWAE training and lead to models with higher training objective values when used with either the standard gradient estimator or the doubly-reparameterized gradient (DReG) estimator (Tucker et al., 2018). For black-box IWVI, we experiment with two kinds of models: Bayesian logistic regression with 5 different UCI datasets (Dua & Graff, 2017) using both diagonal and full covariance Gaussian variational distributions, and a suite of 12 statistical models from the Stan example models (Stan Development Team, 2021; Carpenter et al., 2017). |
| Researcher Affiliation | Academia | Javier Burroni, EMAIL, University of Massachusetts Amherst; Kenta Takatsu, EMAIL, Carnegie Mellon University; Justin Domke, EMAIL, University of Massachusetts Amherst; Daniel Sheldon, EMAIL, University of Massachusetts Amherst |
| Pseudocode | No | The paper defines estimators (Estimator 1, Estimator 2, etc.) and theoretical propositions, but it does not include any clearly labeled pseudocode blocks or algorithms in a structured, code-like format. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a direct link to a code repository. It mentions using PyTorch and Pyro's DReG implementation but does not offer its own code. |
| Open Datasets | Yes | For black-box IWVI, we experiment with two kinds of models: Bayesian logistic regression with 5 different UCI datasets (Dua & Graff, 2017) using both diagonal and full covariance Gaussian variational distributions, and a suite of 12 statistical models from the Stan example models (Stan Development Team, 2021; Carpenter et al., 2017). To evaluate the performance of the proposed methods on IWAEs, we trained IWAEs on 4 different datasets: MNIST, KMNIST, FMNIST, and Omniglot. |
| Dataset Splits | Yes | MNIST: 784 dims, 60,000 train + 10,000 test (LeCun et al., 2010); FMNIST: 784 dims, 60,000 + 10,000 (Fashion-MNIST; Xiao et al., 2017); KMNIST: 784 dims, 60,000 + 10,000 (Kuzushiji-MNIST; Clanuwat et al., 2018); Omniglot: 784 dims, 24,345 + 8,070 (Lake et al., 2015; via Burda et al., 2016) |
| Hardware Specification | No | To get consistent wall-clock time measurements, we trained only using CPU on dedicated servers, with disabled hyper-threading and a single task per core. This statement describes the general environment but lacks specific CPU models, clock speeds, or other detailed hardware specifications. |
| Software Dependencies | No | We used SGD with 15 different learning rates... We used the reparameterization gradient estimator as the base gradient estimator, and also provide in Appendix D and G (very similar) results for the doubly-reparameterized (DReG) gradient estimator. For a randomly-sampled Dirichlet distribution with 50 parameters, we approximate it using a (50 − 1)-dimensional Gaussian distribution parameterized with a full rank covariance matrix, with its domain constrained to the simplex using PyTorch's distributions (Paszke et al., 2019). We trained each combination of dataset, method, and value of m using five different random seeds, and the optimization was run for 100 epochs using Adam (Kingma & Ba, 2015). Our implementation of DReG is based on Pyro's (Bingham et al., 2018) not-yet-integrated implementation. |
| Experiment Setup | Yes | For each model, the variational parameters were optimized using stochastic gradient descent with fixed learning rate for 15 different logarithmically spaced learning rates. We used n = 16 samples per iteration except for the running time analysis, and experimented with m ∈ {2, 4, 8}. To evaluate the performance of the proposed methods on IWAEs, we trained IWAEs on 4 different datasets: MNIST, KMNIST, FMNIST, and Omniglot. We compare the standard IW-ELBO estimator and DReG estimators to their permuted versions, i.e., the permuted and permuted-DReG estimators. We also evaluate the second-order approximation to the complete-U-statistic estimator. We trained each combination of dataset, method, and value of m using five different random seeds, and the optimization was run for 100 epochs using Adam (Kingma & Ba, 2015). In all cases, we used a batch size of 500, and a latent variable of dimension 50, while taking n = 50 samples. Datasets were taken from PyTorch, except for Omniglot, for which we used the construction provided by Burda et al. (2016). We evaluated using the standard IW-ELBO estimator, regardless of the estimator used for the optimization. ... with a learning rate of 10^-4. |
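Since the paper provides no pseudocode, a minimal NumPy sketch of the estimators being compared may clarify the setup (function names and the toy interface below are my own, not the paper's): the standard IW-ELBO estimator splits the n importance weights into n/m disjoint groups and averages the log-mean-weight over groups; the complete-U-statistic estimator instead averages over all C(n, m) size-m subsets; and the permuted estimator approximates the latter by averaging the disjoint estimator over random permutations of the samples.

```python
import itertools
import math
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log-sum-exp along the given axis."""
    hi = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(hi + np.log(np.sum(np.exp(a - hi), axis=axis, keepdims=True)), axis=axis)

def iw_elbo_disjoint(logw, m):
    """Standard IW-ELBO estimator: split the n = k*m log-weights into k
    disjoint groups and average log-mean-weight over the groups."""
    groups = np.reshape(logw, (-1, m))
    return float(np.mean(logsumexp(groups, axis=1) - math.log(m)))

def iw_elbo_complete_u(logw, m):
    """Complete U-statistic: average the group estimate over all C(n, m)
    size-m subsets of the n log-weights (unbiased for the same target,
    lower variance, but combinatorially expensive for large n)."""
    n = len(logw)
    vals = [logsumexp(logw[list(s)], axis=0) - math.log(m)
            for s in itertools.combinations(range(n), m)]
    return float(np.mean(vals))

def iw_elbo_permuted(logw, m, num_perms, rng):
    """Incomplete (permuted) U-statistic: approximate the complete estimator
    by averaging the disjoint estimator over random permutations."""
    return float(np.mean([iw_elbo_disjoint(rng.permutation(logw), m)
                          for _ in range(num_perms)]))
```

All three compute an estimate of the same IW-ELBO objective; the complete and permuted versions simply average the per-group statistic over many more size-m subsets, which is the source of the variance reduction reported in the paper.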