An Optimization-centric View on Bayes' Rule: Reviewing and Generalizing Variational Inference

Authors: Jeremias Knoblauch, Jack Jewson, Theodoros Damoulas

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We explore applications of GVI posteriors, and show that they can be used to improve robustness and posterior marginals on Bayesian Neural Networks and Deep Gaussian Processes. ... Section 6: We demonstrate GVI on two large-scale inference applications: Bayesian Neural Networks (BNNs) and Deep Gaussian Processes (DGPs). ... The results are depicted in Figure 13 and confirm our two main intuitions about robustness: Firstly, the robust scoring rule provides a significant performance improvement. Secondly, the smaller value of γ (which will be closer to the log score) generally outperforms the larger value of γ, though both choices are equally good in many data sets.
Researcher Affiliation Academia Jeremias Knoblauch (EMAIL), The Alan Turing Institute and Dept. of Statistics, University of Warwick, Coventry, CV4 7AL, UK; Jack Jewson (EMAIL), The Alan Turing Institute and Dept. of Statistics, University of Warwick, Coventry, CV4 7AL, UK; Theodoros Damoulas (EMAIL), The Alan Turing Institute and Depts. of Computer Science & Statistics, University of Warwick, Coventry, CV4 7AL, UK
Pseudocode Yes Algorithm 1: Black Box GVI (BBGVI)
Input: x_{1:n}, π, D, ℓ, Q, h, stopping criterion, κ_0, K, S, learning rate; t = 0
done ← False
while not done do
    // STEP 1: Get a subsample of size K from x_{1:n}
    ρ_{1:K} ← SampleWithoutReplacement(1:n, K); x^{(t)}_{1:K} ← x_{ρ_{1:K}}
    // STEP 2: Sample from q(θ|κ_t) and compute losses
    θ^{(1:S)} ~ q(θ|κ_t) i.i.d.
    ℓ_{i,s} ← ℓ(θ^{(s)}, x^{(t)}_i) ∇_{κ_t} log q(θ^{(s)}|κ_t) for all s = 1, …, S and i = 1, …, K
    ℓ_s ← (n/K) Σ_{i=1}^{K} ℓ_{i,s} for all s = 1, …, S
    // STEP 3: Compute divergence term
    if D(q‖π) admits a closed form then
        ℓ_s ← ℓ_s + ∇_κ D(q‖π) for all s = 1, …, S
    else if D(q‖π) = E_q[ℓ^D_{κ,π}(θ)] then
        ℓ_s ← ℓ_s + ℓ^D_{κ,π}(θ^{(s)}) ∇_{κ_t} log q(θ^{(s)}|κ_t) + ∇_{κ_t} ℓ^D_{κ_t,π}(θ^{(s)}) for all s = 1, …, S
    else if D(q‖π) = τ(E_q[ℓ^D_{κ,π}(θ)]) then
        ℓ_s ← ℓ_s + τ′((1/S) Σ_{s=1}^{S} ℓ^D_{κ,π}(θ^{(s)})) ∇_{κ_t} ℓ^D_{κ_t,π}(θ^{(s)}).
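The per-iteration estimate in Steps 1–3 can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function and argument names (`gvi_objective_estimate`, `q_sample`, `divergence`) are hypothetical, the gradient/score-function terms are omitted, and only the scaled minibatch loss plus a closed-form divergence term is estimated.

```python
import numpy as np

def gvi_objective_estimate(x, q_sample, loss, divergence, S=16, K=32, rng=None):
    """Monte Carlo estimate of the GVI objective (Steps 1-3, value only).

    q_sample(S) -> S draws of theta from q(. | kappa_t)
    loss(theta, x_i) -> scalar loss for one data point
    divergence() -> closed-form D(q || pi)
    """
    if rng is None:
        rng = np.random.default_rng()
    n = len(x)
    # STEP 1: subsample K points without replacement
    idx = rng.choice(n, size=min(K, n), replace=False)
    # STEP 2: sample theta and compute the rescaled minibatch loss per sample
    thetas = q_sample(S)
    ell = np.array([(n / len(idx)) * sum(loss(th, x[i]) for i in idx)
                    for th in thetas])
    # STEP 3: average over theta samples and add the divergence term
    return ell.mean() + divergence()
```

In practice this estimate would be differentiated with respect to the variational parameters κ (via the score-function or reparameterization trick) and fed to an optimizer such as ADAM, as the algorithm's learning-rate input suggests.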
Open Source Code Yes All code used for generating the experiments is available from https://github.com/JeremiasKnoblauch/GVIPublic.
Open Datasets Yes We use the same settings, meaning that all experiments use 20,000 iterations of the ADAM optimizer (Kingma and Ba, 2014) with a learning rate of 0.01 and default settings for all other hyperparameters. We perform inference for each of the UCI data sets (Lichman, 2013) after normalization using the RBF kernel with dimension-wise lengthscales, 100 inducing points, with batch sizes of min(1000, n) and Dl = min(Dx, 30).
Dataset Splits Yes Using 50 random splits of the relevant data into training (90%) and test (10%) sets, the inferred models are evaluated predictively on the test sets using the average negative log likelihood (NLL) as well as the average root mean square error (RMSE). ... As before, we use 50 random splits with 90% training and 10% test data to assess predictive performance in terms of negative log likelihood (NLL) and root mean square error (RMSE).
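The stated evaluation protocol (50 random 90%/10% train/test splits, averaging test NLL and RMSE) could be sketched as below; `fit` and `predict_nll_rmse` are hypothetical stand-ins for the model-specific training and scoring routines, not functions from the paper's code.

```python
import numpy as np

def evaluate_splits(X, y, fit, predict_nll_rmse,
                    n_splits=50, test_frac=0.1, seed=0):
    """Average test NLL and RMSE over repeated random train/test splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    nlls, rmses = [], []
    for _ in range(n_splits):
        perm = rng.permutation(n)                # shuffle indices
        n_test = max(1, int(test_frac * n))      # 10% held out
        test, train = perm[:n_test], perm[n_test:]
        model = fit(X[train], y[train])          # train on the 90% split
        nll, rmse = predict_nll_rmse(model, X[test], y[test])
        nlls.append(nll)
        rmses.append(rmse)
    return float(np.mean(nlls)), float(np.mean(rmses))
```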
Hardware Specification No No specific hardware details (like GPU models, CPU models, or cloud instance types) were explicitly mentioned for running the experiments.
Software Dependencies No Our implementation is built on top of that used for the results of Li and Turner (2016) and only changes the objective being optimized. Similarly, all settings and data sets for which the methods are compared are unchanged and taken directly from Li and Turner (2016) and Hernandez-Lobato et al. (2016): We use a single-layer network with 50 ReLU nodes on all experiments. Inference is performed via probabilistic back-propagation (Hernandez-Lobato and Adams, 2015) and the ADAM optimizer (Kingma and Ba, 2014) with its default settings, 500 epochs and a batch size of 32. ... As with the experiments on BNNs in the previous section, we make comparisons as fair as possible by using the GPflow (Matthews et al., 2017) implementation of Salimbeni and Deisenroth (2017).
Experiment Setup Yes We use a single-layer network with 50 ReLU nodes on all experiments. Inference is performed via probabilistic back-propagation (Hernandez-Lobato and Adams, 2015) and the ADAM optimizer (Kingma and Ba, 2014) with its default settings, 500 epochs and a batch size of 32. ... Further, we use the same settings, meaning that all experiments use 20,000 iterations of the ADAM optimizer (Kingma and Ba, 2014) with a learning rate of 0.01 and default settings for all other hyperparameters. We perform inference for each of the UCI data sets (Lichman, 2013) after normalization using the RBF kernel with dimension-wise lengthscales, 100 inducing points, with batch sizes of min(1000, n) and Dl = min(Dx, 30).
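The data-dependent settings quoted above can be collected in a small helper; `dgp_settings` is an illustrative name, and the dictionary simply restates the fixed choices reported in the paper (batch size min(1000, n), latent dimension Dl = min(Dx, 30), 100 inducing points, 20,000 ADAM iterations at learning rate 0.01).

```python
def dgp_settings(n, Dx):
    """Restate the reported DGP experiment settings for a data set with
    n observations and Dx input dimensions (illustrative helper)."""
    return {
        "batch_size": min(1000, n),   # batch sizes of min(1000, n)
        "Dl": min(Dx, 30),            # Dl = min(Dx, 30)
        "num_inducing": 100,          # 100 inducing points
        "iterations": 20000,          # 20,000 ADAM iterations
        "learning_rate": 0.01,        # ADAM learning rate
    }
```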