Bayesian Optimization via Continual Variational Last Layer Training

Authors: Paul Brunzema, Mikkel Jordahn, John Willes, Sebastian Trimpe, Jasper Snoek, James Harrison

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we propose an approach which shows competitive performance on many problem types, including some that BNNs typically struggle with. We build on variational Bayesian last layers (VBLLs), and connect training of these models to exact conditioning in GPs. We exploit this connection to develop an efficient online training algorithm that interleaves conditioning and optimization. Our findings suggest that VBLL networks significantly outperform GPs and other BNN architectures on tasks with complex input correlations, and match the performance of well-tuned GPs on established benchmark tasks. ... Figure 3: Classic benchmarks (top) and high-dimensional and non-stationary benchmarks (bottom). Performance of all surrogates for log EI (top) and TS (bottom). ... Figure 5: Multi-objective benchmarks. Performance of all surrogate models using log EHVI and VBLLs with TS.
Researcher Affiliation | Collaboration | 1 RWTH Aachen University, 2 Technical University of Denmark, 3 Vector Institute, 4 Google DeepMind
Pseudocode | Yes | Algorithm 1: VBLL Bayesian Optimization Loop with Continual Variational Last Layer Training
Open Source Code | No | The paper states: "We implement the VBLLs within BoTorch (Balandat et al., 2020). We further build on the implementation of Li et al. (2024) for the different baselines which are also based on BoTorch as well as GPyTorch (Gardner et al., 2018)." This refers to the use of third-party frameworks, not the release of the authors' specific implementation code for this paper's methodology.
Open Datasets | Yes | We evaluate the performance of the VBLL surrogate model on various standard benchmarks and three more complex optimization problems... Our results on the 200D NNdraw benchmark (Li et al., 2024), the real-world 25D Pest Control benchmark (Oh et al., 2019), and the 12D Lunar Lander benchmark (Eriksson et al., 2019) are shown in Figure 3 (bottom). ... Here, we consider the standard benchmarks Branin-Currin (D = 2, K = 2), DTLZ1 (D = 5, K = 2), DTLZ2 (D = 5, K = 2), and the real-world benchmark Oil Sorbent (D = 7, K = 3) (Wang et al., 2020; Li et al., 2024).
Dataset Splits | No | The paper does not provide train/test/validation dataset splits. In Bayesian optimization, data is acquired sequentially rather than being split from a pre-existing dataset. The paper specifies only the initialization: "In all subsequent experiments, we select the number of initial points for the single objective benchmarks equal to the input dimensionality D and for the multi-objective benchmarks we use 2(D + 1) initial points (Daulton et al., 2020; Balandat et al., 2020)." This describes initialization for the BO process, not dataset splits.
Hardware Specification | No | The paper states in the acknowledgments: "Simulations were performed in part with computing resources granted by RWTH Aachen University under projects rwth1579 and p0022034." This is a general statement about computing resources and does not provide specific hardware details (e.g., CPU/GPU models, memory).
Software Dependencies | No | The paper mentions: "We implement the VBLLs within BoTorch (Balandat et al., 2020)... as well as GPyTorch (Gardner et al., 2018)." and "For all experiments, we use AdamW (Loshchilov & Hutter, 2017) as our optimizer". However, it does not provide version numbers for BoTorch, GPyTorch, or the underlying PyTorch installation, which are required for reproducible software dependencies.
Experiment Setup | Yes | For training the VBLL models, we closely follow Harrison et al. (2024). For all experiments, we use AdamW (Loshchilov & Hutter, 2017) as our optimizer with a learning rate of 10⁻³, set the weight decay for the backbone (not including the parameters of the VBLL) to 10⁻⁴, and use norm-based gradient clipping with a value of 1. For the VBLL, we set the prior scale to 1 and the Wishart scale to 0.01. ... We track the average loss of a training epoch and if this average loss does not improve for 100 epochs in a row, we stop training and use the model parameters that yielded the lowest training loss.
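The early-stopping rule quoted above (halt once the average epoch loss has not improved for 100 consecutive epochs, then restore the best parameters) can be sketched in plain Python. This is an illustrative sketch, not the authors' released code; the class name, the `patience` default, and the checkpoint hook are assumptions.

```python
class PatienceStopper:
    """Tracks the average epoch loss and signals a stop after `patience`
    consecutive epochs without improvement (default 100, as in the paper)."""

    def __init__(self, patience=100):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None          # epoch whose parameters would be restored
        self.epochs_without_improvement = 0

    def update(self, epoch, avg_epoch_loss):
        """Record one epoch's average loss; return True if training should stop."""
        if avg_epoch_loss < self.best_loss:
            self.best_loss = avg_epoch_loss
            self.best_epoch = epoch     # in practice, checkpoint the model here
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

In a training loop one would call `stopper.update(epoch, avg_loss)` after each epoch, break when it returns `True`, and reload the checkpoint saved at `best_epoch`, which matches the paper's description of using the parameters with the lowest training loss.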