BILBO: BILevel Bayesian Optimization
Authors: Ruth Wan Theng Chew, Quoc Phong Nguyen, Bryan Kian Hsiang Low
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The performance of BILBO is theoretically guaranteed with a sublinear regret bound for commonly used kernels and is empirically evaluated on several synthetic and real-world problems. We evaluate the performance of BILBO on 4 synthetic and 2 real-world problems. We introduce 2 baselines for comparison: Trusted Rand and Nested. Results are averaged over 5 runs and performance is compared by examining the instantaneous regret against query count with 95% confidence intervals. |
| Researcher Affiliation | Collaboration | ¹Institute of Data Science, National University of Singapore, Singapore; ²Amazon, Australia; ³Department of Computer Science, National University of Singapore, Singapore. Correspondence to: Ruth Wan Theng Chew <EMAIL>. |
| Pseudocode | Yes | The key components are illustrated in Figure 2 and the pseudocode is in Algorithm 1. |
| Open Source Code | Yes | The code is provided in https://github.com/chewwt/bilbo/, and a notation table is in Appendix A. |
| Open Datasets | Yes | SMD2, SMD6, and SMD12 are adapted from the SMD suite of test problems for bilevel optimization (Sinha et al., 2014). We simulated a bilevel energy market problem... simulated using PyPSA (Brown et al., 2018). We used the COCO simulator to simulate carbonylation of Di-Methyl Ether (DME)... adapted from the flowsheet provided by ChemSep. |
| Dataset Splits | No | The paper describes a Bayesian Optimization algorithm, which iteratively selects query points rather than operating on predefined, static training/test/validation splits of a dataset. It mentions "All experiments, except Nested, are initialized with 3 observations on each function." as initial data, but no explicit dataset splits. |
| Hardware Specification | Yes | The experiments in this paper were done on a computer with an AMD Ryzen 7 5700X 8-Core Processor and 64 GB of RAM, unless otherwise specified. In terms of wall-clock time, the 2-dimensional Branin-Hoo + Goldstein-Price experiment took about 40 seconds of total runtime over 5 runs on a Mac Studio with an M2 Ultra. |
| Software Dependencies | No | Algorithms are implemented using GPyTorch (Gardner et al., 2018). A GP with a Matérn 5/2 kernel was used... Nested uses the sequential least squares programming (SLSQP) optimizer... simulated using PyPSA (Brown et al., 2018). We used the COCO simulator to simulate... adapted from the flowsheet provided by ChemSep. While multiple software components are named, none include specific version numbers. |
| Experiment Setup | Yes | All observations are noisy with σn = 0.01, and outputs are normalized to have mean 0 and standard deviation 1. A GP with a Matérn 5/2 kernel was used, and the GP hyperparameters were automatically tuned at each iteration using maximum likelihood estimation on the past observations. The hyperparameters include the length scale and the prior mean. The prior mean is initialized to 0 for all experiments, since the output is already normalized. The initial length scale and other parameters for each experiment are set according to Table 1. |
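The experiment-setup row describes a standard GP surrogate: a Matérn 5/2 kernel, observation noise σn = 0.01, and outputs normalized to zero mean and unit standard deviation. The paper implements this with GPyTorch; the sketch below is a minimal NumPy illustration of the same ingredients (1-D inputs, fixed hyperparameters rather than the paper's per-iteration MLE tuning), not a reproduction of the authors' code:

```python
import numpy as np

def matern52(x1, x2, lengthscale=1.0, outputscale=1.0):
    """Matern 5/2 kernel for 1-D inputs: k(r) = s^2 (1 + a + a^2/3) exp(-a), a = sqrt(5) r / l."""
    r = np.abs(x1[:, None] - x2[None, :])
    a = np.sqrt(5.0) * r / lengthscale
    return outputscale * (1.0 + a + a**2 / 3.0) * np.exp(-a)

def gp_posterior(x_train, y_train, x_test, noise=0.01, lengthscale=1.0):
    """GP posterior mean/variance with the paper's stated noise level (sigma_n = 0.01).

    Outputs are normalized to mean 0, std 1 before fitting, as in the paper's setup,
    and the posterior is mapped back to the original scale.
    """
    mu, sd = y_train.mean(), y_train.std()
    y = (y_train - mu) / sd
    K = matern52(x_train, x_train, lengthscale) + noise**2 * np.eye(len(x_train))
    Ks = matern52(x_test, x_train, lengthscale)
    Kss = matern52(x_test, x_test, lengthscale)
    alpha = np.linalg.solve(K, y)
    mean = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean * sd + mu, np.diag(cov) * sd**2
```

With the small noise level, the posterior mean interpolates the observations almost exactly; in the paper this surrogate is additionally refit at every iteration by maximizing the marginal likelihood over the length scale and prior mean.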