Preferential Multi-Objective Bayesian Optimization
Authors: Raul Astudillo, Kejun Li, Maegan Tucker, Chu Xin Cheng, Aaron Ames, Yisong Yue
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DSTS across four synthetic test functions and two simulated tasks (exoskeleton personalization and driving policy design), demonstrating that it outperforms several benchmarks. Finally, we prove that DSTS is asymptotically consistent. Along the way, we provide, to our knowledge, the first convergence guarantee for dueling Thompson sampling in single-objective PBO. |
| Researcher Affiliation | Academia | Raul Astudillo¹, Kejun Li¹, Maegan Tucker², Chu Xin Cheng¹, Aaron D. Ames¹, Yisong Yue¹ — ¹California Institute of Technology, ²Georgia Institute of Technology |
| Pseudocode | Yes | Algorithm 1 Dueling Scalarized Thompson Sampling |
| Open Source Code | Yes | The code for reproducing our experiments can be found at https://github.com/RaulAstudillo06/PMBO. |
| Open Datasets | Yes | DTLZ1 and DTLZ2 The DTLZ1 and DTLZ2 functions are standard test problems from the multiobjective optimization literature (Deb et al., 2005). |
| Dataset Splits | No | In all problems, an initial dataset is obtained using 2(d + 1) queries chosen uniformly at random over Xq, where d is the input dimension of the problem. After this initial stage, each algorithm was used to select 100 additional queries sequentially. Since simulations are time-consuming, we build surrogate objectives by fitting a (regular) Gaussian process to the objectives obtained from 1000 simulations, with each set of gait features drawn uniformly over the design space. The paper describes data generation and surrogate model building, but does not report explicit train/test/validation splits. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. |
| Software Dependencies | No | All algorithms are implemented using BoTorch (Balandat et al., 2020). The paper mentions BoTorch as a framework and cites its publication year, but does not provide a specific version number (e.g., BoTorch 1.x). |
| Experiment Setup | Yes | In all problems, an initial dataset is obtained using 2(d + 1) queries chosen uniformly at random over Xq, where d is the input dimension of the problem. After this initial stage, each algorithm was used to select 100 additional queries sequentially. Results for q = 2 are shown in Figure 3. Each experiment was replicated 30 times using different initial datasets. In all problems, the DM's responses are corrupted by moderate levels of Gumbel noise, which is consistent with the use of a Logistic likelihood (see Appendix B.2 for the details). In each problem, for every objective, λ_j^true is chosen such that, on average, the DM makes a mistake 20% of the time when comparing pairs of designs among those with the top 1% objective values within X. |
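Two pieces of the quoted setup are easy to illustrate concretely: the initial design of 2(d + 1) uniformly random queries, and the decision-maker (DM) response model in which Gumbel noise on each utility yields a Logistic likelihood over pairwise preferences (the difference of two i.i.d. Gumbel variables is logistic). The sketch below is a minimal illustration of those two ingredients only, not the paper's DSTS implementation; the function names and the unit-box bounds are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def initial_design(d, bounds, rng=rng):
    """Draw 2*(d+1) query points uniformly at random over a d-dimensional box,
    mirroring the paper's initial-dataset size (illustrative helper)."""
    n = 2 * (d + 1)
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    return lo + (hi - lo) * rng.random((n, d))

def simulate_dm_response(u1, u2, noise_scale=1.0, rng=rng):
    """Simulated DM comparing two designs with latent utilities u1, u2.
    Each utility is corrupted by i.i.d. Gumbel noise, so
    P(prefer design 1) = sigmoid((u1 - u2) / noise_scale),
    i.e. a Logistic likelihood. Returns 1 or 2 (the preferred design)."""
    z1 = u1 + rng.gumbel(scale=noise_scale)
    z2 = u2 + rng.gumbel(scale=noise_scale)
    return 1 if z1 > z2 else 2

# Initial stage for a d = 3 problem on the unit box (assumed bounds):
X = initial_design(3, ([0, 0, 0], [1, 1, 1]))
print(X.shape)  # (8, 3) — i.e., 2*(3+1) = 8 queries
```

With `noise_scale = 1`, a utility gap of 1 gives a preference probability of sigmoid(1) ≈ 0.73, so the simulated DM "makes mistakes" at a rate controlled by the gap, which is the mechanism the paper tunes (via λ_j^true) to a 20% error rate among top designs.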