Hyperband-based Bayesian Optimization for Black-box Prompt Selection
Authors: Lennart Schneider, Martin Wistuba, Aaron Klein, Jacek Golebiowski, Giovanni Zappella, Felice Antonio Merra
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across ten diverse benchmarks and three LLMs demonstrate that HbBoPs outperforms state-of-the-art methods in both performance and efficiency. |
| Researcher Affiliation | Collaboration | 1Work done during an internship at Amazon Web Services, Berlin, Germany. 2Amazon Web Services, Berlin, Germany. 3ScaDS.AI, University of Leipzig, Germany; work done while at Amazon. 4distil labs, Berlin, Germany; work done while at Amazon. 5Cognism, Remote, Italy; work done while at Amazon. |
| Pseudocode | Yes | Algorithm 1 HbBoPs |
| Open Source Code | No | The paper states: "We implement our HbBoPs in GPyTorch (Gardner et al., 2018) and run it as described in Section 4." However, it provides no explicit code-release statement or repository link for HbBoPs. It mentions using publicly available code bases only for the baselines, not for its own method. |
| Open Datasets | Yes | We benchmark HbBoPs on ten tasks commonly used for LLM evaluation (Zhou et al., 2023; Lin et al., 2024; Chen et al., 2024; Wu et al., 2024; Shi et al., 2024). AI2's Reasoning Challenge (ARC) (Clark et al., 2018): multiple-choice Q&A problems; Grade School Math 8K (Cobbe et al., 2021): math problems taking between two and eight steps to solve; eight tasks from the BBII subset of the BIG-bench and instruction induction benchmarks (Srivastava et al., 2023; Honovich et al., 2023) used in Zhou et al. (2023), Wu et al. (2024), and Shi et al. (2024): antonyms, larger animal, negation, second word letter, sentiment, object counting, orthography starts with, and word unscrambling. |
| Dataset Splits | Yes | For AI2 ARC, we use the official train, validation and test splits from AI2's Reasoning Challenge. [...] GSM8K officially only contains a train and test split. We sampled 1319 instances from the train split uniformly at random to create a validation set of comparable size to the test set. For all other tasks from the BBII subset of the BIG-bench and instruction induction benchmarks, we use the splits as proposed by Wu et al. (2024). [...] Table 5 (Characteristics of tasks used in the experiments) lists, per task, the setting and the split sizes ntrain, nvalid, and ntest. |
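The GSM8K validation-split protocol quoted above (a uniform draw of 1319 train instances, matching the official test-set size) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name, seed, and the use of Python's standard library are assumptions.

```python
import random

def make_validation_split(train, n_valid=1319, seed=0):
    """Sample a validation set uniformly at random from the train split.

    Mirrors the paper's GSM8K protocol: n_valid instances are drawn from
    the train split without replacement; the remainder stays in train.
    The fixed seed is an assumption for reproducibility of this sketch.
    """
    rng = random.Random(seed)
    indices = rng.sample(range(len(train)), n_valid)
    chosen = set(indices)
    valid = [train[i] for i in indices]
    new_train = [ex for i, ex in enumerate(train) if i not in chosen]
    return new_train, valid

# Toy usage with placeholder instances (GSM8K's train split has 7473 problems).
train = [f"q{i}" for i in range(7473)]
new_train, valid = make_validation_split(train)
print(len(new_train), len(valid))  # 6154 1319
```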
| Hardware Specification | No | The paper specifies the LLMs used (Claude 3 Haiku, LLAMA3 8B Instruct, and Mistral 7B Instruct) and their hyperparameters, but does not provide any details about the hardware (e.g., GPU, CPU models, or cloud computing resources) on which these LLMs were run or the experiments were conducted. |
| Software Dependencies | No | The paper mentions software libraries like "BoTorch (Balandat et al., 2020)", "GPyTorch (Gardner et al., 2018)", "DSPy", and "TRIPLE", along with the optimizer "AdamW (Loshchilov & Hutter, 2019)". However, it only cites the papers describing these tools or general library names, without providing specific version numbers for the software actually used in their implementation (e.g., PyTorch 1.9, Python 3.8). |
| Experiment Setup | Yes | All full-fidelity BO methods (vanilla BO, HDBO, BOPCA) use an ARD Matérn 5/2 kernel and Expected Improvement as acquisition function, normalize inputs to the unit cube, and standardize outputs to have zero mean and unit variance. [...] HbBoPs uses an ARD Matérn 5/2 kernel, normalizes inputs to the unit cube and standardizes outputs. [...] To optimize the log marginal likelihood in Equation (6), we use AdamW (Loshchilov & Hutter, 2019) with learning rate = 0.01, maximum number of epochs = 3000, and early termination with patience = 10. Within the HB schedule, we use a lower limit on the number of validation instances bmin = 10 and a halving parameter η = 2.0. [...] Claude 3 Haiku: max tokens = 200, temperature = 0.5, top p = 1.0, top k = 250; LLAMA3 8B Instruct: max tokens = 512, temperature = 0.5, top p = 0.9; Mistral 7B Instruct: max tokens = 512, temperature = 0.5, top p = 0.9, top k = 50. For GSM8K we increase max tokens for all LLMs to 1024. [...] we perform random interleaving as described in Falkner et al. (2018) for each proposal with a probability of ρ = 0.1. |
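Under the reported fidelity settings (bmin = 10, η = 2.0), the geometric ladder of validation-set budgets implied by the HB schedule can be sketched as below. This is a minimal illustration of rung sizes, not the authors' implementation; in particular, capping the final rung at the full validation-set size is an assumption.

```python
def hb_budgets(n_valid, b_min=10, eta=2.0):
    """Budget ladder b_min, b_min*eta, b_min*eta^2, ..., capped at n_valid.

    Each budget is a number of validation instances on which a prompt is
    evaluated; the cap at the full validation set is an assumption of
    this sketch, not stated explicitly in the paper.
    """
    budgets = []
    b = float(b_min)
    while b < n_valid:
        budgets.append(int(b))
        b *= eta
    budgets.append(n_valid)  # top rung: evaluate on the full validation set
    return budgets

# E.g. for a validation set of 1319 instances (the GSM8K validation size):
print(hb_budgets(1319))  # [10, 20, 40, 80, 160, 320, 640, 1280, 1319]
```

With η = 2.0 each rung doubles the number of validation instances, so a prompt that survives to the top rung of this ladder is evaluated on roughly nine successively larger subsets rather than on the full set from the start.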