Searching for Optimal Solutions with LLMs via Bayesian Optimization

Authors: Dhruv Agarwal, Manoj Ghuhan Arivazhagan, Rajarshi Das, Sandesh Swamy, Sopan Khosla, Rashmi Gangadharaiah

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on word search, molecule optimization, and a joint hypothesis+program search task using a 1-D version of the challenging Abstraction and Reasoning Corpus (1D-ARC). Our results show that BOPRO outperforms all baselines in word search (10 points) and molecule optimization (higher quality and 17% fewer invalid molecules), but trails a best-k prompting strategy in program search.
Researcher Affiliation | Collaboration | University of Massachusetts Amherst, AWS AI Labs, Meta AI
Pseudocode | No | The paper describes the Bayesian-OPRO (BOPRO) methodology in Section 5, detailing its components and processes in prose, but it does not include a distinct pseudocode or algorithm block.
Open Source Code | Yes | Our code is available at: https://github.com/amazon-science/BOPRO-ICLR-2025.
Open Datasets | Yes | Dockstring (García-Ortegón et al., 2022) provides a benchmark of challenging molecule optimization tasks... In this work, we use a 1-dimensional version of the dataset, 1D-ARC, introduced by Xu et al. (2024).
Dataset Splits | Yes | Following Wang et al. (2024), we run a preliminary evaluation on their filtered subset of 108 problem instances. [...] We, therefore, construct 1D-ARC-Hard, a challenging subset suitable for evaluating search performance by first running RS for 100 generations for each of the 901 problems in the original dataset, followed by sampling 130 instances from the unsolved subset of 175 problems.
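The 1D-ARC-Hard construction quoted above (run a random-search baseline for a fixed budget, then sample from the unsolved remainder) can be sketched as follows. This is an illustration, not the authors' script: `solver` is a hypothetical stand-in for the paper's random-search run, and the budget and subset size mirror the quoted numbers.

```python
import random

def build_hard_subset(problems, solver, budget=100, subset_size=130, seed=0):
    """Sketch of the 1D-ARC-Hard filtering step: keep only problems the
    random-search baseline (`solver`, hypothetical) fails to solve within
    `budget` generations, then sample `subset_size` of them."""
    unsolved = [p for p in problems if not solver(p, budget)]
    rng = random.Random(seed)
    return rng.sample(unsolved, min(subset_size, len(unsolved)))
```

In the paper's setup this corresponds to 901 problems in, 175 unsolved, and 130 sampled out.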
Hardware Specification | No | The paper mentions using 'AWS Bedrock APIs to access the LLMs used in this work' but does not provide specific hardware details such as GPU/CPU models or detailed cloud instance types for running experiments.
Software Dependencies | No | The paper mentions using software libraries such as 'Huggingface transformers', 'AWS Bedrock APIs', 'sentence-transformers', 'BoTorch', and 'GPyTorch', but it does not specify explicit version numbers for these software dependencies, only citing their original papers.
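Because the paper leaves versions unspecified, a replicator can at least record the versions in their own environment with the standard library. A minimal sketch; the PyPI distribution names below are assumptions inferred from the libraries the paper mentions:

```python
from importlib import metadata

# Likely PyPI names for the stack the paper cites (an assumption; the
# paper gives neither exact package names nor version numbers).
STACK = ["transformers", "sentence-transformers", "botorch", "gpytorch"]

def record_versions(packages=STACK):
    """Return {package: installed version, or None if absent} so that a
    rerun can pin the dependency set the paper leaves unspecified."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions
```

Writing this mapping to a lockfile alongside results makes the missing-versions gap reproducible on the replicator's side, even if the original versions stay unknown.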
Experiment Setup | Yes | In Appendix A.4.1, the paper provides detailed task-specific settings for Semantle, Dockstring, and 1D-ARC, including 'Number of warm-start candidates', 'number of solution generations', 'BO optimization batch size', 'BO decoding batch size', 'number of in-context examples', 'repeat retries', and the 'representation model'. Additionally, in Appendix A.4.3, 'LLM AND EMBEDDINGS SETUP', it specifies 'decoding parameters used for sampling solutions from the LLM are temperature=1.0, top_p=0.9, max_new_tokens=512'.
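The reported decoding parameters (temperature=1.0, top_p=0.9) can be made concrete with a pure-Python sketch of temperature plus nucleus (top-p) sampling. This illustrates what the parameters mean, not the authors' Bedrock-based implementation; max_new_tokens would simply cap how many times a step like this is taken.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Sample one token id from raw logits using temperature scaling plus
    nucleus (top-p) sampling, the decoding scheme the paper reports."""
    rng = rng or random.Random()
    # Temperature-scaled softmax (max-subtracted for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of highest-probability tokens whose cumulative
    # mass reaches top_p (the "nucleus").
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample within the nucleus, proportional to the retained mass.
    mass = sum(probs[i] for i in nucleus)
    r, acc = rng.random() * mass, 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]
```

At temperature=1.0 the softmax is unscaled; lowering the temperature sharpens it toward the argmax, while top_p=0.9 discards the low-probability tail before sampling.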