Searching for Optimal Solutions with LLMs via Bayesian Optimization

Authors: Dhruv Agarwal, Manoj Ghuhan Arivazhagan, Rajarshi Das, Sandesh Swamy, Sopan Khosla, Rashmi Gangadharaiah

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on word search, molecule optimization, and a joint hypothesis+program search task using a 1-D version of the challenging Abstraction and Reasoning Corpus (1D-ARC). Our results show that BOPRO outperforms all baselines in word search (10 points) and molecule optimization (higher quality and 17% fewer invalid molecules), but trails a best-k prompting strategy in program search.
Researcher Affiliation | Collaboration | University of Massachusetts Amherst, AWS AI Labs, Meta AI
Pseudocode | No | The paper describes the Bayesian-OPRO (BOPRO) methodology in Section 5, detailing its components and processes in prose, but it does not include a distinct pseudocode or algorithm block.
Open Source Code | Yes | Our code is available at: https://github.com/amazon-science/BOPRO-ICLR-2025.
Open Datasets | Yes | Dockstring (García-Ortegón et al., 2022) provides a benchmark of challenging molecule optimization tasks... In this work, we use a 1-dimensional version of the dataset, 1D-ARC, introduced by Xu et al. (2024).
Dataset Splits | Yes | Following Wang et al. (2024), we run a preliminary evaluation on their filtered subset of 108 problem instances. [...] We, therefore, construct 1D-ARC-Hard, a challenging subset suitable for evaluating search performance by first running RS for 100 generations for each of the 901 problems in the original dataset, followed by sampling 130 instances from the unsolved subset of 175 problems.
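The 1D-ARC-Hard construction quoted above (run a random-search baseline for a fixed budget, then sample from the unsolved remainder) can be sketched as follows. This is an illustration, not the authors' script: `solver` is a hypothetical stand-in for the paper's random-search run, and the budget and subset size mirror the quoted numbers.

```python
import random

def build_hard_subset(problems, solver, budget=100, subset_size=130, seed=0):
    """Sketch of the 1D-ARC-Hard filtering step: keep only problems the
    random-search baseline (`solver`, hypothetical) fails to solve within
    `budget` generations, then sample `subset_size` of them."""
    unsolved = [p for p in problems if not solver(p, budget)]
    rng = random.Random(seed)
    return rng.sample(unsolved, min(subset_size, len(unsolved)))
```

In the paper's setup this corresponds to 901 problems in, 175 unsolved, and 130 sampled out.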
Hardware Specification | No | The paper mentions using 'AWS Bedrock APIs to access the LLMs used in this work' but does not provide specific hardware details such as GPU/CPU models or detailed cloud instance types for running experiments.
Software Dependencies | No | The paper mentions using software libraries such as 'Huggingface transformers', 'AWS Bedrock APIs', 'sentence-transformers', 'BoTorch', and 'GPyTorch', but it does not specify explicit version numbers for these software dependencies, only citing their original papers.
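Because the paper leaves versions unspecified, a replicator can at least record the versions in their own environment with the standard library. A minimal sketch; the PyPI distribution names below are assumptions inferred from the libraries the paper mentions:

```python
from importlib import metadata

# Likely PyPI names for the stack the paper cites (an assumption; the
# paper gives neither exact package names nor version numbers).
STACK = ["transformers", "sentence-transformers", "botorch", "gpytorch"]

def record_versions(packages=STACK):
    """Return {package: installed version, or None if absent} so that a
    rerun can pin the dependency set the paper leaves unspecified."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions
```

Writing this mapping to a lockfile alongside results makes the missing-versions gap reproducible on the replicator's side, even if the original versions stay unknown.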
Experiment Setup | Yes | In Appendix A.4.1, the paper provides detailed task-specific settings for Semantle, Dockstring, and 1D-ARC, including 'Number of warm-start candidates', 'number of solution generations', 'BO optimization batch size', 'BO decoding batch size', 'number of in-context examples', 'repeat retries', and the 'representation model'. Additionally, in Appendix A.4.3, 'LLM AND EMBEDDINGS SETUP', it specifies 'decoding parameters used for sampling solutions from the LLM are temperature=1.0, top_p=0.9, max_new_tokens=512'.
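The reported decoding parameters (temperature=1.0, top_p=0.9) can be made concrete with a pure-Python sketch of temperature plus nucleus (top-p) sampling. This illustrates what the parameters mean, not the authors' Bedrock-based implementation; max_new_tokens would simply cap how many times a step like this is taken.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Sample one token id from raw logits using temperature scaling plus
    nucleus (top-p) sampling, the decoding scheme the paper reports."""
    rng = rng or random.Random()
    # Temperature-scaled softmax (max-subtracted for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of highest-probability tokens whose cumulative
    # mass reaches top_p (the "nucleus").
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample within the nucleus, proportional to the retained mass.
    mass = sum(probs[i] for i in nucleus)
    r, acc = rng.random() * mass, 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]
```

At temperature=1.0 the softmax is unscaled; lowering the temperature sharpens it toward the argmax, while top_p=0.9 discards the low-probability tail before sampling.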