Towards Universal Offline Black-Box Optimization via Learning Language Model Embeddings

Authors: Rong-Xi Tan, Ming Chen, Ke Xue, Yao Wang, Yaoyuan Wang, Fu Sheng, Chao Qian

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate the universality and effectiveness of our proposed methods. Our findings suggest that unifying language model priors and learning string embedding space can overcome traditional barriers in universal BBO, paving the way for general-purpose BBO algorithms. The code is provided at https://github.com/lamda-bbo/universal-offline-bbo. ... In this section, we empirically study our proposed framework of universal string-based offline BBO on various tasks. We first introduce the experimental settings and the tasks in Section 4.1, and then show the performance of UniSO and answer several important research questions (RQs) in Section 4.2. ... As shown in Table 1, we find that UniSO methods, which utilize string representation inputs: (1) are capable of solving offline BBO, with most of the final scores exceeding the best score in the offline dataset D (best), except for Ant and D'Kitty in UniSO-N; (2) show competitive results against the numeric-input experts, where UniSO-T achieves an average rank of 2.000 among the four methods, performing the best on 3 of the 10 tasks and being the runner-up on 4, while the best expert, BN + BO (i.e., batch normalization ...
Researcher Affiliation | Collaboration | (1) National Key Laboratory for Novel Software Technology, Nanjing University, China; (2) School of Artificial Intelligence, Nanjing University, China; (3) Advanced Computing and Storage Lab, Huawei Technologies Co., Ltd., China. Correspondence to: Ke Xue <EMAIL>, Chao Qian <EMAIL>.
Pseudocode | No | The paper describes methodologies in prose and uses figures (Fig. 1, Fig. 2) to illustrate model architectures. No explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor code-like formatted procedures, are present in the main text or appendices.
Open Source Code | Yes | The code is provided at https://github.com/lamda-bbo/universal-offline-bbo.
Open Datasets | Yes | To validate the effectiveness of these methods, we collect offline BBO tasks and data from open-source academic works for training. ... We consider unconstrained tasks from two benchmarks for offline BBO: the popular Design-Bench (Trabucco et al., 2022) and the recently proposed benchmark SOO-Bench (Qian et al., 2025). ... Design-Bench (Trabucco et al., 2022) is a famous benchmark suite for offline BBO. It includes various realistic tasks from real-world optimization problems, and each task corresponds to an oracle function for evaluation and a large static offline dataset. In this paper, we mainly consider 6 tasks in Design-Bench, and we directly use the open-sourced dataset of Design-Bench (https://huggingface.co/datasets/beckhamc/design_bench_data) as a part of the training data.
Dataset Splits | Yes | For the few-shot setting, we first use the few-shot data to fine-tune the universal regressor using the main loss (i.e., cross-entropy for UniSO-T and MSE for UniSO-N) and the SGD optimizer with a learning rate of 2 × 10⁻⁵ for 5 epochs, and then search for final designs. We use the data provided by Wang et al. (2024a) and the poorest 100 pairs of data to construct the few-shot dataset.
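The few-shot recipe quoted here (SGD, learning rate 2 × 10⁻⁵, MSE loss, 5 epochs over the few-shot pairs) can be sketched in a few lines. This is a hypothetical, stdlib-only illustration: a 1-D linear model stands in for the paper's universal regressor, and the data, function names, and full-batch gradient are assumptions for clarity, not the authors' implementation.

```python
# Sketch of few-shot fine-tuning: SGD with lr = 2e-5 on an MSE loss for 5
# epochs, as described in the report. Full-batch updates for brevity.

def mse_and_grads(w, b, xs, ys):
    """Mean-squared error of y ~ w*x + b, plus gradients w.r.t. w and b."""
    n = len(xs)
    errs = [(w * x + b) - y for x, y in zip(xs, ys)]
    loss = sum(e * e for e in errs) / n
    dw = sum(2 * e * x for e, x in zip(errs, xs)) / n
    db = sum(2 * e for e in errs) / n
    return loss, dw, db

def few_shot_finetune(w, b, xs, ys, lr=2e-5, epochs=5):
    """Plain SGD for a fixed number of epochs, returning updated parameters."""
    for _ in range(epochs):
        _, dw, db = mse_and_grads(w, b, xs, ys)
        w -= lr * dw
        b -= lr * db
    return w, b

# Toy few-shot dataset: 100 (design, score) pairs, mirroring the report's count.
xs = [i / 100 for i in range(100)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0
loss_before, _, _ = mse_and_grads(w, b, xs, ys)
w, b = few_shot_finetune(w, b, xs, ys)
loss_after, _, _ = mse_and_grads(w, b, xs, ys)
```

With such a small learning rate the loss decreases only slightly over 5 epochs, which matches the conservative fine-tuning the quoted setting describes.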
Hardware Specification | No | We conduct all our experiments on a system with 4 GPUs (total computing power 188 TFLOPS) and a 128-core CPU. ... This mentions general hardware components (GPUs, CPU) and aggregate computing power (TFLOPS) but does not provide specific models or types for the GPUs or CPU, which are required for a reproducible description.
Software Dependencies | No | Following recent works in string-based LLMs for BBO (Song et al., 2024a; Nguyen et al., 2024), we use a lightweight version of the encoder-decoder T5 models (Raffel et al., 2020) for both the UniSO-T and UniSO-N variants. ... For BO, we adopt BO-qEI for continuous problems, which is implemented with BoTorch (Balandat et al., 2020). ... For EAs, we use pymoo (Blank & Deb, 2020) for implementation. ... For ES, we use CMA-ES (Hansen, 2016) implemented with evosax (Lange, 2023). ... The paper mentions various software tools and libraries (T5 models, BoTorch, pymoo, evosax) along with their corresponding research papers and years, but it does not explicitly state version numbers for these packages (e.g., 'BoTorch v0.5' or 'pymoo v1.2.3'), which are required for a reproducible description of ancillary software.
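Since the paper names its dependencies (BoTorch, pymoo, evosax) but not their versions, a reproduction attempt could at least record what is installed locally. The sketch below uses only the standard library's `importlib.metadata`; the package names come from the paper, and whether they are installed in any given environment is not assumed.

```python
# Record installed versions of the ancillary packages the paper names.
# Packages that are absent are reported as such rather than raising.
from importlib.metadata import version, PackageNotFoundError

def report_versions(packages):
    """Map each distribution name to its installed version string, or None."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

env = report_versions(["botorch", "pymoo", "evosax"])
for name, ver in env.items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Pinning the resulting `name==version` lines in a requirements file would close exactly the gap this report flags.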
Experiment Setup | Yes | For UniSO-T, we use the T5-based architecture. Pre-training details are listed in Table 3. ... The model is trained for 200 epochs with a batch size of 128. For few-shot fine-tuning, we fine-tune the model for 5 epochs using SGD. During inference, we set the temperature to 0.7, and apply top-k sampling with k = 20 and nucleus sampling with p = 0.95. For UniSO-N, ... we train the model using the AdamW optimizer (Loshchilov & Hutter, 2019) for 200 epochs with a batch size of 128. ... We set the initial α = 0.5 and search for 100 iterations.
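The decoding configuration quoted above (temperature 0.7, top-k with k = 20, nucleus sampling with p = 0.95) combines three standard filters on the model's output distribution. A stdlib-only sketch of that combination follows; the logits, function name, and filtering order (temperature, then top-k, then top-p) are illustrative assumptions, not the paper's code.

```python
# Sketch: sample one token id after temperature scaling, top-k filtering,
# and nucleus (top-p) filtering, with the paper's reported hyperparameters
# as defaults.
import math
import random

def sample_token(logits, temperature=0.7, top_k=20, top_p=0.95, rng=random):
    """Return a token index sampled from the filtered, renormalized distribution."""
    # Temperature scaling, then a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]
    # Top-k: keep only the k most probable tokens.
    probs.sort(key=lambda ip: ip[1], reverse=True)
    probs = probs[:top_k]
    # Nucleus: keep the smallest prefix whose cumulative mass reaches p.
    kept, mass = [], 0.0
    for i, p_i in probs:
        kept.append((i, p_i))
        mass += p_i
        if mass >= top_p:
            break
    # Renormalize over the surviving tokens and sample.
    z = sum(p_i for _, p_i in kept)
    r = rng.random() * z
    for i, p_i in kept:
        r -= p_i
        if r <= 0:
            return i
    return kept[-1][0]
```

When one logit dominates, the nucleus filter collapses to that single token, so sampling becomes effectively greedy; with flat logits, top-k is the binding constraint and sampling stays diverse among the k survivors.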