Revolve: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization

Authors: Peiyan Zhang, Haibo Jin, Leyang Hu, Xinnuo Li, Liying Kang, Man Luo, Yangqiu Song, Haohan Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across three tasks demonstrate the adaptability and efficiency of our proposal. Beyond its practical contributions, REVOLVE highlights a promising direction in which the rich knowledge of established optimization principles can be leveraged to enhance LLM systems, paving the way for further advances in this hybrid domain.
Researcher Affiliation | Collaboration | 1 Hong Kong University of Science and Technology; 2 University of Illinois at Urbana-Champaign; 3 Brown University; 4 University of Michigan, Ann Arbor; 5 Hong Kong Polytechnic University; 6 Intel Labs. Correspondence to: Haohan Wang <EMAIL>.
Pseudocode | No | The paper describes the method with mathematical equations and narrative text (e.g., "Forward Pass", "Language Loss Computation", "Backward Pass"), but it does not contain any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Code is available at: https://llm-revolve.netlify.app.
Open Datasets | Yes | For prompt optimization, we use the BIG-Bench Hard dataset (Suzgun et al., 2022) for Object Counting and the GSM8K dataset (Cobbe et al., 2021) for grade-school math problems. For solution optimization, we assess performance on the Google-Proof Question Answering (GPQA) benchmark (Rein et al., 2023), which consists of expert-level multiple-choice questions, and on the Machine Learning and College Physics subsets of MMLU (Hendrycks et al., 2020), a benchmark evaluating LLMs against human-level performance. For code optimization, we use the LeetCode Hard dataset (Shinn et al., 2024), which includes complex coding problems that challenge both humans and models.
Dataset Splits | Yes | For the dataset splits, we follow the settings used in TextGrad (Yuksekgonul et al., 2024). The BIG-Bench Hard Object Counting dataset is divided into 50/100/100 samples for train/validation/test, respectively. For GSM8K, we adopt the split from DSPy (Khattab et al., 2024), using 200/300/1319 samples for train/validation/test. In each task, we limit the training set to 36 samples, consistent with the TextGrad setup. Example queries for each dataset are shown below: [...] We use three iterations of optimization for each question when using the iterative optimization methods.
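The quoted splits and the 36-sample training cap can be summarized in a small sketch. All names here are illustrative stand-ins, not taken from the paper's codebase:

```python
# Illustrative summary of the quoted dataset splits (hypothetical names).
# Test-set sizes are omitted here; see the quoted report for the full splits.
SPLITS = {
    "bbh_object_counting": {"train": 50, "val": 100},
    "gsm8k": {"train": 200, "val": 300},
}
TRAIN_CAP = 36  # per the quoted setup, training is limited to 36 samples


def effective_train_size(task: str) -> int:
    """Number of training examples actually used after the 36-sample cap."""
    return min(SPLITS[task]["train"], TRAIN_CAP)
```

Under this reading, both tasks train on exactly 36 examples, which matches the batch size of 3 over 12 optimization iterations described in the experiment setup.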
Hardware Specification | Yes | We use Llama 3.1 8B Instruct as the base LLM, running on a setup with 4 NVIDIA RTX 3090 GPUs.
Software Dependencies | No | The paper names the LLMs used in the experiments (e.g., gpt-3.5-turbo-0125, GPT-4-0125-preview, Gemini 1.5 Pro, Llama 3.1 8B Instruct), which are models or services. However, it does not provide version numbers for the underlying software libraries, frameworks (such as PyTorch or TensorFlow), or other ancillary software required for reproduction.
Experiment Setup | Yes | Regarding specific hyperparameters for the LLMs, we set the temperature to 0 (1e-6 for Llama 3.1 8B Instruct), allow a maximum of 2000 new tokens, and use a top-p value of 0.99. [...] For each task, when using the iterative optimization methods, we use a batch size of 3 across 12 optimization iterations, allowing the model to process a total of 36 training examples, randomly sampled with replacement. After each iteration, we validate the prompt using a validation set, and if the validation accuracy improves, we update the prompt accordingly. [...] We use three iterations of optimization for each question when using the iterative optimization methods.
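The quoted loop (batch size 3, 12 iterations, sampling with replacement, validation-gated prompt updates) can be sketched as follows. This is a minimal sketch, assuming hypothetical `propose_update` and `validate` callables; it is not the paper's actual implementation:

```python
import random


def optimize_prompt(prompt, train_set, val_set, propose_update, validate,
                    batch_size=3, iterations=12, seed=0):
    """Sketch of a validation-gated iterative prompt-optimization loop.

    propose_update(prompt, batch) -> candidate prompt  (hypothetical helper)
    validate(prompt, val_set)     -> validation accuracy (hypothetical helper)
    """
    rng = random.Random(seed)
    best_acc = validate(prompt, val_set)
    for _ in range(iterations):
        # Sample the batch with replacement, as described in the quoted setup.
        batch = [rng.choice(train_set) for _ in range(batch_size)]
        candidate = propose_update(prompt, batch)
        acc = validate(candidate, val_set)
        # Keep the candidate only if validation accuracy improves.
        if acc > best_acc:
            prompt, best_acc = candidate, acc
    return prompt, best_acc
```

With the quoted defaults, the loop touches batch_size x iterations = 36 training examples, matching the 36-sample training cap in the dataset-splits row.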