GReaTer: Gradients Over Reasoning Makes Smaller Language Models Strong Prompt Optimizers
Authors: Sarkar Snigdha Sarathi Das, Ryo Kamoi, Bo Pang, Yusen Zhang, Caiming Xiong, Rui Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations across diverse reasoning tasks including BBH, GSM8k, and FOLIO demonstrate that GREATER consistently outperforms previous state-of-the-art prompt optimization methods... In this section, we demonstrate that GREATER is highly effective in prompt optimization delivering substantial performance improvement across different tasks. Section 5.1 describes the experiment setup. Section 5.2 presents the main results of the GREATER performance with smaller language models. In Section 5.3, we compare GREATER prompts optimized by smaller language models against the prompts optimized by larger proprietary language models using state-of-the-art baseline methods. Section 5.4 performs an ablation study on the effectiveness of gradient over reasoning in GREATER. |
| Researcher Affiliation | Collaboration | The Pennsylvania State University; Salesforce Research |
| Pseudocode | Yes | Algorithm 1 GREATER |
| Open Source Code | Yes | Code of GReaTer is available at: https://github.com/psunlpgroup/GreaTer |
| Open Datasets | Yes | To evaluate the efficacy of our approach, we use GSM8K (Cobbe et al., 2021), Big Bench-Hard (BBH) (Suzgun et al., 2022), and FOLIO (Han et al., 2022) benchmark datasets for diverse reasoning tasks in mathematics, commonsense, and logical reasoning. |
| Dataset Splits | Yes | For GSM8K, we used 100/100 for the train/dev set, and the original test set of size 1319. Then, for the BBH datasets, we used 21 selected BBH tasks as in Table 11 and Table 14. This covers almost all types of tasks in the BBH dataset... For all tasks we use 50/100/100 train/dev/test splits similar to (Yuksekgonul et al., 2024). Finally, for the FOLIO dataset, we used the latest version of FOLIO (Han et al., 2022) for our evaluation... The original validation split (203 rows) is used for the evaluation of FOLIO, whereas 50/100 samples are taken for the train and dev sets respectively out of the original train split. |
| Hardware Specification | Yes | We run all our experiments on 2X NVIDIA A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions using Llama-3-8B-Instruct and Gemma-2-9B-it models but does not specify any software libraries or frameworks with their version numbers that were used for implementation. |
| Experiment Setup | Yes | As shown in Algorithm 1, we run GREATER for T = 105 steps with k = 10 for top-k, q = 5 and λ = 0.2 (in Eq. 6). |
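The dataset splits and hyperparameters reported above can be collected into a single configuration sketch. This is an illustrative summary only, assuming the reported numbers; the dictionary names and the helper function are hypothetical and do not come from the GReaTer codebase.

```python
# Hypothetical configuration summary of the reported GReaTer setup.
# All numbers are taken from the review table above; names are illustrative.

DATASET_SPLITS = {
    "GSM8K": {"train": 100, "dev": 100, "test": 1319},  # original test set
    "BBH":   {"train": 50,  "dev": 100, "test": 100},   # per each of the 21 selected tasks
    "FOLIO": {"train": 50,  "dev": 100, "test": 203},   # test = original validation split
}

HYPERPARAMS = {
    "T": 105,        # optimization steps
    "top_k": 10,     # top-k candidate tokens
    "q": 5,
    "lambda": 0.2,   # weighting term in Eq. 6 of the paper
}

NUM_BBH_TASKS = 21

def total_test_examples(splits: dict) -> int:
    """Sum test-set sizes across benchmarks (BBH counted once per task)."""
    return (splits["GSM8K"]["test"]
            + NUM_BBH_TASKS * splits["BBH"]["test"]
            + splits["FOLIO"]["test"])

print(total_test_examples(DATASET_SPLITS))  # 1319 + 21*100 + 203 = 3622
```

Under these assumed splits, the evaluation covers 3,622 test examples in total across the three benchmark families.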