GReaTer: Gradients Over Reasoning Makes Smaller Language Models Strong Prompt Optimizers
Authors: Sarkar Snigdha Sarathi Das, Ryo Kamoi, Bo Pang, Yusen Zhang, Caiming Xiong, Rui Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations across diverse reasoning tasks including BBH, GSM8k, and FOLIO demonstrate that GREATER consistently outperforms previous state-of-the-art prompt optimization methods... In this section, we demonstrate that GREATER is highly effective in prompt optimization delivering substantial performance improvement across different tasks. Section 5.1 describes the experiment setup. Section 5.2 presents the main results of the GREATER performance with smaller language models. In Section 5.3, we compare GREATER prompts optimized by smaller language models against the prompts optimized by larger proprietary language models using state-of-the-art baseline methods. Section 5.4 performs an ablation study on the effectiveness of gradient over reasoning in GREATER. |
| Researcher Affiliation | Collaboration | The Pennsylvania State University; Salesforce Research |
| Pseudocode | Yes | Algorithm 1 GREATER |
| Open Source Code | Yes | Code of GReaTer is available at: https://github.com/psunlpgroup/GreaTer |
| Open Datasets | Yes | To evaluate the efficacy of our approach, we use GSM8K (Cobbe et al., 2021), Big Bench-Hard (BBH) (Suzgun et al., 2022), and FOLIO (Han et al., 2022) benchmark datasets for diverse reasoning tasks in mathematics, commonsense, and logical reasoning. |
| Dataset Splits | Yes | For GSM8K, we used 100/100 for the train/dev set, and the original test set of size 1319. Then, for the BBH datasets, we used 21 selected BBH tasks as in Table 11 and Table 14. This covers almost all types of tasks in the BBH dataset... For all tasks we use 50/100/100 train/dev/test splits similar to (Yuksekgonul et al., 2024). Finally, for the FOLIO dataset, we used the latest version of FOLIO (Han et al., 2022) for our evaluation... The original validation split (203 rows) is used for the evaluation of FOLIO, whereas 50/100 samples are taken for the train and dev sets respectively out of the original train split. |
| Hardware Specification | Yes | We run all our experiments on 2X NVIDIA A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions using Llama-3-8B-Instruct and Gemma-2-9B-it models but does not specify any software libraries or frameworks with their version numbers that were used for implementation. |
| Experiment Setup | Yes | As shown in Algorithm 1, we run GREATER for T = 105 steps with k = 10 for top-k, q = 5 and λ = 0.2 (in Eq. 6). |
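The dataset splits and hyperparameters reported above can be collected into a single configuration sketch. This is an illustrative summary only, assuming the reported numbers; the dictionary names and the helper function are hypothetical and do not come from the GReaTer codebase.

```python
# Hypothetical configuration summary of the reported GReaTer setup.
# All numbers are taken from the review table above; names are illustrative.

DATASET_SPLITS = {
    "GSM8K": {"train": 100, "dev": 100, "test": 1319},  # original test set
    "BBH":   {"train": 50,  "dev": 100, "test": 100},   # per each of the 21 selected tasks
    "FOLIO": {"train": 50,  "dev": 100, "test": 203},   # test = original validation split
}

HYPERPARAMS = {
    "T": 105,        # optimization steps
    "top_k": 10,     # top-k candidate tokens
    "q": 5,
    "lambda": 0.2,   # weighting term in Eq. 6 of the paper
}

NUM_BBH_TASKS = 21

def total_test_examples(splits: dict) -> int:
    """Sum test-set sizes across benchmarks (BBH counted once per task)."""
    return (splits["GSM8K"]["test"]
            + NUM_BBH_TASKS * splits["BBH"]["test"]
            + splits["FOLIO"]["test"])

print(total_test_examples(DATASET_SPLITS))  # 1319 + 21*100 + 203 = 3622
```

Under these assumed splits, the evaluation covers 3,622 test examples in total across the three benchmark families.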