Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
Authors: Yuda Song, Hanlin Zhang, Carson Eisenach, Sham Kakade, Dean Foster, Udaya Ghai
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments with various model families and tasks, we discover a scaling phenomenon of self-improvement: a variant of the generation-verification gap scales monotonically with the model's pre-training FLOPs. We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance. |
| Researcher Affiliation | Collaboration | Yuda Song (CMU, Amazon); Hanlin Zhang (Harvard University); Carson Eisenach (Amazon); Sham M. Kakade (Harvard University, Amazon); Dean Foster (Amazon); Udaya Ghai (Amazon) |
| Pseudocode | Yes | Algorithm 1 Iterative self-improvement with rejection sampling. |
| Open Source Code | No | The paper mentions using 'lm-evaluation-harness' and 'vLLM', which are third-party tools, but provides no statement or link for the authors' own implementation of the methodology described in the paper. |
| Open Datasets | Yes | We start with the GSM8K benchmark (Cobbe et al., 2021), with 1320 questions on the test data split, and MATH benchmark (Hendrycks et al., 2021), with 5000 questions on the test data split. We measure gap(f) on the Natural Question dataset (Kwiatkowski et al., 2019), where u(x, y) = 1 if y is one of the candidate answers to the question x, and u(x, y) = 0 otherwise. |
| Dataset Splits | Yes | We start with the GSM8K benchmark (Cobbe et al., 2021), with 1320 questions on the test data split, and MATH benchmark (Hendrycks et al., 2021), with 5000 questions on the test data split. Our analysis on a test subset of 3610 questions [...] |
| Hardware Specification | Yes | All our inferences are performed on a cluster of Nvidia A100 40GiB nodes, and our iterative self-improvement training experiments are performed on a cluster of Nvidia A100 80GiB nodes. |
| Software Dependencies | No | The paper states: 'All inference in this paper is performed with vLLM (Kwon et al., 2023).' and 'Our experiment is based on lm-evaluation-harness (Gao et al., 2024).' While specific software packages are named, no version numbers are given for these or other key components (e.g., deep-learning frameworks), which limits reproducibility. |
| Experiment Setup | Yes | For all tasks we use the following setup: for generation and verification, we use sampling parameters top-p = 0.9, temperature t = 0.7, max length of 512, and 4-shot in-context samples. For each model f and each prompt x, we sample 128 responses y ~ f(x), and sample 1 verification for each response, which defines the proxy utility score û_f(x, y). In Appendix F, Table 8 lists: minibatch size 64, learning rate 1e-6, optimizer AdamW, 2000 gradient steps, max sequence length 2048, data type bf16. |