Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
Authors: Yuda Song, Hanlin Zhang, Carson Eisenach, Sham Kakade, Dean Foster, Udaya Ghai
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments with various model families and tasks, we discover a scaling phenomenon of self-improvement: a variant of the generation-verification gap scales monotonically with the model's pre-training FLOPs. We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance. |
| Researcher Affiliation | Collaboration | Yuda Song (CMU, Amazon); Hanlin Zhang (Harvard University); Carson Eisenach (Amazon); Sham M. Kakade (Harvard University, Amazon); Dean Foster (Amazon); Udaya Ghai (Amazon) |
| Pseudocode | Yes | Algorithm 1 Iterative self-improvement with rejection sampling. |
| Open Source Code | No | The paper mentions using 'lm-evaluation-harness' and 'vLLM', which are third-party tools, but provides no statement or link for the authors' own implementation of the methodology described in the paper. |
| Open Datasets | Yes | We start with the GSM8K benchmark (Cobbe et al., 2021), with 1320 questions on the test data split, and MATH benchmark (Hendrycks et al., 2021), with 5000 questions on the test data split. We measure gap(f) on the Natural Question dataset (Kwiatkowski et al., 2019), where u(x, y) = 1 if y is one of the candidate answers to the question x, and u(x, y) = 0 otherwise. |
| Dataset Splits | Yes | We start with the GSM8K benchmark (Cobbe et al., 2021), with 1320 questions on the test data split, and MATH benchmark (Hendrycks et al., 2021), with 5000 questions on the test data split. Our analysis on a test subset of 3610 questions [...] |
| Hardware Specification | Yes | All our inferences are performed on a cluster of Nvidia A100 40GiB nodes, and our iterative self-improvement training experiments are performed on a cluster of Nvidia A100 80GiB nodes. |
| Software Dependencies | No | The paper states: 'All inference in this paper is performed with vLLM (Kwon et al., 2023).' and 'Our experiment is based on lm-evaluation-harness (Gao et al., 2024).' While specific software packages are named, no version numbers are given for these or other key components (e.g., deep-learning frameworks), which limits reproducibility. |
| Experiment Setup | Yes | For all tasks we use the following setup: for generation and verification, we use sampling parameters top-p = 0.9, temperature t = 0.7, max length of 512, and 4-shot in-context samples. For each model f and each prompt x, we sample 128 responses y ~ f(x), and sample 1 verification for each response, which defines the proxy utility score û_f(x, y). In Appendix F, Table 8 lists: minibatch size 64, learning rate 1e-6, optimizer AdamW, 2000 gradient steps, max sequence length 2048, data type bf16. |