Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling

Authors: Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Tran, Seyed Mehran Kazemi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we revisit whether this strategy is compute-optimal under a fixed inference budget (e.g., FLOPs). To do so, we investigate the trade-offs between generating synthetic data using a stronger but more expensive (SE) model versus a weaker but cheaper (WC) model. We evaluate the generated data across three key metrics: coverage, diversity, and false positive rate, and show that the data from WC models may have higher coverage and diversity, but also exhibit higher false positive rates. We then finetune LMs on data from SE and WC models in different settings: knowledge distillation, self-improvement, and a novel weak-to-strong improvement setup where a weaker LM teaches reasoning to a stronger LM. Our findings reveal that models finetuned on WC-generated data consistently outperform those trained on SE-generated data across multiple benchmarks and multiple choices of WC and SE models.
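As a rough illustration of the compute-matched comparison described above: if per-sample generation cost scales linearly with parameter count, then under a fixed FLOPs budget the WC model can produce proportionally more solutions per question than the SE model. The function name and interface below are ours, not the paper's — a minimal sketch under that linear-cost assumption:

```python
def compute_matched_samples(samples_se: int, params_se_b: float, params_wc_b: float) -> int:
    """Number of WC-model samples per question that matches the FLOPs of
    `samples_se` SE-model samples, assuming per-sample cost scales linearly
    with parameter count (in billions). Illustrative sketch, not paper code."""
    return int(samples_se * params_se_b / params_wc_b)

# Matching 1 sample from a 27B SE model with samples from a 9B WC model:
# compute_matched_samples(1, 27, 9) -> 3
```

Under this accounting, every SE sample can be traded for several WC samples, which is what drives the higher coverage and diversity of WC-generated data at a matched budget.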
Researcher Affiliation | Collaboration | Hritik Bansal (1,2), Arian Hosseini (1,3), Rishabh Agarwal (1,3), Vinh Q. Tran (1), Mehran Kazemi (1); 1 Google DeepMind, 2 UCLA, 3 Mila. Correspondence: EMAIL and EMAIL
Pseudocode | No | The paper describes methods and algorithms in paragraph text and mathematical equations (e.g., Equation 1) but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor any structured, code-like steps for procedures.
Open Source Code | No | "In this paper, we generated synthetic data either using open-weight language models (Gemma2 family and Llama), or models that are publicly available through API calls (Gemini 1.5 family). We also used publicly available datasets, MATH and GSM-8K. The data generation process is detailed in Appendix K. Additionally, we focus our finetuning experiments on open-weight Gemma models (7B, 9B, and 27B) only, with the finetuning details provided in Appendix J. Finally, the evaluation details are covered in Section 4." The paper mentions using existing 'open-weight' or 'publicly available' models, but does not state that the authors provide their own implementation code for the methodology described in the paper. There are no links to code repositories or explicit statements about code release for their specific finetuning or sampling methodology.
Open Datasets | Yes | Datasets: We mainly experiment with the MATH (Hendrycks et al., 2021) and GSM-8K (Cobbe et al., 2021) datasets, which are widely adopted in the literature.
Dataset Splits | Yes | Each dataset contains 7500 math problems in its training split. We evaluate the models on 500 problems from the MATH test split (Lightman et al., 2023) and 1319 problems from the GSM-8K test split. Further, we use 500 problems from the MATH test split and 500 problems from GSM-8K as the validation dataset.
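The split sizes quoted above can be collected into a small sanity-check table. The dict layout and helper below are ours; the counts are taken directly from the row above:

```python
# Split sizes as reported in the paper (dict layout is ours, for illustration).
SPLITS = {
    "MATH":   {"train": 7500, "test": 500,  "val": 500},
    "GSM-8K": {"train": 7500, "test": 1319, "val": 500},
}

def total_eval_problems(splits: dict) -> int:
    """Total number of test problems evaluated across all datasets."""
    return sum(d["test"] for d in splits.values())

# total_eval_problems(SPLITS) -> 1819
```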
Hardware Specification | No | The paper discusses FLOPs and parameter counts but explicitly states, regarding hardware: "Note that this may also depend on the available hardware, which we ignore in this work." No specific hardware details (GPU models, CPU types, etc.) are provided for running the experiments.
Software Dependencies | No | The paper mentions using "open-weight language models (Gemma2 family and Llama), or models that are publicly available through API calls (Gemini 1.5 family)". However, it does not specify any particular software libraries or tools with their version numbers (e.g., Python, PyTorch, TensorFlow versions) that were used to implement the methodology.
Experiment Setup | Yes | We generated the candidate solutions in the synthetic dataset using a Top-K (K = 3) strategy with a temperature of 0.7. We finetuned the Gemma2-9B and Gemma2-27B models with a batch size of 32 for 600 and 6000 steps under the low and high sampling budgets, respectively. In addition, we trained the Gemma1-7B model with a batch size of 8 for 2400 and 24000 steps under the low and high sampling budgets, respectively. We perform a hyperparameter search over the learning rates {1e-7, 5e-7, 1e-6} based on the model performance on the validation datasets.
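The decoding settings in this row (Top-K with K = 3, temperature 0.7) can be sketched from scratch; this is an illustrative implementation of top-K sampling over raw logits, not the authors' code, with the learning-rate grid copied from the row above:

```python
import numpy as np

def sample_top_k(logits, k=3, temperature=0.7, rng=None):
    """Sample one token index using Top-K sampling at the given temperature:
    keep the K highest logits, apply a temperature-scaled softmax over them,
    and draw from the resulting distribution. Illustrative sketch only."""
    rng = rng or np.random.default_rng(0)
    logits = np.asarray(logits, dtype=float) / temperature
    top = np.argsort(logits)[-k:]                  # indices of the K largest logits
    probs = np.exp(logits[top] - logits[top].max())  # stable softmax over the top K
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

# Learning-rate grid searched on the validation sets, per the row above.
LEARNING_RATES = [1e-7, 5e-7, 1e-6]
```

With K = 3, the sampled index is always one of the three highest-logit tokens, so low-probability continuations are pruned while temperature 0.7 still allows some diversity among the survivors.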