Training Language Models to Self-Correct via Reinforcement Learning

Authors: Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, JD Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. ABSTRACT: "Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. ... With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval." 6 EXPERIMENTAL EVALUATION: "The goal of our experiments is to demonstrate the efficacy and justify the design of SCoRe in training LLMs how to self-correct by only training on their own data. To this end, we perform a comparative evaluation of SCoRe against prior methods that also use self-generated data to train for self-correction, and run several ablation studies on two representative reasoning tasks where error correction is crucial."
Researcher Affiliation: Collaboration. Aviral Kumar+, Vincent Zhuang+, Rishabh Agarwal, Yi Su, JD Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust. Google DeepMind. Corresponding authors: [vincentzhuang, rishabhagarwal, yisumtv]@google.com
Pseudocode: No. The paper describes the SCoRe method in sections 3, 4, and 5 using prose. While it includes flowcharts (Figures 6 and 11) to illustrate the approach, there are no explicitly labeled pseudocode or algorithm blocks with structured steps in a code-like format.
Open Source Code: No. REPRODUCIBILITY STATEMENT: "While we cannot release our fine-tuned models, we hope our detailed descriptions should help the research community replicate our findings."
Open Datasets: Yes. REPRODUCIBILITY STATEMENT: "Our training and evaluations are performed on open-source benchmarks: MATH (Hendrycks et al., 2021), MBPP (Austin et al., 2021), and HumanEval (Chen et al., 2021), with all specific prompts used in Appendix C. We have also added results with the open Gemma 2 model in Appendix A.1 as well to facilitate reproducibility."
Dataset Splits: Yes. Tasks: "We mainly focus on reasoning problems in math and coding: (a) math problem solving on MATH (Hendrycks et al., 2021), and (b) code generation on MBPP (Austin et al., 2021) and HumanEval (Chen et al., 2021). We use the following train-test splits in our experiments: (1) MATH: following Lightman et al. (2023), we augment the MATH training set with 4500 problems from the test set, and report results on the remaining 500 problems (MATH500); and (2) Code generation: we train on MBPP and report results on HumanEval, which does not expose test cases to the model."
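The MATH500 split described above can be reconstructed mechanically. The sketch below is illustrative only (the loading of actual MATH problems, the random seed, and the function name are assumptions, not from the paper); it just shows the arithmetic of the split: 4,500 of the 5,000 MATH test problems move into the training pool, and the remaining 500 become the evaluation set.

```python
# Illustrative sketch of the Lightman et al. (2023)-style MATH500 split.
# Problem IDs stand in for real MATH problems; seed and names are assumptions.
import random

def make_math500_split(train_problems, test_problems, n_augment=4500, seed=0):
    """Move n_augment problems from the original MATH test set into the
    training pool; the remaining problems form the MATH500 eval set."""
    rng = random.Random(seed)
    shuffled = list(test_problems)
    rng.shuffle(shuffled)
    augmented_train = list(train_problems) + shuffled[:n_augment]
    math500_eval = shuffled[n_augment:]
    return augmented_train, math500_eval

# MATH ships with 7,500 training and 5,000 test problems.
train = [f"train-{i}" for i in range(7500)]
test = [f"test-{i}" for i in range(5000)]
aug_train, math500 = make_math500_split(train, test)
print(len(aug_train), len(math500))  # 12000 500
```

Note that the augmented training set and the MATH500 evaluation set remain disjoint, so no evaluation problem is seen during training.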
Hardware Specification: No. The paper does not provide specific hardware details such as GPU models, CPU types, or other hardware specifications used for running the experiments. It mentions using Gemini 1.0 Pro and 1.5 Flash models and Gemma v2 models but does not specify the underlying hardware.
Software Dependencies: No. REPRODUCIBILITY STATEMENT: "Our RL algorithms and infrastructure simply extends the methodology of Ahmadian et al. (2024) to multi-turn settings with relatively simple modifications."
Experiment Setup: Yes. B ADDITIONAL EXPERIMENT DETAILS, Table 8: Hyperparameters for SCoRe on MATH (left) and MBPP (right):

Hyperparameter       | Value
Base model           | Gemini 1.5 Flash
Optimizer            | Adam
Learning rate        | 5e-6
Training steps       | 3000
Batch size           | 512
Sampling temperature | 1.0
α                    | 10
β1                   | 0.01
β2                   | 0.1