Training Language Models to Self-Correct via Reinforcement Learning

Authors: Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, JD Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. ABSTRACT: "Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. ... With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval." 6 EXPERIMENTAL EVALUATION: "The goal of our experiments is to demonstrate the efficacy and justify the design of SCoRe in training LLMs how to self-correct by only training on their own data. To this end, we perform a comparative evaluation of SCoRe against prior methods that also use self-generated data to train for self-correction, and run several ablation studies on two representative reasoning tasks where error correction is crucial."
Researcher Affiliation: Collaboration. Aviral Kumar+, Vincent Zhuang+, Rishabh Agarwal, Yi Su, JD Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust. Google DeepMind. Corresponding authors: [vincentzhuang, rishabhagarwal, yisumtv]@google.com
Pseudocode: No. The paper describes the SCoRe method in sections 3, 4, and 5 using prose. While it includes flowcharts (Figures 6 and 11) to illustrate the approach, there are no explicitly labeled pseudocode or algorithm blocks with structured steps in a code-like format.
Open Source Code: No. REPRODUCIBILITY STATEMENT: "While we cannot release our fine-tuned models, we hope our detailed descriptions should help the research community replicate our findings."
Open Datasets: Yes. REPRODUCIBILITY STATEMENT: "Our training and evaluations are performed on open-source benchmarks: MATH (Hendrycks et al., 2021), MBPP (Austin et al., 2021), and HumanEval (Chen et al., 2021), with all specific prompts used in Appendix C. We have also added results with the open Gemma 2 model in Appendix A.1 as well to facilitate reproducibility."
Dataset Splits: Yes. Tasks: "We mainly focus on reasoning problems in math and coding: (a) math problem solving on MATH (Hendrycks et al., 2021), and (b) code generation on MBPP (Austin et al., 2021) and HumanEval (Chen et al., 2021). We use the following train-test splits in our experiments: (1) MATH: following Lightman et al. (2023), we augment the MATH training set with 4500 problems from the test set, and report results on the remaining 500 problems (MATH500); and (2) Code generation: we train on MBPP and report results on HumanEval, which does not expose test cases to the model."
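The MATH500 split described above can be reconstructed mechanically. The sketch below is illustrative only (the loading of actual MATH problems, the random seed, and the function name are assumptions, not from the paper); it just shows the arithmetic of the split: 4,500 of the 5,000 MATH test problems move into the training pool, and the remaining 500 become the evaluation set.

```python
# Illustrative sketch of the Lightman et al. (2023)-style MATH500 split.
# Problem IDs stand in for real MATH problems; seed and names are assumptions.
import random

def make_math500_split(train_problems, test_problems, n_augment=4500, seed=0):
    """Move n_augment problems from the original MATH test set into the
    training pool; the remaining problems form the MATH500 eval set."""
    rng = random.Random(seed)
    shuffled = list(test_problems)
    rng.shuffle(shuffled)
    augmented_train = list(train_problems) + shuffled[:n_augment]
    math500_eval = shuffled[n_augment:]
    return augmented_train, math500_eval

# MATH ships with 7,500 training and 5,000 test problems.
train = [f"train-{i}" for i in range(7500)]
test = [f"test-{i}" for i in range(5000)]
aug_train, math500 = make_math500_split(train, test)
print(len(aug_train), len(math500))  # 12000 500
```

Note that the augmented training set and the MATH500 evaluation set remain disjoint, so no evaluation problem is seen during training.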
Hardware Specification: No. The paper does not provide specific hardware details such as GPU models, CPU types, or other hardware specifications used for running the experiments. It mentions using Gemini 1.0 Pro and 1.5 Flash models and Gemma v2 models but does not specify the underlying hardware.
Software Dependencies: No. REPRODUCIBILITY STATEMENT: "Our RL algorithms and infrastructure simply extends the methodology of Ahmadian et al. (2024) to multi-turn settings with relatively simple modifications."
Experiment Setup: Yes. B ADDITIONAL EXPERIMENT DETAILS, Table 8: Hyperparameters for SCoRe on MATH (left) and MBPP (right):

Hyperparameter       | Value
Base model           | Gemini 1.5 Flash
Optimizer            | Adam
Learning rate        | 5e-6
Training steps       | 3000
Batch size           | 512
Sampling temperature | 1.0
α                    | 10
β1                   | 0.01
β2                   | 0.1