Reinforce LLM Reasoning through Multi-Agent Reflection
Authors: Yurun Yuan, Tengyang Xie
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we instantiate DPSDP with various base models and show improvements on both in- and out-of-distribution benchmarks. For example, on the MATH 500 benchmark, majority voting over five refinement steps increases first-turn accuracy from 58.2% to 63.2% with Ministral-based models. An ablation study further confirms the benefits of multi-agent collaboration and out-of-distribution generalization. |
| Researcher Affiliation | Academia | 1Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA. Correspondence to: Yurun Yuan <yurun EMAIL>, Tengyang Xie <EMAIL>. |
| Pseudocode | Yes | We formally outline the DPSDP algorithm in Algorithm 1. The practical implementation of DPSDP is presented in Algorithm 2 and illustrated in Figure 2. We provide the pseudocode for the original PSDP algorithm described in Section 2. Algorithm 3 PSDP |
| Open Source Code | No | We adapted code from TRL (von Werra et al., 2020) for both SFT and DPO training. The non-generative critic was trained using the Hugging Face Transformers framework (Wolf et al., 2020) with a customized loss function. For inference, we utilized the vLLM offline engine (Kwon et al., 2023) and adapted scripts from Xiong et al. (2024). Evaluation code was adapted from Yang et al. (2024a); Grattafiori et al. (2024) to compare LLM-generated answers with ground-truth solutions. |
| Open Datasets | Yes | We focus on mathematical problem-solving tasks, evaluating our approach on the MATH 500 (Hendrycks et al., 2021b) and GSM8K (Cobbe et al., 2021) benchmarks. We use problems from OpenMathInstruct-2 (Toshniwal et al., 2025) for training... To assess the generalizability of our models to out-of-distribution problems, we evaluate on two additional benchmarks: MMLU-Pro Math (Wang et al., 2024b) and Olympiad Bench (He et al., 2024). |
| Dataset Splits | Yes | Following Lightman et al. (2023), we augment the MATH training set with 4500 problems from the test set and report results on the remaining 500 problems, denoted as MATH 500. We use problems from OpenMathInstruct-2 (Toshniwal et al., 2025) for training, which are sourced or augmented from MATH and GSM8K, the same datasets used for benchmarking. |
| Hardware Specification | Yes | Base models were trained on the SFT dataset for 1 epoch, using gradient accumulation steps of 64 and a per-device train batch size of 1 on 4 H100 80GB GPUs. |
| Software Dependencies | No | We adapted code from TRL (von Werra et al., 2020) for both SFT and DPO training. The non-generative critic was trained using the Hugging Face Transformers framework (Wolf et al., 2020) with a customized loss function. For inference, we utilized the vLLM offline engine (Kwon et al., 2023) and adapted scripts from Xiong et al. (2024). |
| Experiment Setup | Yes | For supervised fine-tuning (SFT), we experimented with learning rates of 1e-6, 5e-6, and 1e-5, selecting 1e-6 for Llama-based models and 5e-6 for Ministral- and Qwen-based models. Base models were trained on the SFT dataset for 1 epoch, using gradient accumulation steps of 64 and a per-device train batch size of 1 on 4 H100 80GB GPUs. For direct preference optimization (DPO), we tested learning rates of 2e-7 and 4e-7, choosing 2e-7 for Ministral-based actor and critic, Llama-based actor, and Qwen-based actor, and 4e-7 for Llama- and Qwen-based critics. For the KL coefficient β, we evaluated values of 0.1, 0.5, and 1.0, selecting 0.1 for all actor model training, 1.0 for Ministral-8B-Instruct-based critic (trained for 1 epoch), and 0.1 for Llama-3.1-8B-Instruct-based critic (trained for 2 epochs) and Qwen2.5-3B-based critic (trained for 3 epochs). |
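The headline result above aggregates answers across refinement steps by majority vote. A minimal sketch of that aggregation, using only the Python standard library: the function name and the tie-breaking rule (favoring later refinement steps) are assumptions for illustration, since the paper excerpt does not specify how ties are resolved.

```python
from collections import Counter

def majority_vote(step_answers):
    """Return the most common final answer across refinement steps.

    step_answers: one extracted answer per refinement step
    (e.g., five steps, as in the MATH 500 evaluation quoted above).
    Ties are broken in favor of the latest step -- a hypothetical
    choice, not one stated in the paper.
    """
    counts = Counter(step_answers)
    best = max(counts.values())
    # Scan from the last step backwards so ties favor later refinements.
    for ans in reversed(step_answers):
        if counts[ans] == best:
            return ans

# Example: five refinement steps for one problem.
print(majority_vote(["42", "43", "42", "42", "43"]))  # prints "42"
```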
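The Experiment Setup row packs the selected hyperparameters into dense prose. For reference, they can be collected into a single structure; the dictionary keys below are illustrative labels, not field names from the authors' (unreleased) training code.

```python
# Hyperparameters as reported in the Experiment Setup row,
# grouped by training stage and base-model family.
HPARAMS = {
    "sft": {
        "epochs": 1,
        "gradient_accumulation_steps": 64,
        "per_device_train_batch_size": 1,
        "learning_rate": {"llama": 1e-6, "ministral": 5e-6, "qwen": 5e-6},
    },
    "dpo": {
        "learning_rate": {
            "actor": {"ministral": 2e-7, "llama": 2e-7, "qwen": 2e-7},
            "critic": {"ministral": 2e-7, "llama": 4e-7, "qwen": 4e-7},
        },
        # KL coefficient beta: 0.1 for all actors; per-family for critics.
        "beta": {
            "actor": 0.1,
            "critic": {"ministral": 1.0, "llama": 0.1, "qwen": 0.1},
        },
        "critic_epochs": {"ministral": 1, "llama": 2, "qwen": 3},
    },
}

print(HPARAMS["dpo"]["beta"]["critic"]["ministral"])  # prints 1.0
```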