Reinforce LLM Reasoning through Multi-Agent Reflection
Authors: Yurun Yuan, Tengyang Xie
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we instantiate DPSDP with various base models and show improvements on both in- and out-of-distribution benchmarks. For example, on the MATH 500 benchmark, majority voting over five refinement steps increases first-turn accuracy from 58.2% to 63.2% with Ministral-based models. An ablation study further confirms the benefits of multi-agent collaboration and out-of-distribution generalization. |
| Researcher Affiliation | Academia | 1Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA. Correspondence to: Yurun Yuan <yurun EMAIL>, Tengyang Xie <EMAIL>. |
| Pseudocode | Yes | We formally outline the DPSDP algorithm in Algorithm 1. The practical implementation of DPSDP is presented in Algorithm 2 and illustrated in Figure 2. We provide the pseudocode for the original PSDP algorithm described in Section 2. Algorithm 3 PSDP |
| Open Source Code | No | We adapted code from TRL (von Werra et al., 2020) for both SFT and DPO training. The non-generative critic was trained using the Hugging Face Transformers framework (Wolf et al., 2020) with a customized loss function. For inference, we utilized the vLLM offline engine (Kwon et al., 2023) and adapted scripts from Xiong et al. (2024). Evaluation code was adapted from Yang et al. (2024a); Grattafiori et al. (2024) to compare LLM-generated answers with ground-truth solutions. |
| Open Datasets | Yes | We focus on mathematical problem-solving tasks, evaluating our approach on the MATH 500 (Hendrycks et al., 2021b) and GSM8K (Cobbe et al., 2021) benchmarks. We use problems from OpenMathInstruct-2 (Toshniwal et al., 2025) for training... To assess the generalizability of our models to out-of-distribution problems, we evaluate on two additional benchmarks: MMLU-Pro Math (Wang et al., 2024b) and Olympiad Bench (He et al., 2024). |
| Dataset Splits | Yes | Following Lightman et al. (2023), we augment the MATH training set with 4500 problems from the test set and report results on the remaining 500 problems, denoted as MATH 500. We use problems from OpenMathInstruct-2 (Toshniwal et al., 2025) for training, which are sourced or augmented from MATH and GSM8K, the same datasets used for benchmarking. |
| Hardware Specification | Yes | Base models were trained on the SFT dataset for 1 epoch, using gradient accumulation steps of 64 and a per-device train batch size of 1 on 4 H100 80GB GPUs. |
| Software Dependencies | No | We adapted code from TRL (von Werra et al., 2020) for both SFT and DPO training. The non-generative critic was trained using the Hugging Face Transformers framework (Wolf et al., 2020) with a customized loss function. For inference, we utilized the vLLM offline engine (Kwon et al., 2023) and adapted scripts from Xiong et al. (2024). |
| Experiment Setup | Yes | For supervised fine-tuning (SFT), we experimented with learning rates of 1e-6, 5e-6, and 1e-5, selecting 1e-6 for Llama-based models and 5e-6 for Ministral- and Qwen-based models. Base models were trained on the SFT dataset for 1 epoch, using gradient accumulation steps of 64 and a per-device train batch size of 1 on 4 H100 80GB GPUs. For direct preference optimization (DPO), we tested learning rates of 2e-7 and 4e-7, choosing 2e-7 for Ministral-based actor and critic, Llama-based actor, and Qwen-based actor, and 4e-7 for Llama- and Qwen-based critics. For the KL coefficient β, we evaluated values of 0.1, 0.5, and 1.0, selecting 0.1 for all actor model training, 1.0 for Ministral-8B-Instruct-based critic (trained for 1 epoch), and 0.1 for Llama-3.1-8B-Instruct-based critic (trained for 2 epochs) and Qwen2.5-3B-based critic (trained for 3 epochs). |
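The headline result above aggregates answers across refinement steps by majority vote. A minimal sketch of that aggregation, using only the Python standard library: the function name and the tie-breaking rule (favoring later refinement steps) are assumptions for illustration, since the paper excerpt does not specify how ties are resolved.

```python
from collections import Counter

def majority_vote(step_answers):
    """Return the most common final answer across refinement steps.

    step_answers: one extracted answer per refinement step
    (e.g., five steps, as in the MATH 500 evaluation quoted above).
    Ties are broken in favor of the latest step -- a hypothetical
    choice, not one stated in the paper.
    """
    counts = Counter(step_answers)
    best = max(counts.values())
    # Scan from the last step backwards so ties favor later refinements.
    for ans in reversed(step_answers):
        if counts[ans] == best:
            return ans

# Example: five refinement steps for one problem.
print(majority_vote(["42", "43", "42", "42", "43"]))  # prints "42"
```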
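The Experiment Setup row packs the selected hyperparameters into dense prose. For reference, they can be collected into a single structure; the dictionary keys below are illustrative labels, not field names from the authors' (unreleased) training code.

```python
# Hyperparameters as reported in the Experiment Setup row,
# grouped by training stage and base-model family.
HPARAMS = {
    "sft": {
        "epochs": 1,
        "gradient_accumulation_steps": 64,
        "per_device_train_batch_size": 1,
        "learning_rate": {"llama": 1e-6, "ministral": 5e-6, "qwen": 5e-6},
    },
    "dpo": {
        "learning_rate": {
            "actor": {"ministral": 2e-7, "llama": 2e-7, "qwen": 2e-7},
            "critic": {"ministral": 2e-7, "llama": 4e-7, "qwen": 4e-7},
        },
        # KL coefficient beta: 0.1 for all actors; per-family for critics.
        "beta": {
            "actor": 0.1,
            "critic": {"ministral": 1.0, "llama": 0.1, "qwen": 0.1},
        },
        "critic_epochs": {"ministral": 1, "llama": 2, "qwen": 3},
    },
}

print(HPARAMS["dpo"]["beta"]["critic"]["ministral"])  # prints 1.0
```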