Improving Rationality in the Reasoning Process of Language Models through Self-playing Game

Authors: Pinzheng Wang, Juntao Li, Zecheng Tang, Haijia Gui, Min Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on tasks involving mathematical reasoning, stepwise error detection, self-correction, and long-chain reasoning demonstrate that CDG training can significantly improve the ability of well-aligned LLMs to comprehend their reasoning process. To evaluate the effectiveness of CDG, we conduct experiments on four mathematics-related reasoning tasks using the fully fine-tuned LLaMA3.1-8B-Instruct model.
Researcher Affiliation | Academia | 1 School of Computer Science and Technology, Soochow University; 2 Key Laboratory of Data Intelligence and Advanced Computing, Soochow University. Correspondence to: Juntao Li <EMAIL>.
Pseudocode | Yes | Algorithm 1: Data collection of Critic-Discernment Game. Algorithm 2: Self-play of Critic-Discernment Game.
Open Source Code | Yes | We have released our code here.
Open Datasets | Yes | We focus on the field of mathematical reasoning using two widely used datasets: GSM8K and MATH500 (Cobbe et al., 2021a;b).
Dataset Splits | Yes | The training sets contain 7,473 and 12,000 samples, respectively, while the test sets consist of 1,319 and 500 samples. The final dataset contains 200 positive samples and 200 negative samples.
Hardware Specification | No | No specific hardware details (such as GPU models, CPU types, or memory amounts) are provided in the paper.
Software Dependencies | No | The paper only mentions the SymPy grader (Meurer et al., 2017), with no specific version for it or any other software.
Experiment Setup | Yes | In our experiments, the values of τπ, τρ, and τµ are set to 0.5, 0.75, and 0.5, respectively. In the first training loop, we use a learning rate of 5e-6 and a batch size of 32 to facilitate rapid convergence. In the second loop, as the dataset size increases, we adjust the learning rate to 1e-6 and the batch size to 256. The prover and misleading critic are trained for one epoch, while the helpful critic is trained for two epochs. For DPO, we use a learning rate of 1e-6, a batch size of 64, and set β = 0.5. For PPO, we set the learning rate to 5e-6 and the batch size to 128. To control the deviation from the base reference policy, we set β = 0.2.
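The DPO setting above (β = 0.5) plugs into the standard DPO objective, -log σ(β · margin), where the margin compares policy-vs-reference log-ratios of the chosen and rejected responses. A minimal sketch of that per-pair loss is below; the function name and its log-probability arguments are illustrative, not from the paper's released code.

```python
import math


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.5) -> float:
    """Direct Preference Optimization loss for one preference pair.

    beta=0.5 matches the DPO setting quoted in the experiment setup.
    Arguments are (hypothetical) summed token log-likelihoods of the
    chosen (w) and rejected (l) responses under the policy and the
    frozen reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_logp_w - ref_logp_w
    rejected_ratio = policy_logp_l - ref_logp_l
    # -log sigmoid(beta * margin): small when the policy prefers the
    # chosen response more strongly than the reference does.
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

As a sanity check, the loss is log 2 when policy and reference agree exactly, and it drops as the policy shifts probability mass toward the chosen response relative to the reference.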