Improving Rationality in the Reasoning Process of Language Models through Self-playing Game

Authors: Pinzheng Wang, Juntao Li, Zecheng Tang, Haijia Gui, Min Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on tasks involving mathematical reasoning, stepwise error detection, self-correction, and long-chain reasoning demonstrate that CDG training can significantly improve the ability of well-aligned LLMs to comprehend their reasoning process. To evaluate the effectiveness of CDG, we conduct experiments on four mathematics-related reasoning tasks using the fully fine-tuned LLaMA3.1-8B-Instruct model.
Researcher Affiliation | Academia | 1 School of Computer Science and Technology, Soochow University; 2 Key Laboratory of Data Intelligence and Advanced Computing, Soochow University. Correspondence to: Juntao Li <EMAIL>.
Pseudocode | Yes | Algorithm 1: Data collection of Critic-Discernment Game. Algorithm 2: Self-play of Critic-Discernment Game.
Open Source Code | Yes | We have released our code here.
Open Datasets | Yes | We focus on the field of mathematical reasoning using two widely used datasets: GSM8K and MATH500 (Cobbe et al., 2021a;b).
Dataset Splits | Yes | The training sets contain 7,473 and 12,000 samples, respectively, while the test sets consist of 1,319 and 500 samples. The final dataset contains 200 positive samples and 200 negative samples.
Hardware Specification | No | No specific hardware details (such as GPU models, CPU types, or memory amounts) are provided in the paper.
Software Dependencies | No | The paper only mentions the SymPy grader (Meurer et al., 2017), with no specific version for it or any other software.
Experiment Setup | Yes | In our experiments, the values of τπ, τρ, and τµ are set to 0.5, 0.75, and 0.5, respectively. In the first training loop, we use a learning rate of 5e-6 and a batch size of 32 to facilitate rapid convergence. In the second loop, as the dataset size increases, we adjust the learning rate to 1e-6 and the batch size to 256. The prover and misleading critic are trained for one epoch, while the helpful critic is trained for two epochs. For DPO, we use a learning rate of 1e-6, a batch size of 64, and set β = 0.5. For PPO, we set the learning rate to 5e-6 and the batch size to 128. To control the deviation from the base reference policy, we set β = 0.2.
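The DPO setting above (β = 0.5) plugs into the standard DPO objective, -log σ(β · margin), where the margin compares policy-vs-reference log-ratios of the chosen and rejected responses. A minimal sketch of that per-pair loss is below; the function name and its log-probability arguments are illustrative, not from the paper's released code.

```python
import math


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.5) -> float:
    """Direct Preference Optimization loss for one preference pair.

    beta=0.5 matches the DPO setting quoted in the experiment setup.
    Arguments are (hypothetical) summed token log-likelihoods of the
    chosen (w) and rejected (l) responses under the policy and the
    frozen reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_logp_w - ref_logp_w
    rejected_ratio = policy_logp_l - ref_logp_l
    # -log sigmoid(beta * margin): small when the policy prefers the
    # chosen response more strongly than the reference does.
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

As a sanity check, the loss is log 2 when policy and reference agree exactly, and it drops as the policy shifts probability mass toward the chosen response relative to the reference.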