Improving Rationality in the Reasoning Process of Language Models through Self-playing Game
Authors: Pinzheng Wang, Juntao Li, Zecheng Tang, Haijia Gui, Min Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on tasks involving mathematical reasoning, stepwise error detection, self-correction, and long-chain reasoning demonstrate that CDG training can significantly improve the ability of well-aligned LLMs to comprehend their reasoning process. To evaluate the effectiveness of CDG, we conduct experiments on four mathematics-related reasoning tasks using the fully fine-tuned LLaMA3.1-8B-Instruct model. |
| Researcher Affiliation | Academia | 1School of Computer Science and Technology, Soochow University 2Key Laboratory of Data Intelligence and Advanced Computing, Soochow University. Correspondence to: Juntao Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Data collection of Critic-Discernment Game. Algorithm 2 Self-play of Critic-Discernment Game. |
| Open Source Code | Yes | We have released our code here. |
| Open Datasets | Yes | We focus on the field of mathematical reasoning using two widely used datasets: GSM8K and MATH500 (Cobbe et al., 2021a;b). |
| Dataset Splits | Yes | The training sets contain 7,473 and 12,000 samples, respectively, while the test sets consist of 1,319 and 500 samples. The final dataset contains 200 positive samples and 200 negative samples. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory amounts) are provided in the paper. |
| Software Dependencies | No | The paper mentions only the SymPy grader (Meurer et al., 2017), with no specific version for it or any other software. |
| Experiment Setup | Yes | In our experiments, the values of τπ, τρ, and τµ are set to 0.5, 0.75, and 0.5, respectively. In the first training loop, we use a learning rate of 5e-6 and a batch size of 32 to facilitate rapid convergence. In the second loop, as the dataset size increases, we adjust the learning rate to 1e-6 and the batch size to 256. The prover and misleading critic are trained for one epoch, while the helpful critic is trained for two epochs. For DPO, we use a learning rate of 1e-6, a batch size of 64, and set β = 0.5. For PPO, we set the learning rate to 5e-6 and the batch size to 128. To control the deviation from the base reference policy, we set β = 0.2. |
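
The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration for quick reference. The sketch below is a hypothetical consolidation — the key names and `lookup` helper are illustrative assumptions, not part of the authors' released code — but every numeric value is taken directly from the excerpt above.

```python
# Hypothetical config collecting the hyperparameters reported in the paper's
# Experiment Setup excerpt; structure and names are illustrative only.
CDG_CONFIG = {
    # Thresholds tau_pi, tau_rho, tau_mu from the paper
    "thresholds": {"tau_pi": 0.5, "tau_rho": 0.75, "tau_mu": 0.5},
    # First training loop: fast convergence
    "loop1": {"learning_rate": 5e-6, "batch_size": 32},
    # Second loop: larger dataset, smaller LR, bigger batch
    "loop2": {"learning_rate": 1e-6, "batch_size": 256},
    # Epoch counts per role
    "epochs": {"prover": 1, "misleading_critic": 1, "helpful_critic": 2},
    # Preference-optimization baselines
    "dpo": {"learning_rate": 1e-6, "batch_size": 64, "beta": 0.5},
    "ppo": {"learning_rate": 5e-6, "batch_size": 128, "beta": 0.2},
}

def lookup(stage: str, key: str) -> float:
    """Return one reported hyperparameter value for a training stage."""
    return CDG_CONFIG[stage][key]

print(lookup("dpo", "beta"))   # -> 0.5
print(lookup("loop2", "batch_size"))  # -> 256
```

Keeping the thresholds and per-stage optimizer settings in one structure like this makes it easy to check a reproduction attempt against the paper's reported values.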