What Are Step-Level Reward Models Rewarding? Counterintuitive Findings from MCTS-Boosted Mathematical Reasoning
Authors: Yiran Ma, Zui Chen, Tianqiao Liu, Mi Tian, Zhuo Liu, Zitao Liu, Weiqi Luo
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Main Results: After collecting all the step-level preference pairs through MCTS, datasets are constructed for FC-SRM, MO-SRM, SSMO-SRM, and NT-SRM training by selecting the corresponding components in each piece of data. The training curves are shown in Figure 3. These SRMs are subsequently used as scoring functions in greedy search; the accuracy and absolute gains over the baseline are reported in Table 1. |
| Researcher Affiliation | Collaboration | 1Zhejiang University, Hangzhou, China 2ShanghaiTech University, Shanghai, China 3TAL Education Group, Beijing, China 4University of Rochester, New York, USA 5Jinan University, Guangzhou, China EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Beam Search Algorithm |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available. |
| Open Datasets | Yes | To construct step-level preference pairs through MCTS, we use the math problems and their corresponding final answers from the training data of GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021). |
| Dataset Splits | Yes | To construct step-level preference pairs through MCTS, we use the math problems and their corresponding final answers from the training data of GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021). The accuracies are evaluated on the test data. |
| Hardware Specification | Yes | Each SRM is trained on two instances, with each instance equipped with 8 A800 GPUs. |
| Software Dependencies | No | The paper mentions specific LLM models (Llama-3-8B-Instruct, DeepSeek-Math-7B-Base, Qwen2-7B) but does not provide version numbers for software dependencies such as programming languages or libraries (e.g., Python, PyTorch, or CUDA versions). |
| Experiment Setup | Yes | The MCTS requires the agent to sample n = 6 candidate actions at each expansion phase and to iterate 500 times on each problem to evaluate the quality of each node. Notably, to avoid the influence of variations in answer format, a supervised fine-tuned (SFT) model based on DeepSeek-Math-7B-Base is used to assert the correctness of the solution after each rollout during the search. This model is also used in the evaluation pipeline. To strengthen the preferences, only preference pairs whose value difference is greater than 0.7 are assumed valid. For detailed hyperparameters, see the Appendix. |
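The experiment setup above describes two reproducible mechanics: filtering MCTS sibling steps into preference pairs by a value-difference threshold (0.7), and using a trained SRM as the scoring function in greedy step-by-step search. A minimal sketch of both, assuming hypothetical helper names (`filter_preference_pairs`, `greedy_search`, `candidates_fn`, `score_fn` are illustrative, not from the paper):

```python
# Value-difference threshold reported in the paper's setup: only sibling
# pairs whose MCTS values differ by more than 0.7 become training preferences.
VALUE_MARGIN = 0.7

def filter_preference_pairs(siblings):
    """Build step-level preference pairs from sibling MCTS nodes.

    `siblings` is a list of (step_text, mcts_value) tuples expanded from the
    same parent state. Returns (chosen, rejected) pairs whose value gap
    exceeds VALUE_MARGIN; all other pairs are discarded as weak preferences.
    """
    pairs = []
    for i, (step_a, v_a) in enumerate(siblings):
        for step_b, v_b in siblings[i + 1:]:
            if abs(v_a - v_b) > VALUE_MARGIN:
                chosen, rejected = (step_a, step_b) if v_a > v_b else (step_b, step_a)
                pairs.append((chosen, rejected))
    return pairs

def greedy_search(candidates_fn, score_fn, steps):
    """Greedy decoding with an SRM as the scoring function: at each step,
    sample candidate next steps (e.g. n = 6 from the policy LLM) and keep
    only the highest-scoring one."""
    solution = []
    for _ in range(steps):
        candidates = candidates_fn(solution)  # stand-in for LLM step sampling
        solution.append(max(candidates, key=score_fn))
    return solution
```

In this sketch `score_fn` stands in for any of the trained SRM variants (FC-SRM, MO-SRM, SSMO-SRM, NT-SRM); the paper's beam search (Algorithm 1) generalizes the same loop by keeping the top-k partial solutions instead of one.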