What Are Step-Level Reward Models Rewarding? Counterintuitive Findings from MCTS-Boosted Mathematical Reasoning

Authors: Yiran Ma, Zui Chen, Tianqiao Liu, Mi Tian, Zhuo Liu, Zitao Liu, Weiqi Luo

AAAI 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Main Results: "After collecting all the step-level preference pairs through MCTS, datasets are constructed for FC-SRM, MO-SRM, SSMO-SRM, and NT-SRM training by selecting the corresponding components in each piece of data. The training curves are shown in Figure 3. These SRMs are subsequently used as scoring functions in greedy search; the accuracy and absolute gains over the baseline are reported in Table 1." |
| Researcher Affiliation | Collaboration | 1. Zhejiang University, Hangzhou, China; 2. ShanghaiTech University, Shanghai, China; 3. TAL Education Group, Beijing, China; 4. University of Rochester, New York, USA; 5. Jinan University, Guangzhou, China. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Beam Search Algorithm |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that source code for the described methodology is publicly available. |
| Open Datasets | Yes | "To construct step-level preference pairs through MCTS, we use the math problems and their corresponding final answers from the training data of GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021)." |
| Dataset Splits | Yes | "To construct step-level preference pairs through MCTS, we use the math problems and their corresponding final answers from the training data of GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021). The accuracies are evaluated on the test data." |
| Hardware Specification | Yes | "Each SRM is trained on two instances, with each instance equipped with 8 A800 GPUs." |
| Software Dependencies | No | The paper names specific LLMs (Llama-3-8B-Instruct, DeepSeek-Math-7B-Base, Qwen2-7B) but does not provide version numbers for software dependencies such as programming languages or libraries (e.g., Python, PyTorch, or CUDA versions). |
| Experiment Setup | Yes | "The MCTS requires the agent sampling n = 6 candidate actions at each expansion phase and iterates 500 times on each problem to evaluate the quality of each node. Notably, to avoid the influence of the variation of answer format, we use a supervised fine-tuned (SFT) model based on DeepSeek-Math-7B-Base to assert the correctness of the solution after each rollout during the search. This model is also used in our evaluation pipeline. To strengthen the preferences, only the preference pairs whose difference of value is greater than 0.7 are assumed valid. For detailed hyperparameters, see Appendix." |
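The Experiment Setup evidence describes two concrete mechanisms: keeping only step-level preference pairs whose MCTS value difference exceeds 0.7, and using a trained SRM as the scoring function in greedy search. The sketch below illustrates both under stated assumptions; the names `Candidate`, `make_pairs`, and `greedy_step` are illustrative and do not appear in the paper, and the SRM is stood in for by an arbitrary scoring callable.

```python
# Illustrative sketch (not the authors' code): margin-filtered step-level
# preference pairs from MCTS value estimates, plus SRM-scored greedy search.
from dataclasses import dataclass
from itertools import combinations

MARGIN = 0.7  # paper: only pairs with value difference > 0.7 are assumed valid


@dataclass
class Candidate:
    step_text: str   # one candidate reasoning step
    value: float     # MCTS value estimate for the corresponding node


def make_pairs(candidates):
    """Return (chosen, rejected) pairs whose value gap exceeds MARGIN."""
    pairs = []
    for a, b in combinations(candidates, 2):
        hi, lo = (a, b) if a.value >= b.value else (b, a)
        if hi.value - lo.value > MARGIN:
            pairs.append((hi, lo))
    return pairs


def greedy_step(candidates, srm_score):
    """Greedy search step: pick the candidate the SRM scores highest."""
    return max(candidates, key=lambda c: srm_score(c.step_text))
```

For example, among candidates valued 0.9, 0.5, and 0.1, only the (0.9, 0.1) pair clears the 0.7 margin, so a single preference pair is produced; `greedy_step` is independent of those values and follows the SRM's scores alone.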