Entropy-Regularized Process Reward Model
Authors: Hanning Zhang, Pengcheng Wang, Shizhe Diao, Yong Lin, Rui Pan, Hanze Dong, Dylan Zhang, Pavlo Molchanov, Tong Zhang
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models, achieving 1% improvement on GSM8K and 2-3% improvement on MATH under best-of-N evaluation, and more than 1% improvement under RLHF. |
| Researcher Affiliation | Collaboration | University of Illinois Urbana-Champaign; University of Toronto; NVIDIA; Princeton University; Salesforce Research |
| Pseudocode | No | The paper describes methods and processes using mathematical equations and textual explanations, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | To further boost the process reward model research, we will release all the data, code, and checkpoints to the community. |
| Open Datasets | Yes | We conduct experiments using two widely adopted mathematical datasets: GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). We incorporated 12,500 novel questions from the OpenMathInstruct-2 dataset (Toshniwal et al., 2024a) into our RLHF training data to conduct a blended training. Our process reward data is generated and labeled through an automatic labeling approach with entropy regularization, similar to Math-Shepherd (Wang et al., 2024). We conduct experiments using Mistral-7B (Jiang et al., 2023), fine-tuned on the MetaMath dataset (Yu et al., 2024), which we denote as Mistral-MetaMath-7B, and DeepSeek-math-7B-instruct (Shao et al., 2024). |
| Dataset Splits | Yes | We evaluate our reward models using the full test set of GSM8K and the MATH500 dataset introduced by Lightman et al. (2023). All results reported in this paper were evaluated exclusively on the test sets of GSM8K and MATH, never on training data. The OpenMathInstruct-2 data was used only for training, never for evaluation. |
| Hardware Specification | Yes | The reward models are trained with a global batch size of 64, a learning rate of 1e-6, a maximum sequence length of 512, and a single epoch using four H100 GPUs. |
| Software Dependencies | No | The paper mentions several models used (e.g., Mistral-7B, Llama-3.1-8B) but does not provide specific version numbers for ancillary software like programming languages or libraries (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | The reward models are trained with a global batch size of 64, a learning rate of 1e-6, a maximum sequence length of 512, and a single epoch using four H100 GPUs. We perform hyperparameter tuning on the learning rate, exploring values from {5e-6, 2e-6, 1e-6, 5e-7}. During the automatic labeling process, we set η = 2 for Mistral data, and η = 10 for DeepSeek data in our main experiments. We also conduct hyperparameter tuning for η across the values {0.1, 1, 2, 5, 8, 10, 15}. |
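The entropy-regularized automatic labeling referenced in the rows above can be illustrated with a minimal sketch. This assumes the soft label for a reasoning step is a log-mean-exp aggregation of Monte Carlo rollout outcomes scaled by η, which interpolates between the plain mean (η → 0) and the max (η → ∞); the function name and signature are hypothetical, not taken from the paper's released code.

```python
import math

def er_label(rollout_rewards, eta):
    """Hypothetical entropy-regularized soft label for one reasoning step.

    Each entry of rollout_rewards is the outcome of one Monte Carlo
    continuation from this step (e.g. 1.0 if the rollout reached the
    correct final answer, 0.0 otherwise). The log-mean-exp aggregation
    recovers the mean as eta -> 0 and the max as eta -> infinity.
    """
    n = len(rollout_rewards)
    # Factor out the max for numerical stability before exponentiating.
    m = max(eta * r for r in rollout_rewards)
    log_mean_exp = m + math.log(
        sum(math.exp(eta * r - m) for r in rollout_rewards) / n
    )
    return log_mean_exp / eta
```

With unanimous rollouts the label collapses to the hard 0/1 outcome, while mixed rollouts yield a soft label between the mean and the max, which is the behavior that larger η (as tuned over {0.1, 1, 2, 5, 8, 10, 15}) pushes toward optimism.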