Entropy-Regularized Process Reward Model

Authors: Hanning Zhang, Pengcheng Wang, Shizhe Diao, Yong Lin, Rui Pan, Hanze Dong, Dylan Zhang, Pavlo Molchanov, Tong Zhang

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models, achieving a 1% improvement on GSM8K and a 2-3% improvement on MATH under best-of-N evaluation, and a more than 1% improvement under RLHF.
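Best-of-N evaluation, mentioned in the row above, reranks N sampled solutions with the reward model and keeps the highest-scoring one. A minimal sketch, where `score` is a stand-in for the trained PRM (the real model would score each reasoning step and aggregate; `len` below is only a dummy scorer for illustration):

```python
def best_of_n(candidates, score):
    """Return the candidate with the highest reward-model score."""
    return max(candidates, key=score)

# Toy usage with a hypothetical scorer (string length as a dummy proxy):
answers = ["short", "a longer answer", "mid one"]
print(best_of_n(answers, score=len))  # prints "a longer answer"
```

As N grows, best-of-N accuracy increasingly reflects the reward model's ability to rank correct solutions above incorrect ones, which is why it is a standard probe for PRM quality.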
Researcher Affiliation | Collaboration | University of Illinois Urbana-Champaign; University of Toronto; NVIDIA; Princeton University; Salesforce Research
Pseudocode | No | The paper describes its methods using mathematical equations and textual explanations, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | No | To further boost the process reward model research, we will release all the data, code, and checkpoints to the community.
Open Datasets | Yes | We conduct experiments using two widely adopted mathematical datasets: GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). We incorporated 12,500 novel questions from the OpenMathInstruct-2 dataset (Toshniwal et al., 2024a) into our RLHF training data to conduct a blended training. Our process reward data is generated and labeled through an automatic labeling approach with entropy regularization, similar to Math-Shepherd (Wang et al., 2024). We conduct experiments using Mistral-7B (Jiang et al., 2023), fine-tuned on the MetaMath dataset (Yu et al., 2024), which we denote as Mistral-MetaMath-7B, and DeepSeek-math-7B-instruct (Shao et al., 2024).
Dataset Splits | Yes | We evaluate our reward models using the full test set of GSM8K and the MATH500 dataset introduced by Lightman et al. (2023). All results reported in this paper were evaluated exclusively on the test sets of GSM8K and MATH, never on training data. The OpenMathInstruct-2 data was used only for training, never for evaluation.
Hardware Specification | Yes | The reward models are trained with a global batch size of 64, a learning rate of 1e-6, a maximum sequence length of 512, and a single epoch using four H100 GPUs.
Software Dependencies | No | The paper mentions several models used (e.g., Mistral-7B, Llama-3.1-8B) but does not provide specific version numbers for ancillary software such as programming languages or libraries (e.g., Python or PyTorch versions).
Experiment Setup | Yes | The reward models are trained with a global batch size of 64, a learning rate of 1e-6, a maximum sequence length of 512, and a single epoch using four H100 GPUs. We perform hyperparameter tuning on the learning rate, exploring values from {5e-6, 2e-6, 1e-6, 5e-7}. During the automatic labeling process, we set η = 2 for Mistral data and η = 10 for DeepSeek data in our main experiments. We also conduct hyperparameter tuning for η across the values {0.1, 1, 2, 5, 8, 10, 15}.
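To make the role of η concrete: a hedged sketch of the automatic step-labeling described above, assuming the entropy-regularized aggregation of Monte Carlo rollout outcomes takes a log-mean-exp form, r = (1/η)·log(mean_i exp(η·v_i). The function name and the exact aggregation formula are assumptions for illustration, not taken verbatim from the paper:

```python
import math

def entropy_regularized_label(rollout_rewards, eta):
    """Soft label for a reasoning step from rollout outcomes (0/1 correctness).

    Assumed form: r = (1/eta) * log(mean_i exp(eta * v_i)).
    As eta -> 0 this approaches the plain mean of the outcomes
    (a Math-Shepherd-style soft label); large eta approaches the max
    over rollouts (a hard "any rollout succeeds" label).
    """
    n = len(rollout_rewards)
    m = max(rollout_rewards)  # subtract the max for numerical stability
    s = sum(math.exp(eta * (v - m)) for v in rollout_rewards)
    return m + math.log(s / n) / eta

# Example: 3 of 4 rollouts from this step reach a correct final answer.
outcomes = [1.0, 1.0, 1.0, 0.0]
print(entropy_regularized_label(outcomes, eta=2.0))
```

Under this assumed form, the label interpolates between the mean success rate and the best-case outcome, with η controlling how optimistic the step label is; this matches the paper's pattern of tuning η per base model (2 for Mistral, 10 for DeepSeek).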