Entropy-Regularized Process Reward Model

Authors: Hanning Zhang, Pengcheng Wang, Shizhe Diao, Yong Lin, Rui Pan, Hanze Dong, Dylan Zhang, Pavlo Molchanov, Tong Zhang

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models, achieving a 1% improvement on GSM8K and a 2-3% improvement on MATH under best-of-N evaluation, and a more than 1% improvement under RLHF.
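Best-of-N evaluation, mentioned in the row above, reranks N sampled solutions with the reward model and keeps the highest-scoring one. A minimal sketch, where `score` is a stand-in for the trained PRM (the real model would score each reasoning step and aggregate; `len` below is only a dummy scorer for illustration):

```python
def best_of_n(candidates, score):
    """Return the candidate with the highest reward-model score."""
    return max(candidates, key=score)

# Toy usage with a hypothetical scorer (string length as a dummy proxy):
answers = ["short", "a longer answer", "mid one"]
print(best_of_n(answers, score=len))  # prints "a longer answer"
```

As N grows, best-of-N accuracy increasingly reflects the reward model's ability to rank correct solutions above incorrect ones, which is why it is a standard probe for PRM quality.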
Researcher Affiliation | Collaboration | University of Illinois Urbana-Champaign; University of Toronto; NVIDIA; Princeton University; Salesforce Research
Pseudocode | No | The paper describes its methods using mathematical equations and textual explanations, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | No | To further boost the process reward model research, we will release all the data, code, and checkpoints to the community.
Open Datasets | Yes | We conduct experiments using two widely adopted mathematical datasets: GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). We incorporated 12,500 novel questions from the OpenMathInstruct-2 dataset (Toshniwal et al., 2024a) into our RLHF training data to conduct a blended training. Our process reward data is generated and labeled through an automatic labeling approach with entropy regularization, similar to Math-Shepherd (Wang et al., 2024). We conduct experiments using Mistral-7B (Jiang et al., 2023), fine-tuned on the MetaMath dataset (Yu et al., 2024), which we denote as Mistral-MetaMath-7B, and DeepSeek-math-7B-instruct (Shao et al., 2024).
Dataset Splits | Yes | We evaluate our reward models using the full test set of GSM8K and the MATH500 dataset introduced by Lightman et al. (2023). All results reported in this paper were evaluated exclusively on the test sets of GSM8K and MATH, never on training data. The OpenMathInstruct-2 data was used only for training, never for evaluation.
Hardware Specification | Yes | The reward models are trained with a global batch size of 64, a learning rate of 1e-6, a maximum sequence length of 512, and a single epoch using four H100 GPUs.
Software Dependencies | No | The paper mentions several models used (e.g., Mistral-7B, Llama-3.1-8B) but does not provide specific version numbers for ancillary software such as programming languages or libraries (e.g., Python or PyTorch versions).
Experiment Setup | Yes | The reward models are trained with a global batch size of 64, a learning rate of 1e-6, a maximum sequence length of 512, and a single epoch using four H100 GPUs. We perform hyperparameter tuning on the learning rate, exploring values from {5e-6, 2e-6, 1e-6, 5e-7}. During the automatic labeling process, we set η = 2 for Mistral data and η = 10 for DeepSeek data in our main experiments. We also conduct hyperparameter tuning for η across the values {0.1, 1, 2, 5, 8, 10, 15}.
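To make the role of η concrete: a hedged sketch of the automatic step-labeling described above, assuming the entropy-regularized aggregation of Monte Carlo rollout outcomes takes a log-mean-exp form, r = (1/η)·log(mean_i exp(η·v_i). The function name and the exact aggregation formula are assumptions for illustration, not taken verbatim from the paper:

```python
import math

def entropy_regularized_label(rollout_rewards, eta):
    """Soft label for a reasoning step from rollout outcomes (0/1 correctness).

    Assumed form: r = (1/eta) * log(mean_i exp(eta * v_i)).
    As eta -> 0 this approaches the plain mean of the outcomes
    (a Math-Shepherd-style soft label); large eta approaches the max
    over rollouts (a hard "any rollout succeeds" label).
    """
    n = len(rollout_rewards)
    m = max(rollout_rewards)  # subtract the max for numerical stability
    s = sum(math.exp(eta * (v - m)) for v in rollout_rewards)
    return m + math.log(s / n) / eta

# Example: 3 of 4 rollouts from this step reach a correct final answer.
outcomes = [1.0, 1.0, 1.0, 0.0]
print(entropy_regularized_label(outcomes, eta=2.0))
```

Under this assumed form, the label interpolates between the mean success rate and the best-case outcome, with η controlling how optimistic the step label is; this matches the paper's pattern of tuning η per base model (2 for Mistral, 10 for DeepSeek).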