Free Process Rewards without Process Labels

Authors: Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, Hao Peng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, we train our Implicit PRMs with various objectives and evaluate their performance on MATH. Implicit PRMs outperform strong MCTS-based baselines à la Math-Shepherd (Wang et al., 2023) using less than 1/38 of the training data.
Researcher Affiliation | Collaboration | 1University of Illinois Urbana-Champaign, 2Tsinghua University, 3Huazhong University of Science and Technology, 4Shanghai AI Lab.
Pseudocode | No | The paper includes mathematical proofs (e.g., in Appendix A) but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include a direct link to a code repository for the methodology described. It refers to open-source models used as baselines, but not its own implementation.
Open Datasets | Yes | In experiments, we train our Implicit PRMs on a dataset consisting of 33K math instructions and eight solutions for each, and evaluate them through best-of-N sampling on MATH (Hendrycks et al., 2021). We use math instructions from UltraInteract (Yuan et al., 2024) and sample eight rollouts per instruction using Llama-3.1-8B-Instruct. To this end, we incorporate general instructions from UltraFeedback (Cui et al., 2024) and coding instructions from UltraInteract (Yuan et al., 2024) into our training dataset.
Dataset Splits | Yes | We evaluate PRMs with best-of-N (BoN) on MATH-500 (Hendrycks et al., 2021). ... We train our Implicit PRMs on a dataset consisting of 33K math instructions and eight solutions for each, and evaluate them through best-of-N sampling on MATH (Hendrycks et al., 2021). ... Different reward models' best-of-N sampling performance on the MATH test set with three different generation models.
Hardware Specification | Yes | We present the GPU time costs on an A100 80G relative to that of the generation model in Table 3.
Software Dependencies | No | The paper mentions using "vLLM (Kwon et al., 2023)" and "Huggingface Accelerate (Gugger et al., 2022)" but does not provide specific version numbers for these or any other software libraries or programming languages used.
Experiment Setup | Yes | We train PRMs based on Llama-3.1-8B-Instruct with β = 0.05, which is empirically determined. ... For DPO and NCA, we pair each correct rollout with an incorrect counterpart and train our RM on these response-level pairs, while for KTO and CE loss, we directly train on the unpaired and imbalanced rollouts, which is more general in practical scenarios. We also implement two data-balanced setups for CE to analyze the impact of pairwise data, i.e., balancing the positive and negative responses either for the entire dataset as a whole, or more strictly for each instruction. We denote the two setups as Dataset-wise Balanced and Instruction-wise Balanced.
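The "implicit" reward the paper trains against is commonly parameterized as a DPO-style log-likelihood ratio between the policy and a reference model, scaled by β (the β = 0.05 setting quoted above fits this reading). The sketch below illustrates that accumulation at the step level; the function name is hypothetical and plain floats stand in for model log-probabilities.

```python
def implicit_process_reward(step_logps_policy, step_logps_ref, beta=0.05):
    """Sketch of an implicit process reward (hypothetical helper).

    Under the assumed parameterization, the reward of a partial response
    up to step t is beta times the accumulated log-likelihood ratio
    between the policy and the reference model:
        r_t = beta * sum_{i<=t} [log pi(y_i|x, y_<i) - log pi_ref(y_i|x, y_<i)]
    so no per-step human or MCTS labels are needed.
    """
    rewards, cum = [], 0.0
    for lp_pi, lp_ref in zip(step_logps_policy, step_logps_ref):
        cum += lp_pi - lp_ref          # running log-ratio up to this step
        rewards.append(beta * cum)     # step-level process reward
    return rewards
```

With two steps where the policy is slightly more confident than the reference at each step, the rewards grow monotonically, which is what lets a single outcome-trained model score intermediate steps.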
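The best-of-N evaluation protocol quoted in the Dataset Splits row reduces to: sample N candidate solutions per problem, score each with the PRM, and keep the top-scoring one. A minimal sketch with hypothetical names:

```python
def best_of_n(candidates, reward_fn):
    """Best-of-N selection: return the candidate with the highest score.

    candidates : list of sampled solutions for one problem
    reward_fn  : callable scoring a single candidate (e.g. a PRM's
                 score for the final step); name is illustrative only
    """
    return max(candidates, key=reward_fn)
```

Accuracy under BoN is then the fraction of problems whose selected candidate is correct, so a better reward model lifts accuracy without changing the generator.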
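The Instruction-wise Balanced CE setup described above can be sketched as downsampling the majority label within each instruction so that correct and incorrect rollouts are equally represented. This is an assumed implementation detail, not code from the paper; names are hypothetical.

```python
import random
from collections import defaultdict

def balance_instruction_wise(rollouts, seed=0):
    """Balance correct/incorrect rollouts per instruction (sketch).

    rollouts : list of (instruction, solution, is_correct) triples.
    For each instruction, the majority label is randomly downsampled so
    positives and negatives are equal in count, mirroring the paper's
    Instruction-wise Balanced setup (assumed mechanism).
    """
    rng = random.Random(seed)
    by_inst = defaultdict(lambda: ([], []))   # (negatives, positives)
    for inst, sol, ok in rollouts:
        by_inst[inst][1 if ok else 0].append((inst, sol, ok))
    balanced = []
    for neg, pos in by_inst.values():
        k = min(len(neg), len(pos))           # size of the minority label
        balanced += rng.sample(neg, k) + rng.sample(pos, k)
    return balanced
```

The Dataset-wise Balanced variant would apply the same downsampling once over the pooled rollouts instead of per instruction, which is why the paper calls the instruction-wise version the stricter of the two.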