Free Process Rewards without Process Labels

Authors: Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, Hao Peng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, we train our Implicit PRMs with various objectives and evaluate their performance on MATH. Implicit PRMs outperform strong MCTS-based baselines à la Math-Shepherd (Wang et al., 2023) using less than 1/38 of the training data.
Researcher Affiliation | Collaboration | 1University of Illinois Urbana-Champaign, 2Tsinghua University, 3Huazhong University of Science and Technology, 4Shanghai AI Lab.
Pseudocode | No | The paper includes mathematical proofs (e.g., in Appendix A) but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include a direct link to a code repository for the methodology described. It refers to open-source models used as baselines, but not its own implementation.
Open Datasets | Yes | In experiments, we train our Implicit PRMs on a dataset consisting of 33K math instructions and eight solutions for each, and evaluate them through best-of-N sampling on MATH (Hendrycks et al., 2021). We use math instructions from UltraInteract (Yuan et al., 2024) and sample eight rollouts per instruction using Llama-3.1-8B-Instruct. To this end, we incorporate general instructions from UltraFeedback (Cui et al., 2024) and coding instructions from UltraInteract (Yuan et al., 2024) into our training dataset.
Dataset Splits | Yes | We evaluate PRMs with best-of-N (BoN) on MATH-500 (Hendrycks et al., 2021). ... We train our Implicit PRMs on a dataset consisting of 33K math instructions and eight solutions for each, and evaluate them through best-of-N sampling on MATH (Hendrycks et al., 2021). ... Different reward models' best-of-N sampling performance on the MATH test set with three different generation models.
Hardware Specification | Yes | We present the GPU time costs on an A100 80G relative to that of the generation model in Table 3.
Software Dependencies | No | The paper mentions using "vLLM (Kwon et al., 2023)" and "Huggingface Accelerate (Gugger et al., 2022)" but does not provide specific version numbers for these or any other software libraries or programming languages used.
Experiment Setup | Yes | We train PRMs based on Llama-3.1-8B-Instruct with β = 0.05, which is empirically determined. ... For DPO and NCA, we pair each correct rollout with an incorrect counterpart and train our RM on these response-level pairs, while for KTO and CE loss, we directly train on the unpaired and imbalanced rollouts, which is more general in practical scenarios. We also implement two data-balanced setups for CE to analyze the impact of pairwise data, i.e., balancing the positive and negative responses either for the entire dataset as a whole, or more strictly for each instruction. We denote the two setups as Dataset-wise Balanced and Instruction-wise Balanced.
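The "implicit" reward the paper trains against is commonly parameterized as a DPO-style log-likelihood ratio between the policy and a reference model, scaled by β (the β = 0.05 setting quoted above fits this reading). The sketch below illustrates that accumulation at the step level; the function name is hypothetical and plain floats stand in for model log-probabilities.

```python
def implicit_process_reward(step_logps_policy, step_logps_ref, beta=0.05):
    """Sketch of an implicit process reward (hypothetical helper).

    Under the assumed parameterization, the reward of a partial response
    up to step t is beta times the accumulated log-likelihood ratio
    between the policy and the reference model:
        r_t = beta * sum_{i<=t} [log pi(y_i|x, y_<i) - log pi_ref(y_i|x, y_<i)]
    so no per-step human or MCTS labels are needed.
    """
    rewards, cum = [], 0.0
    for lp_pi, lp_ref in zip(step_logps_policy, step_logps_ref):
        cum += lp_pi - lp_ref          # running log-ratio up to this step
        rewards.append(beta * cum)     # step-level process reward
    return rewards
```

With two steps where the policy is slightly more confident than the reference at each step, the rewards grow monotonically, which is what lets a single outcome-trained model score intermediate steps.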
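The best-of-N evaluation protocol quoted in the Dataset Splits row reduces to: sample N candidate solutions per problem, score each with the PRM, and keep the top-scoring one. A minimal sketch with hypothetical names:

```python
def best_of_n(candidates, reward_fn):
    """Best-of-N selection: return the candidate with the highest score.

    candidates : list of sampled solutions for one problem
    reward_fn  : callable scoring a single candidate (e.g. a PRM's
                 score for the final step); name is illustrative only
    """
    return max(candidates, key=reward_fn)
```

Accuracy under BoN is then the fraction of problems whose selected candidate is correct, so a better reward model lifts accuracy without changing the generator.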
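The Instruction-wise Balanced CE setup described above can be sketched as downsampling the majority label within each instruction so that correct and incorrect rollouts are equally represented. This is an assumed implementation detail, not code from the paper; names are hypothetical.

```python
import random
from collections import defaultdict

def balance_instruction_wise(rollouts, seed=0):
    """Balance correct/incorrect rollouts per instruction (sketch).

    rollouts : list of (instruction, solution, is_correct) triples.
    For each instruction, the majority label is randomly downsampled so
    positives and negatives are equal in count, mirroring the paper's
    Instruction-wise Balanced setup (assumed mechanism).
    """
    rng = random.Random(seed)
    by_inst = defaultdict(lambda: ([], []))   # (negatives, positives)
    for inst, sol, ok in rollouts:
        by_inst[inst][1 if ok else 0].append((inst, sol, ok))
    balanced = []
    for neg, pos in by_inst.values():
        k = min(len(neg), len(pos))           # size of the minority label
        balanced += rng.sample(neg, k) + rng.sample(pos, k)
    return balanced
```

The Dataset-wise Balanced variant would apply the same downsampling once over the pooled rollouts instead of per instruction, which is why the paper calls the instruction-wise version the stricter of the two.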