OpenPRM: Building Open-domain Process-based Reward Models with Preference Trees

Authors: Kaiyan Zhang, Jiayuan Zhang, Haoxin Li, Xuekai Zhu, Ermo Hua, Xingtai Lv, Ning Ding, Biqing Qi, Bowen Zhou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We assess the performance of OpenPRM across various reward benchmarks, demonstrating its competitive edge over traditional ORMs in open domains and PRMs in specialized domains. Additionally, we investigate the scalability of inference-time computation for open-domain instructions. Our results highlight the limitations of ORMs' scalability, while OpenPRM shows superior performance in scaled settings. Despite these advances, achieving automatic fine-grained supervision for open-domain inference-time scaling remains a substantial challenge.
Researcher Affiliation | Academia | Kaiyan Zhang^1, Jiayuan Zhang^2, Haoxin Li^1, Xuekai Zhu^3, Ermo Hua^1, Xingtai Lv^1, Ning Ding^1, Biqing Qi^4, Bowen Zhou^1,4 (1 Tsinghua University; 2 Beihang University; 3 Shanghai Jiao Tong University; 4 Shanghai Artificial Intelligence Laboratory)
Pseudocode | No | The paper describes methods using equations and prose, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | No | The paper refers to third-party open-source reward models and states, "We will release all of these sampling data to the public to encourage further study on process and outcome reward models for inference-time scaling." However, it does not explicitly state that the source code for the methodology described in *this paper* is open-source or provide a link to it.
Open Datasets | Yes | This construction utilizes the UltraFeedback (Cui et al., 2023) and ScienceQA (Lu et al., 2022) datasets, which provide a highly diverse and high-quality range of instructions. Additionally, we incorporate the MATH (Hendrycks et al., 2021) dataset to further enhance the math reasoning capabilities of our reward system.
Dataset Splits | Yes | We evaluate the effectiveness of process supervision of reward models solely on the test set of PRM800K (Lightman et al., 2023), which features high-quality human annotations. ... We randomly sample 500 questions from the test data for our evaluations. ... We use the MATH500 version, which contains 500 samples that maintain IID consistency with the original test dataset, to evaluate the mathematics abilities of LLMs under scaled inference-time settings.
Hardware Specification | No | The paper mentions running experiments with Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct models and refers to the vLLM engine, but it does not provide specific details on the underlying hardware (e.g., GPU models, CPU types, memory) used for these experiments.
Software Dependencies | No | The paper mentions using the vLLM engine, with a GitHub link in a footnote. However, it does not specify a version number for vLLM or any other software dependencies crucial for replication.
Experiment Setup | Yes | The threshold for segment aggregation is set differently for all tasks based on the distribution of similarity, and the rewards gap threshold between process pairs is 1.0 for UltraFeedback and 0.2 for MATH and ScienceQA. ... We set the temperature to 0.5 and top-p to 1.0 for repeated sampling with the vLLM engine. For reproducing the UltraFeedback and HelpSteer2 reward models, we finetune Llama-3-8B-Instruct using a learning rate of 5×10⁻⁶ over 1 epoch. Meanwhile, we finetune InternRM and FsfairX on process pairs using a learning rate of 1×10⁻⁶ over 1 epoch. All models are finetuned with a batch size of 64 and a maximum sequence length of 2048.
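The reported setup above (repeated sampling at temperature 0.5 / top-p 1.0, then reward-guided selection, with the listed finetuning hyperparameters) can be captured in a minimal sketch. The hyperparameter values are taken directly from the report; the config key names, the `score_fn` argument, and the best-of-N helper are illustrative assumptions, not the authors' released code.

```python
# Sketch of the reported experiment setup. Values mirror the report;
# structure and helper names are illustrative assumptions.

SAMPLING = {"temperature": 0.5, "top_p": 1.0}  # repeated sampling via vLLM

FINETUNE = {
    # UltraFeedback / HelpSteer2 reward models on Llama-3-8B-Instruct
    "ultrafeedback_helpsteer2": {"lr": 5e-6, "epochs": 1},
    # InternRM / FsfairX finetuned on process pairs
    "internrm_fsfairx_process_pairs": {"lr": 1e-6, "epochs": 1},
    "batch_size": 64,
    "max_seq_len": 2048,
}

def best_of_n(candidates, score_fn):
    """Pick the sampled response the reward model scores highest
    (generic best-of-N selection; the scoring function is a stand-in
    for a trained PRM/ORM)."""
    return max(candidates, key=score_fn)

if __name__ == "__main__":
    # Toy usage with a stand-in scorer that prefers longer answers.
    picked = best_of_n(["short", "a longer answer"], score_fn=len)
    print(picked)
```

In practice the scorer would be a trained reward model applied to each of the N sampled responses; the dict-based config is just one way to keep the reported hyperparameters in a reproducible form.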