OpenPRM: Building Open-domain Process-based Reward Models with Preference Trees

Authors: Kaiyan Zhang, Jiayuan Zhang, Haoxin Li, Xuekai Zhu, Ermo Hua, Xingtai Lv, Ning Ding, Biqing Qi, Bowen Zhou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We assess the performance of OpenPRM across various reward benchmarks, demonstrating its competitive edge over traditional ORMs in open domains and PRMs in specialized domains. Additionally, we investigate the scalability of inference-time computation for open-domain instructions. Our results highlight the limitations of ORMs' scalability, while OpenPRM shows superior performance in scaled settings. Despite these advances, achieving automatic fine-grained supervision for open-domain inference-time scaling remains a substantial challenge.
Researcher Affiliation | Academia | Kaiyan Zhang^1, Jiayuan Zhang^2, Haoxin Li^1, Xuekai Zhu^3, Ermo Hua^1, Xingtai Lv^1, Ning Ding^1, Biqing Qi^4, Bowen Zhou^1,4 (1 Tsinghua University; 2 Beihang University; 3 Shanghai Jiao Tong University; 4 Shanghai Artificial Intelligence Laboratory)
Pseudocode | No | The paper describes methods using equations and prose, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | No | The paper refers to third-party open-source reward models and states, "We will release all of these sampling data to the public to encourage further study on process and outcome reward models for inference-time scaling." However, it does not explicitly state that the source code for the methodology described in *this paper* is open-source or provide a link to it.
Open Datasets | Yes | This construction utilizes the UltraFeedback (Cui et al., 2023) and ScienceQA (Lu et al., 2022) datasets, which provide a highly diverse and high-quality range of instructions. Additionally, we incorporate the MATH (Hendrycks et al., 2021) dataset to further enhance the math reasoning capabilities of our reward system.
Dataset Splits | Yes | We evaluate the effectiveness of process supervision of reward models solely on the test set of PRM800K (Lightman et al., 2023), which features high-quality human annotations. ... We randomly sample 500 questions from the test data for our evaluations. ... We use the MATH500 version, which contains 500 samples that maintain IID consistency with the original test dataset, to evaluate the mathematics abilities of LLMs under scaled inference-time settings.
Hardware Specification | No | The paper mentions running experiments with Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct models and refers to the vLLM engine, but it does not provide specific details on the underlying hardware (e.g., GPU models, CPU types, memory) used for these experiments.
Software Dependencies | No | The paper mentions using the vLLM engine, with a GitHub link in a footnote. However, it does not specify a version number for vLLM or any other software dependencies crucial for replication.
Experiment Setup | Yes | The threshold for segment aggregation is set differently for all tasks based on the distribution of similarity, and the rewards gap threshold between process pairs is 1.0 for UltraFeedback and 0.2 for MATH and ScienceQA. ... We set the temperature to 0.5 and top-p to 1.0 for repeated sampling with the vLLM engine. For reproducing the UltraFeedback and HelpSteer2 reward models, we finetune Llama-3-8B-Instruct using a learning rate of 5×10⁻⁶ over 1 epoch. Meanwhile, we finetune InternRM and FsfairX on process pairs using a learning rate of 1×10⁻⁶ over 1 epoch. All models are finetuned with a batch size of 64 and a maximum sequence length of 2048.
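The reported setup above (repeated sampling at temperature 0.5 / top-p 1.0, then reward-guided selection, with the listed finetuning hyperparameters) can be captured in a minimal sketch. The hyperparameter values are taken directly from the report; the config key names, the `score_fn` argument, and the best-of-N helper are illustrative assumptions, not the authors' released code.

```python
# Sketch of the reported experiment setup. Values mirror the report;
# structure and helper names are illustrative assumptions.

SAMPLING = {"temperature": 0.5, "top_p": 1.0}  # repeated sampling via vLLM

FINETUNE = {
    # UltraFeedback / HelpSteer2 reward models on Llama-3-8B-Instruct
    "ultrafeedback_helpsteer2": {"lr": 5e-6, "epochs": 1},
    # InternRM / FsfairX finetuned on process pairs
    "internrm_fsfairx_process_pairs": {"lr": 1e-6, "epochs": 1},
    "batch_size": 64,
    "max_seq_len": 2048,
}

def best_of_n(candidates, score_fn):
    """Pick the sampled response the reward model scores highest
    (generic best-of-N selection; the scoring function is a stand-in
    for a trained PRM/ORM)."""
    return max(candidates, key=score_fn)

if __name__ == "__main__":
    # Toy usage with a stand-in scorer that prefers longer answers.
    picked = best_of_n(["short", "a longer answer"], score_fn=len)
    print(picked)
```

In practice the scorer would be a trained reward model applied to each of the N sampled responses; the dict-based config is just one way to keep the reported hyperparameters in a reproducible form.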