Process Reward Model with Q-value Rankings

Authors: Wendi Li, Yixuan Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive empirical evaluations across various sampling policies, language model backbones, and multi-step reasoning benchmarks show that PQM outperforms classification-based PRMs. The effectiveness of the comparative loss function is highlighted in our comprehensive ablation studies, confirming PQM's practical efficacy and theoretical advantage.
Researcher Affiliation | Academia | Wendi Li, Department of Computer Science, Huazhong University of Science and Technology; Yixuan Li, Department of Computer Sciences, University of Wisconsin-Madison.
Pseudocode | No | The paper describes its methods and mathematical derivations in paragraph text and equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our codes can be found at https://github.com/WindyLee0822/Process_Q_Model.
Open Datasets | Yes | The test corpus includes 128 solutions for each question from the GSM-Plus (Li et al., 2024) and MATH500 (Hendrycks et al., 2021) datasets. These solutions are sampled from three policy models with strong performance on math tasks at different scales: MetaMath-Mistral-7B (Yu et al., 2024), MuggleMath-13B (Li et al., 2023a), and Llama-3-70B-Instruct (AI@Meta, 2024). We utilize the existing off-the-shelf corpus, Math-Shepherd (Wang et al., 2023a), as our training corpus.
Dataset Splits | Yes | Training uses the off-the-shelf Math-Shepherd corpus (Wang et al., 2023a); the test corpus comprises 128 solutions per question from GSM-Plus (Li et al., 2024) and MATH500 (Hendrycks et al., 2021), sampled from three policy models. To examine whether PQM robustly outperforms classification-based PRMs across different dataset sizes, we randomly sample 25%, 50%, and 75% of the original dataset to train PRMs with BCE loss and PQM loss.
Hardware Specification | Yes | All training is conducted on 8 NVIDIA A100-SXM4-80GB GPUs.
Software Dependencies | Yes | We list the versions of the important external packages as follows: torch==2.3.1, trl==0.8.0, flash-attn==2.6.2, transformers==4.34.0, accelerate==0.33.0, deepspeed==0.13.1, nvidia-nccl-cu12==2.20.5.
Experiment Setup | Yes | The hyperparameters for the ablation studies are provided in Table 5, and each ablation training session took approximately 4.5 hours. For the main experiments, some training examples have tokenized sequences longer than 2048 tokens, which limited the batch size and reduced training efficiency. To address this, we divided the training corpus into three groups by tokenized length: shorter than 512 tokens, between 512 and 1024 tokens, and longer than 1024 tokens, with batch sizes of 64, 24, and 8, respectively. This strategy reduced training time from about eleven hours to six hours.
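The two procedures quoted above (random subsampling of the training corpus at 25%/50%/75% for the ablation, and grouping examples into three length buckets with per-bucket batch sizes for the main experiments) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and bucket keys are assumptions, and a generic length function stands in for the actual tokenizer.

```python
import random


def subsample(dataset, fraction, seed=0):
    """Randomly draw a fraction of the training corpus (25%, 50%, or 75%
    in the ablation). The paper does not specify the exact sampling code."""
    rng = random.Random(seed)
    return rng.sample(dataset, int(len(dataset) * fraction))


def bucket_by_length(examples, token_len):
    """Split examples into the three length groups described above, paired
    with the reported per-group batch sizes (64, 24, 8)."""
    groups = {"<512": ([], 64), "512-1024": ([], 24), ">1024": ([], 8)}
    for ex in examples:
        n = token_len(ex)
        key = "<512" if n < 512 else "512-1024" if n <= 1024 else ">1024"
        groups[key][0].append(ex)
    return groups


# Toy usage: string length stands in for tokenized sequence length.
corpus = ["x" * n for n in (100, 600, 900, 1500, 3000)]
groups = bucket_by_length(corpus, len)
```

Bucketing by length before batching keeps padding overhead low for short sequences while capping memory use for long ones, which is how the reported eleven-to-six-hour speedup plausibly arises.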