Process Reward Model with Q-value Rankings
Authors: Wendi Li, Yixuan Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive empirical evaluations across various sampling policies, language model backbones, and multi-step reasoning benchmarks show that PQM outperforms classification-based PRMs. The effectiveness of the comparative loss function is highlighted in our comprehensive ablation studies, confirming PQM's practical efficacy and theoretical advantage. |
| Researcher Affiliation | Academia | Wendi Li Department of Computer Science Huazhong University of Science and Technology EMAIL Yixuan Li Department of Computer Sciences University of Wisconsin-Madison EMAIL |
| Pseudocode | No | The paper describes methods and mathematical derivations in paragraph text and equations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codes can be found at https://github.com/WindyLee0822/Process_Q_Model. |
| Open Datasets | Yes | The test corpus includes 128 solutions for each question from GSM-Plus (Li et al., 2024) and MATH500 (Hendrycks et al., 2021) datasets. These solutions are sampled from three policy models with strong performance in math tasks with different scales: MetaMath-Mistral-7B (Yu et al., 2024), MuggleMath-13B (Li et al., 2023a), Llama-3-70B-Instruct (AI@Meta, 2024). We utilize the existing off-the-shelf corpus, Math-Shepherd (Wang et al., 2023a), as our training corpus. |
| Dataset Splits | Yes | The test corpus includes 128 solutions for each question from GSM-Plus (Li et al., 2024) and MATH500 (Hendrycks et al., 2021) datasets. These solutions are sampled from three policy models with strong performance in math tasks with different scales: MetaMath-Mistral-7B (Yu et al., 2024), MuggleMath-13B (Li et al., 2023a), Llama-3-70B-Instruct (AI@Meta, 2024). We utilize the existing off-the-shelf corpus, Math-Shepherd (Wang et al., 2023a), as our training corpus. ... To examine whether PQM robustly outperforms classification-based PRM across different dataset sizes, we randomly sample 25%, 50%, and 75% of the original dataset to train PRMs with BCE loss and PQM loss. |
| Hardware Specification | Yes | All training is conducted on 8 NVIDIA A100-SXM4-80GB GPUs. |
| Software Dependencies | Yes | We list the versions of the important external packages as follows: torch==2.3.1, trl==0.8.0, flash-attn==2.6.2, transformers==4.34.0, accelerate==0.33.0, deepspeed==0.13.1, nvidia-nccl-cu12==2.20.5. |
| Experiment Setup | Yes | The hyperparameters for the ablation studies are provided in Table 5, and each training session for the ablation study took approximately 4.5 hours. For the main experiments, some training data has tokenized sequences longer than 2048 tokens, which limited the batch size and reduced training efficiency. To address this, we divide the training corpus into three groups based on tokenized length: sequences shorter than 512 tokens, between 512 and 1024 tokens, and greater than 1024 tokens. The batch sizes were set to 64, 24, and 8, respectively, for these groups. This strategy reduced the training time from about eleven hours to six hours. |
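The length-bucketing strategy quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: the bucket boundaries (512 and 1024 tokens) and per-bucket batch sizes (64, 24, 8) come from the report, while the helper names `bucket_for` and `make_batches` are hypothetical, and the handling of a sequence of exactly 1024 tokens is an assumption since the quoted text leaves that boundary ambiguous.

```python
# Sketch of the length-bucketing described in the experiment setup:
# the training corpus is split into three groups by tokenized length so
# each group can use the largest batch size that fits in memory.

# (exclusive upper bound on token count, batch size); None = unbounded.
# Boundary handling at exactly 1024 tokens is an assumption.
BUCKETS = [(512, 64), (1024, 24), (None, 8)]

def bucket_for(num_tokens):
    """Return the index of the length bucket for a sequence."""
    for i, (upper, _) in enumerate(BUCKETS):
        if upper is None or num_tokens < upper:
            return i

def make_batches(token_lengths):
    """Group sequence indices into batches, sized per bucket."""
    groups = {i: [] for i in range(len(BUCKETS))}
    for idx, n in enumerate(token_lengths):
        groups[bucket_for(n)].append(idx)
    batches = []
    for i, (_, bsz) in enumerate(BUCKETS):
        seqs = groups[i]
        for start in range(0, len(seqs), bsz):
            batches.append(seqs[start:start + bsz])
    return batches
```

Because batches never mix buckets, short sequences are not padded up to the longest sequence in the corpus, which is what recovers the training-time reduction the report describes.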