Process Reward Model with Q-value Rankings

Authors: Wendi Li, Yixuan Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive empirical evaluations across various sampling policies, language model backbones, and multi-step reasoning benchmarks show that PQM outperforms classification-based PRMs. The effectiveness of the comparative loss function is highlighted in our comprehensive ablation studies, confirming PQM's practical efficacy and theoretical advantage.
Researcher Affiliation | Academia | Wendi Li, Department of Computer Science, Huazhong University of Science and Technology; Yixuan Li, Department of Computer Sciences, University of Wisconsin-Madison.
Pseudocode | No | The paper describes its methods and mathematical derivations in paragraph text and equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our codes can be found at https://github.com/WindyLee0822/Process_Q_Model.
Open Datasets | Yes | The test corpus includes 128 solutions for each question from the GSM-Plus (Li et al., 2024) and MATH500 (Hendrycks et al., 2021) datasets. These solutions are sampled from three policy models with strong performance on math tasks at different scales: MetaMath-Mistral-7B (Yu et al., 2024), MuggleMath-13B (Li et al., 2023a), and Llama-3-70B-Instruct (AI@Meta, 2024). We utilize the existing off-the-shelf corpus, Math-Shepherd (Wang et al., 2023a), as our training corpus.
Dataset Splits | Yes | Training uses the off-the-shelf Math-Shepherd corpus (Wang et al., 2023a); the test corpus comprises 128 solutions per question from GSM-Plus (Li et al., 2024) and MATH500 (Hendrycks et al., 2021), sampled from three policy models. To examine whether PQM robustly outperforms classification-based PRMs across different dataset sizes, we randomly sample 25%, 50%, and 75% of the original dataset to train PRMs with BCE loss and PQM loss.
Hardware Specification | Yes | All training is conducted on 8 NVIDIA A100-SXM4-80GB GPUs.
Software Dependencies | Yes | We list the versions of the important external packages as follows: torch==2.3.1, trl==0.8.0, flash-attn==2.6.2, transformers==4.34.0, accelerate==0.33.0, deepspeed==0.13.1, nvidia-nccl-cu12==2.20.5.
Experiment Setup | Yes | The hyperparameters for the ablation studies are provided in Table 5, and each ablation training session took approximately 4.5 hours. For the main experiments, some training examples have tokenized sequences longer than 2048 tokens, which limited the batch size and reduced training efficiency. To address this, we divided the training corpus into three groups by tokenized length: shorter than 512 tokens, between 512 and 1024 tokens, and longer than 1024 tokens, with batch sizes of 64, 24, and 8, respectively. This strategy reduced training time from about eleven hours to six hours.
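The two procedures quoted above (random subsampling of the training corpus at 25%/50%/75% for the ablation, and grouping examples into three length buckets with per-bucket batch sizes for the main experiments) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and bucket keys are assumptions, and a generic length function stands in for the actual tokenizer.

```python
import random


def subsample(dataset, fraction, seed=0):
    """Randomly draw a fraction of the training corpus (25%, 50%, or 75%
    in the ablation). The paper does not specify the exact sampling code."""
    rng = random.Random(seed)
    return rng.sample(dataset, int(len(dataset) * fraction))


def bucket_by_length(examples, token_len):
    """Split examples into the three length groups described above, paired
    with the reported per-group batch sizes (64, 24, 8)."""
    groups = {"<512": ([], 64), "512-1024": ([], 24), ">1024": ([], 8)}
    for ex in examples:
        n = token_len(ex)
        key = "<512" if n < 512 else "512-1024" if n <= 1024 else ">1024"
        groups[key][0].append(ex)
    return groups


# Toy usage: string length stands in for tokenized sequence length.
corpus = ["x" * n for n in (100, 600, 900, 1500, 3000)]
groups = bucket_by_length(corpus, len)
```

Bucketing by length before batching keeps padding overhead low for short sequences while capping memory use for long ones, which is how the reported eleven-to-six-hour speedup plausibly arises.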