Discriminative Policy Optimization for Token-Level Reward Models
Authors: Hongzhan Chen, Tao Yang, Shiping Gao, Ruijun Chen, Xiaojun Quan, Hongtao Tian, Ting Yao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12× faster than ORM on GSM8K and 11× faster than step-level PRM on MATH. Code and data are available at https://github.com/homzer/Q-RM. |
| Researcher Affiliation | Collaboration | ¹School of Computer Science and Engineering, Sun Yat-sen University, China; ²WeChat Search, Tencent Inc., China. Hongzhan Chen <EMAIL>. Correspondence to: Xiaojun Quan <EMAIL>, Tao Yang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (PPO with Q-RM Optimization). Input: dataset D, learning rate η, clipping parameter ϵ, KL coefficient β, Q-RM, SFT model π_SFT. Initialize policy parameters θ and value-function parameters ψ. Repeat until convergence: (1) sample a batch of trajectories {τ_i} from D using the current policy π_θ; (2) for each trajectory τ_i, compute token rewards Z(s_t, a_t) using Q-RM; (3) compute the mean µ and variance σ² of Z(s_t, a_t) across all positions in every trajectory within the batch and standardize the rewards: Z_std(s_t, a_t) = (Z(s_t, a_t) − µ) / σ; (4) for each trajectory, compute the KL penalty KL(s_t) = KL(π_θ(· ∣ s_t) ‖ π_SFT(· ∣ s_t)) and the advantage A_t = Z_std(s_t, a_t) − β·KL(s_t) − V_ψ(s_t) using the value function V_ψ; (5) compute the ratio ρ_t(θ) = π_θ(a_t ∣ s_t) / π_θold(a_t ∣ s_t); (6) compute the clipped surrogate objective L_CLIP = min(ρ_t(θ)·A_t, clip(ρ_t(θ), 1 − ϵ, 1 + ϵ)·A_t) and the value loss L_VF = (V_ψ(s_t) − Z_std(s_t, a_t))²; (7) update the policy parameters θ ← θ + η·∇_θ L_CLIP and the value-function parameters ψ ← ψ − η·∇_ψ L_VF. |
| Open Source Code | Yes | Code and data are available at https://github.com/homzer/Q-RM. |
| Open Datasets | Yes | For GSM8K and MATH, we utilize their original training sets. For each instruction, we leverage Llama-2-7B/13B/70B-Chat (Touvron et al., 2023), Llama-3-8B/70B-Instruct (Dubey et al., 2024), and Qwen-2.5-7B/14B/32B/72B-Instruct (Yang et al., 2024), totaling 9 types of models for sampling. This process generates 108 responses for each instruction, which are evaluated against the labels. The correct responses are selected as chosen samples, while incorrect ones are marked as rejected. After filtering and deduplication, we obtain 45,364 pairwise samples for GSM8K and 94,941 for MATH. We utilize a similar strategy to construct test preference data on the GSM8K test set, resulting in 787 pairwise samples. For QA-Feedback, we utilize original preference data from their training set, totaling 17,835 pairs. For Alpaca Eval 2.0, we utilize UltraFeedback as preference data, with a total of 52,712 pairs. |
| Dataset Splits | Yes | For GSM8K and MATH, we utilize their original training sets. For each instruction, we leverage Llama-2-7B/13B/70B-Chat (Touvron et al., 2023), Llama-3-8B/70B-Instruct (Dubey et al., 2024), and Qwen-2.5-7B/14B/32B/72B-Instruct (Yang et al., 2024), totaling 9 types of models for sampling. This process generates 108 responses for each instruction, which are evaluated against the labels. The correct responses are selected as chosen samples, while incorrect ones are marked as rejected. After filtering and deduplication, we obtain 45,364 pairwise samples for GSM8K and 94,941 for MATH. We utilize a similar strategy to construct test preference data on the GSM8K test set, resulting in 787 pairwise samples. For QA-Feedback, we utilize original preference data from their training set, totaling 17,835 pairs. For Alpaca Eval 2.0, we utilize the instructions from UltraFeedback for RL training. In direct preference optimization methods, the training preference data is the same as the RM training data. |
| Hardware Specification | Yes | All reward models are trained on 8 × 80 GB Nvidia A100 GPUs with bfloat16 precision and a LoRA (Hu et al., 2022) rank of 128, using model parallelism with a size of 8. The maximum sequence length is set to 1024 for GSM8K and MATH, and 2048 for UltraFeedback and QA-Feedback. The Adam (Kingma, 2014) optimizer is employed. RL training: all policy models are trained on 8 × 40 GB Nvidia A100 GPUs, with bfloat16 precision and full-parameter training, using model parallelism with a size of 8. |
| Software Dependencies | No | The paper mentions the 'Adam (Kingma, 2014) optimizer', 'bfloat16 precision', 'LoRA (Hu et al., 2022)', and refers to the 'TRL repository', but does not specify version numbers for any software libraries, programming languages, or specific frameworks such as PyTorch or TensorFlow. |
| Experiment Setup | Yes | For PPO-based methods, we set the learning rate to 5e-6, the clipping range to 0.2, GAE lambda to 0.8, PPO gamma to 0.9, the KL coefficient to 0.01, and the rollout batch size to 3,072. For REINFORCE-based methods, the learning rate is set to 1e-6 with the same rollout batch size of 3,072. All policy models are trained with bfloat16 precision and full parameter optimization. More training details are provided in Appendix D. |
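The per-token update described in the pseudocode row can be sketched in numpy. This is a minimal illustration of one objective evaluation under Algorithm 1, not the authors' implementation: the function name `ppo_qrm_step` and the flat per-token array interface are assumptions, while the defaults β = 0.01 and ϵ = 0.2 follow the experiment-setup row.

```python
import numpy as np

def ppo_qrm_step(z, kl, values, logp_new, logp_old, beta=0.01, eps=0.2):
    """One PPO objective evaluation with Q-RM token-level rewards.

    z        : per-token rewards Z(s_t, a_t) from Q-RM, shape (T,)
    kl       : per-token KL(pi_theta || pi_SFT), shape (T,)
    values   : value-function estimates V_psi(s_t), shape (T,)
    logp_new : per-token log-probs under the current policy, shape (T,)
    logp_old : per-token log-probs under the rollout policy, shape (T,)
    Returns (policy_loss, value_loss) as scalars to minimize.
    """
    # Standardize rewards using batch statistics, as in Algorithm 1.
    z_std = (z - z.mean()) / (z.std() + 1e-8)
    # Advantage: standardized reward minus KL penalty minus value baseline.
    adv = z_std - beta * kl - values
    # Importance ratio and clipped surrogate (a maximization objective,
    # so we negate it to obtain a loss).
    ratio = np.exp(logp_new - logp_old)
    surrogate = np.minimum(ratio * adv,
                           np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)
    policy_loss = -surrogate.mean()
    # Value function regresses toward the standardized rewards.
    value_loss = ((values - z_std) ** 2).mean()
    return policy_loss, value_loss
```

Standardizing with batch-level mean and variance (rather than per-trajectory statistics) matches the pseudocode, which pools Z(s_t, a_t) across all positions in every trajectory within the batch before computing advantages.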