Discriminative Policy Optimization for Token-Level Reward Models
Authors: Hongzhan Chen, Tao Yang, Shiping Gao, Ruijun Chen, Xiaojun Quan, Hongtao Tian, Ting Yao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12× faster than ORM on GSM8K and 11× faster than step-level PRM on MATH. Code and data are available at https://github.com/homzer/Q-RM. |
| Researcher Affiliation | Collaboration | ¹School of Computer Science and Engineering, Sun Yat-sen University, China; ²WeChat Search, Tencent Inc., China. Hongzhan Chen <EMAIL>. Correspondence to: Xiaojun Quan <EMAIL>, Tao Yang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (PPO with Q-RM Optimization). Input: dataset D, learning rate η, clipping parameter ϵ, KL coefficient β, Q-RM, SFT model π_SFT. Initialize policy parameters θ and value-function parameters ψ. Repeat until convergence: (1) sample a batch of trajectories {τ_i} from D using the current policy π_θ; (2) for each trajectory τ_i, compute token rewards Z(s_t, a_t) using Q-RM; (3) compute the mean µ and variance σ² of Z(s_t, a_t) across all positions in every trajectory within the batch and standardize the rewards: Z_std(s_t, a_t) = (Z(s_t, a_t) − µ) / σ; (4) for each trajectory, compute the KL penalty KL(s_t) = KL(π_θ(· ∣ s_t) ‖ π_SFT(· ∣ s_t)) and the advantage A_t = Z_std(s_t, a_t) − β·KL(s_t) − V_ψ(s_t) using the value function V_ψ; (5) compute the ratio ρ_t(θ) = π_θ(a_t ∣ s_t) / π_θold(a_t ∣ s_t); (6) compute the clipped surrogate objective L_CLIP = min(ρ_t(θ)·A_t, clip(ρ_t(θ), 1 − ϵ, 1 + ϵ)·A_t) and the value loss L_VF = (V_ψ(s_t) − Z_std(s_t, a_t))²; (7) update the policy parameters θ ← θ + η·∇_θ L_CLIP and the value-function parameters ψ ← ψ − η·∇_ψ L_VF. |
| Open Source Code | Yes | Code and data are available at https://github.com/homzer/Q-RM. |
| Open Datasets | Yes | For GSM8K and MATH, we utilize their original training sets. For each instruction, we leverage Llama-2-7B/13B/70B-Chat (Touvron et al., 2023), Llama-3-8B/70B-Instruct (Dubey et al., 2024), and Qwen-2.5-7B/14B/32B/72B-Instruct (Yang et al., 2024), totaling 9 types of models for sampling. This process generates 108 responses for each instruction, which are evaluated against the labels. The correct responses are selected as chosen samples, while incorrect ones are marked as rejected. After filtering and deduplication, we obtain 45,364 pairwise samples for GSM8K and 94,941 for MATH. We utilize a similar strategy to construct test preference data on the GSM8K test set, resulting in 787 pairwise samples. For QA-Feedback, we utilize original preference data from their training set, totaling 17,835 pairs. For Alpaca Eval 2.0, we utilize UltraFeedback as preference data, with a total of 52,712 pairs. |
| Dataset Splits | Yes | For GSM8K and MATH, we utilize their original training sets. For each instruction, we leverage Llama-2-7B/13B/70B-Chat (Touvron et al., 2023), Llama-3-8B/70B-Instruct (Dubey et al., 2024), and Qwen-2.5-7B/14B/32B/72B-Instruct (Yang et al., 2024), totaling 9 types of models for sampling. This process generates 108 responses for each instruction, which are evaluated against the labels. The correct responses are selected as chosen samples, while incorrect ones are marked as rejected. After filtering and deduplication, we obtain 45,364 pairwise samples for GSM8K and 94,941 for MATH. We utilize a similar strategy to construct test preference data on the GSM8K test set, resulting in 787 pairwise samples. For QA-Feedback, we utilize original preference data from their training set, totaling 17,835 pairs. For Alpaca Eval 2.0, we utilize the instructions from UltraFeedback for RL training. In direct preference optimization methods, the training preference data is the same as the RM training data. |
| Hardware Specification | Yes | All reward models are trained on 8 × 80 GB Nvidia A100 GPUs with bfloat16 precision and a LoRA (Hu et al., 2022) rank of 128, using model parallelism with a size of 8. The maximum sequence length is set to 1024 for GSM8K and MATH, and 2048 for UltraFeedback and QA-Feedback. The Adam (Kingma, 2014) optimizer is employed. RL training: all policy models are trained on 8 × 40 GB Nvidia A100 GPUs, with bfloat16 precision and full-parameter training, using model parallelism with a size of 8. |
| Software Dependencies | No | The paper mentions the 'Adam (Kingma, 2014) optimizer', 'bfloat16 precision', 'LoRA (Hu et al., 2022)', and refers to the 'TRL repository', but does not specify version numbers for any software libraries, programming languages, or specific frameworks such as PyTorch or TensorFlow. |
| Experiment Setup | Yes | For PPO-based methods, we set the learning rate to 5e-6, the clipping range to 0.2, GAE lambda to 0.8, PPO gamma to 0.9, the KL coefficient to 0.01, and the rollout batch size to 3,072. For REINFORCE-based methods, the learning rate is set to 1e-6 with the same rollout batch size of 3,072. All policy models are trained with bfloat16 precision and full parameter optimization. More training details are provided in Appendix D. |
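The per-token update described in the pseudocode row can be sketched in numpy. This is a minimal illustration of one objective evaluation under Algorithm 1, not the authors' implementation: the function name `ppo_qrm_step` and the flat per-token array interface are assumptions, while the defaults β = 0.01 and ϵ = 0.2 follow the experiment-setup row.

```python
import numpy as np

def ppo_qrm_step(z, kl, values, logp_new, logp_old, beta=0.01, eps=0.2):
    """One PPO objective evaluation with Q-RM token-level rewards.

    z        : per-token rewards Z(s_t, a_t) from Q-RM, shape (T,)
    kl       : per-token KL(pi_theta || pi_SFT), shape (T,)
    values   : value-function estimates V_psi(s_t), shape (T,)
    logp_new : per-token log-probs under the current policy, shape (T,)
    logp_old : per-token log-probs under the rollout policy, shape (T,)
    Returns (policy_loss, value_loss) as scalars to minimize.
    """
    # Standardize rewards using batch statistics, as in Algorithm 1.
    z_std = (z - z.mean()) / (z.std() + 1e-8)
    # Advantage: standardized reward minus KL penalty minus value baseline.
    adv = z_std - beta * kl - values
    # Importance ratio and clipped surrogate (a maximization objective,
    # so we negate it to obtain a loss).
    ratio = np.exp(logp_new - logp_old)
    surrogate = np.minimum(ratio * adv,
                           np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)
    policy_loss = -surrogate.mean()
    # Value function regresses toward the standardized rewards.
    value_loss = ((values - z_std) ** 2).mean()
    return policy_loss, value_loss
```

Standardizing with batch-level mean and variance (rather than per-trajectory statistics) matches the pseudocode, which pools Z(s_t, a_t) across all positions in every trajectory within the batch before computing advantages.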