VinePPO: Refining Credit Assignment in RL Training of LLMs
Authors: Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, Nicolas Le Roux
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate the effectiveness and computational efficiency of MC value estimation in VinePPO. Across multiple mathematical reasoning tasks, VinePPO consistently outperforms PPO and other credit assignment-free baselines. While its per-iteration runtime is generally slower due to MC sampling, VinePPO surpasses the peak performance of baselines with fewer gradient steps and ultimately less wall-clock time. Importantly, VinePPO achieves higher test accuracy for a given training accuracy, capturing more generalization signal per fitted training sample. |
| Researcher Affiliation | Collaboration | Mila; Microsoft Research; McGill University; Canada CIFAR AI Chair; Université de Montréal; HEC Montréal. |
| Pseudocode | No | The paper describes the methods, including PPO and VinePPO, using mathematical formulations and textual descriptions, but it does not contain a clearly labeled pseudocode block or algorithm steps formatted as code. |
| Open Source Code | Yes | Code available at https://github.com/McGill-NLP/VinePPO |
| Open Datasets | Yes | We conduct experiments on publicly available LLMs and datasets to ensure reproducibility. ... We chose the mathematical reasoning datasets MATH (Hendrycks et al., 2021), competition-level mathematical problems, and GSM8K (Cobbe et al., 2021), simpler grade-school level math word problems. |
| Dataset Splits | Yes | MATH (Hendrycks et al., 2021) ... we use the OpenAI split provided by Lightman et al. (2024), which consists of 500 problems for testing and 12,500 problems for training. We further divide the training set into 11,500 problems for training and 500 problems for validation. ... GSM8K (Cobbe et al., 2021) ... It contains 1,319 problems for testing and 7,473 for training. To create a validation set, we further split the training set into 7,100 problems for training and 373 for validation. |
| Hardware Specification | Yes | For the RhoMath 1.1B model, we utilized a node with 4 Nvidia A100 80GB GPUs to train both PPO and VinePPO. For the larger DeepSeekMath 7B model, we employed a more powerful setup, using a node with 8 Nvidia H100 80GB GPUs. Additionally, for training DeepSeekMath 7B models with the RestEM approach, we used a node with 4 Nvidia A100 80GB GPUs. |
| Software Dependencies | No | For model implementation, we utilize the Hugging Face library. Training is carried out using the DeepSpeed distributed training library, which offers efficient multi-GPU support. Specifically, we employ DeepSpeed ZeRO stage 0 (vanilla data parallelism) for RhoMath 1.1B and ZeRO stage 2 (sharded optimizer states and gradients across GPUs) for DeepSeekMath 7B. For trajectory sampling during RL training, we rely on the vLLM library (Kwon et al., 2023), which provides optimized inference for LLMs. Additionally, VinePPO leverages vLLM to generate Monte Carlo samples for value estimation. |
| Experiment Setup | Yes | Table 1: Summary of PPO hyperparameters used in the experiments. Optimizer: AdamW; Adam parameters (β1, β2): (0.9, 0.999); learning rate: 1e-6; weight decay: 0.0; max global gradient norm for clipping: 1.0; learning rate scheduler: polynomial; warm-up: 3% of training steps; # train steps for MATH: 1000 (around 8 dataset epochs); # train steps for GSM8K: 650 (around 8 dataset epochs); maximum response length: 1024 tokens; maximum sequence length: 2048 tokens for RhoMath 1.1B, 2500 tokens for DeepSeekMath 7B; # responses per prompt: 8 (search space {8, 16, 32}); # episodes per PPO step: 512 (search space {256, 512}); # prompts per PPO step: 512/8 = 64; mini-batch size: 64; # inner epochs per PPO step: 2 (search space {1, 2}); sampling temperature: 0.6 (search space {0.6, 0.8, 1.0}); discount factor γ: 1.0; GAE parameter λ: 1.0 (search space [0.95, 1.0]); KL penalty coefficient β: 1e-4 (search space {1e-1, 1e-2, 3e-3, 1e-4}); policy clipping parameter ε: 0.2; value clipping parameter ε: 0.2. |
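Since the paper provides no pseudocode block, the rows above can be tied together with a minimal sketch of the core idea: VinePPO replaces the learned value network with Monte Carlo value estimates, sampling several completions from each intermediate reasoning-step prefix and averaging their final rewards; with γ = λ = 1 (as in Table 1) and zero intermediate rewards, the per-step advantage reduces to A_t = V(s_{t+1}) − V(s_t). The function names, the toy `complete` sampler, and the `score` verifier below are hypothetical stand-ins (the paper uses vLLM sampling from the current policy and a task-specific correctness check); this is a sketch of the estimator, not the authors' implementation.

```python
def mc_value_estimate(prefix, sample_completion, reward, num_samples=9):
    """Estimate V(prefix) as the mean final reward of rollouts from prefix."""
    rollouts = [sample_completion(prefix) for _ in range(num_samples)]
    return sum(reward(r) for r in rollouts) / num_samples

def step_advantages(steps, sample_completion, reward, num_samples=9):
    """Per-step advantages with gamma = lambda = 1 and zero intermediate
    rewards, so A_t = V(s_{t+1}) - V(s_t); the terminal 'value' is the
    realized reward of the completed trajectory."""
    prefixes = [" ".join(steps[:i]) for i in range(len(steps) + 1)]
    values = [mc_value_estimate(p, sample_completion, reward, num_samples)
              for p in prefixes[:-1]]
    values.append(reward(prefixes[-1]))  # full trajectory: reward is known
    return [values[t + 1] - values[t] for t in range(len(steps))]

# Deterministic toy stand-ins so the sketch runs end to end.
complete = lambda prefix: (prefix + " correct").strip()
score = lambda traj: 1.0 if traj.endswith("correct") else 0.0
print(step_advantages(["step1", "step2"], complete, score, num_samples=4))
# -> [0.0, -1.0]: the second step forfeits value the sampler would have kept
```

The "# Responses per Prompt" and MC-sample counts in Table 1 control how many such rollouts are drawn per prefix; more samples lower the variance of each value estimate at the cost of extra inference, which is the runtime trade-off discussed in the Research Type row.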