Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
Authors: Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Moreover, experiments on text summarization, dialogue, and standard benchmarks verify the practicality and effectiveness of VPO. We conduct extensive experimental studies using TL;DR and ARC-Challenge tasks as well as standard benchmarks AlpacaEval 2.0 and MT-Bench in online and offline settings with optimistic and pessimistic bias, respectively. The results demonstrate improved empirical performance. |
| Researcher Affiliation | Collaboration | Shicong Cen (CMU); Jincheng Mei (Google); Katayoon Goshvadi (Google); Hanjun Dai (Google); Tong Yang (CMU); Sherry Yang (Google); Dale Schuurmans (Google); Yuejie Chi (CMU); Bo Dai (Google) |
| Pseudocode | Yes | Algorithm 1: VPO for online RLHF; Algorithm 2: VPO for offline RLHF |
| Open Source Code | No | The paper does not provide an explicit statement of code release or a link to a source code repository for the methodology described. |
| Open Datasets | Yes | We conduct extensive experimental studies using TL;DR and ARC-Challenge tasks as well as standard benchmarks AlpacaEval 2.0 and MT-Bench in online and offline settings with optimistic and pessimistic bias, respectively. The results demonstrate improved empirical performance. We use UltraFeedback (Cui et al., 2023) as our training dataset, which contains around 61k preference pairs of single-turn conversations. https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized |
| Dataset Splits | Yes | To construct the preference pairs for training, we start with 1,119 examples in the training set and generate three comparison pairs with each incorrect answer, resulting in a total of 3,357 preference training data. We use the ARC-Challenge test set, which contains 1,172 questions, to test the algorithms' performance. We use UltraFeedback (Cui et al., 2023) as our training dataset, which contains around 61k preference pairs of single-turn conversations. We split the 61k prompts into four chunks and follow an iterative training approach. |
| Hardware Specification | Yes | The training for LLAMA2-13B-CHAT model on 128 TPU-v4 takes around 2hrs and for FLAN-T5-XL on 64 TPU-v3 takes 1 hour. The training of the policy, PALM2-XXS on 64 TPU-v3 for 5000 steps takes around 12 hours for both online DPO and VPO. All experiments are conducted on 16x A100 GPUs. |
| Software Dependencies | No | The paper mentions specific language models and optimization methods (e.g., AdamW) but does not provide specific version numbers for software libraries or frameworks like PyTorch, TensorFlow, or Python. |
| Experiment Setup | Yes | We set β as 0.1 in DPO and τ as 1.0 in IPO. For VPO, we experiment with moving α from 0.01 to 10, choosing 1 for the reported results. We set β as 0.1 for the DPO term similar to (Guo et al., 2024). Additionally for VPO, we decrease the coefficient following α/(1 + training steps). We try different values of α and report the results for 0.1 and 0.01. We approximately solve the optimization problems by performing 20 AdamW optimization steps with learning rate 0.01 and weight decay rate 0.01 in every iteration for the online setting and 1000 steps for the offline setting. We set β = 5 for the online linear contextual bandit problem and β = 1 for all other experiments to better illustrate the performance differences. |
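The coefficient schedule and inner-loop optimizer settings quoted in the Experiment Setup row can be sketched as below. This is a minimal illustration of the reported hyperparameters, not the authors' code; the function name `vpo_coefficient` and the constant names are assumptions for readability.

```python
def vpo_coefficient(alpha: float, training_step: int) -> float:
    """Decayed regularization coefficient: alpha / (1 + training steps),
    as described in the paper's experiment setup (name is illustrative)."""
    return alpha / (1 + training_step)

# Hyperparameters reported in the table (values from the quoted text):
BETA_DPO = 0.1                   # beta for the DPO term
ALPHA_CANDIDATES = (0.1, 0.01)   # alpha values reported for VPO
ADAMW_LR = 0.01                  # learning rate for the inner AdamW steps
ADAMW_WEIGHT_DECAY = 0.01        # weight decay for AdamW
ONLINE_STEPS, OFFLINE_STEPS = 20, 1000  # inner steps per iteration

# The coefficient shrinks as training progresses:
schedule = [vpo_coefficient(ALPHA_CANDIDATES[0], t) for t in range(4)]
# step 0 -> 0.1, step 1 -> 0.05, step 3 -> 0.025
```

In each outer iteration, the optimization problem would then be solved approximately with `ONLINE_STEPS` (or `OFFLINE_STEPS`) AdamW updates at `ADAMW_LR` with `ADAMW_WEIGHT_DECAY`, using the decayed coefficient for that step.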