Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF

Authors: Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. "Moreover, experiments on text summarization, dialogue, and standard benchmarks verify the practicality and effectiveness of VPO. We conduct extensive experimental studies using TL;DR and ARC-Challenge tasks as well as the standard benchmarks AlpacaEval 2.0 and MT-Bench, in online and offline settings with optimistic and pessimistic bias, respectively. The results demonstrate improved empirical performance."
Researcher Affiliation: Collaboration. Shicong Cen (CMU), Jincheng Mei (Google), Katayoon Goshvadi (Google), Hanjun Dai (Google), Tong Yang (CMU), Sherry Yang (Google), Dale Schuurmans (Google), Yuejie Chi (CMU), Bo Dai (Google)
Pseudocode: Yes. Algorithm 1 (VPO for online RLHF); Algorithm 2 (VPO for offline RLHF)
Open Source Code: No. The paper does not provide an explicit statement of code release or a link to a source-code repository for the described methodology.
Open Datasets: Yes. "We conduct extensive experimental studies using TL;DR and ARC-Challenge tasks as well as the standard benchmarks AlpacaEval 2.0 and MT-Bench, in online and offline settings with optimistic and pessimistic bias, respectively. The results demonstrate improved empirical performance. We use UltraFeedback (Cui et al., 2023) as our training dataset, which contains around 61k preference pairs of single-turn conversations." Dataset link: https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized
Dataset Splits: Yes. "To construct the preference pairs for training, we start with 1,119 examples in the training set and generate three comparison pairs with each incorrect answer, resulting in a total of 3,357 preference training data. We use the ARC-Challenge test set, which contains 1,172 questions, to test the algorithms' performance. We use UltraFeedback (Cui et al., 2023) as our training dataset, which contains around 61k preference pairs of single-turn conversations. We split the 61k prompts into four chunks and follow an iterative training approach."
Hardware Specification: Yes. "The training for the LLAMA2-13B-CHAT model on 128 TPU-v4 takes around 2 hours, and for FLAN-T5-XL on 64 TPU-v3 takes 1 hour. The training of the policy, PALM2-XXS, on 64 TPU-v3 for 5000 steps takes around 12 hours for both online DPO and VPO. All experiments are conducted on 16x A100 GPUs."
Software Dependencies: No. The paper mentions specific language models and optimization methods (e.g., AdamW) but does not provide version numbers for software libraries or frameworks such as PyTorch, TensorFlow, or Python.
Experiment Setup: Yes. "We set β as 0.1 in DPO and τ as 1.0 in IPO. For VPO, we experiment with moving α from 0.01 to 10, choosing 1 for the reported results. We set β as 0.1 for the DPO term, similar to (Guo et al., 2024). Additionally, for VPO we decrease the coefficient following α/(1 + training steps). We try different values of α and report the results for 0.1 and 0.01. We approximately solve the optimization problems by performing 20 AdamW optimization steps with learning rate 0.01 and weight decay rate 0.01 in every iteration for the online setting, and 1000 steps for the offline setting. We set β = 5 for the online linear contextual bandit problem and β = 1 for all other experiments to better illustrate the performance differences."
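The coefficient schedule quoted above, α divided by (1 + training steps), can be sketched as a small helper. This is an illustrative reconstruction, not the authors' code; the function name `vpo_alpha_schedule` and the base value 0.1 are assumptions chosen to match the reported settings.

```python
def vpo_alpha_schedule(base_alpha: float, step: int) -> float:
    """Decay the VPO coefficient as alpha / (1 + training steps).

    Hypothetical sketch of the schedule quoted from the paper's
    experiment setup; names and defaults are illustrative.
    """
    return base_alpha / (1 + step)


# At step 0 the coefficient equals its base value; it shrinks as
# training progresses (e.g. 0.1 -> 0.01 after 9 steps).
print(vpo_alpha_schedule(0.1, 0))  # 0.1
print(vpo_alpha_schedule(0.1, 9))  # 0.01
```

With base values of 0.1 or 0.01 (the settings the paper reports), the schedule makes the value-regularization term dominate early and fade over training.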