Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF

Authors: Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. "Moreover, experiments on text summarization, dialogue, and standard benchmarks verify the practicality and effectiveness of VPO. We conduct extensive experimental studies using TL;DR and ARC-Challenge tasks as well as the standard benchmarks AlpacaEval 2.0 and MT-Bench, in online and offline settings with optimistic and pessimistic bias, respectively. The results demonstrate improved empirical performance."
Researcher Affiliation: Collaboration. Shicong Cen (CMU), Jincheng Mei (Google), Katayoon Goshvadi (Google), Hanjun Dai (Google), Tong Yang (CMU), Sherry Yang (Google), Dale Schuurmans (Google), Yuejie Chi (CMU), Bo Dai (Google)
Pseudocode: Yes. Algorithm 1 (VPO for online RLHF); Algorithm 2 (VPO for offline RLHF)
Open Source Code: No. The paper does not provide an explicit statement of code release or a link to a source-code repository for the described methodology.
Open Datasets: Yes. "We conduct extensive experimental studies using TL;DR and ARC-Challenge tasks as well as the standard benchmarks AlpacaEval 2.0 and MT-Bench, in online and offline settings with optimistic and pessimistic bias, respectively. The results demonstrate improved empirical performance. We use UltraFeedback (Cui et al., 2023) as our training dataset, which contains around 61k preference pairs of single-turn conversations." Dataset link: https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized
Dataset Splits: Yes. "To construct the preference pairs for training, we start with 1,119 examples in the training set and generate three comparison pairs with each incorrect answer, resulting in a total of 3,357 preference training data. We use the ARC-Challenge test set, which contains 1,172 questions, to test the algorithms' performance. We use UltraFeedback (Cui et al., 2023) as our training dataset, which contains around 61k preference pairs of single-turn conversations. We split the 61k prompts into four chunks and follow an iterative training approach."
Hardware Specification: Yes. "The training for the LLAMA2-13B-CHAT model on 128 TPU-v4 takes around 2 hours, and for FLAN-T5-XL on 64 TPU-v3 takes 1 hour. The training of the policy, PALM2-XXS, on 64 TPU-v3 for 5000 steps takes around 12 hours for both online DPO and VPO. All experiments are conducted on 16x A100 GPUs."
Software Dependencies: No. The paper mentions specific language models and optimization methods (e.g., AdamW) but does not provide version numbers for software libraries or frameworks such as PyTorch, TensorFlow, or Python.
Experiment Setup: Yes. "We set β as 0.1 in DPO and τ as 1.0 in IPO. For VPO, we experiment with moving α from 0.01 to 10, choosing 1 for the reported results. We set β as 0.1 for the DPO term, similar to (Guo et al., 2024). Additionally, for VPO we decrease the coefficient following α/(1 + training steps). We try different values of α and report the results for 0.1 and 0.01. We approximately solve the optimization problems by performing 20 AdamW optimization steps with learning rate 0.01 and weight decay rate 0.01 in every iteration for the online setting, and 1000 steps for the offline setting. We set β = 5 for the online linear contextual bandit problem and β = 1 for all other experiments to better illustrate the performance differences."
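The coefficient schedule quoted above, α divided by (1 + training steps), can be sketched as a small helper. This is an illustrative reconstruction, not the authors' code; the function name `vpo_alpha_schedule` and the base value 0.1 are assumptions chosen to match the reported settings.

```python
def vpo_alpha_schedule(base_alpha: float, step: int) -> float:
    """Decay the VPO coefficient as alpha / (1 + training steps).

    Hypothetical sketch of the schedule quoted from the paper's
    experiment setup; names and defaults are illustrative.
    """
    return base_alpha / (1 + step)


# At step 0 the coefficient equals its base value; it shrinks as
# training progresses (e.g. 0.1 -> 0.01 after 9 steps).
print(vpo_alpha_schedule(0.1, 0))  # 0.1
print(vpo_alpha_schedule(0.1, 9))  # 0.01
```

With base values of 0.1 or 0.01 (the settings the paper reports), the schedule makes the value-regularization term dominate early and fade over training.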