Bootstrapping Language Models with DPO Implicit Rewards

Authors: Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment. It achieves an increase of more than 8% in length-controlled win rate on AlpacaEval 2 for all the different base models that we tried, without relying on external feedback. Our code is available at https://github.com/sail-sg/dice. ... This section empirically investigates DICE. Our findings highlight several key points: (1) DICE significantly improves the model performance on the widely used leaderboard AlpacaEval 2 (Li et al., 2023b), increasing length-controlled win rate by more than 8% for all the different base models; ... (3) the two proposed techniques in Sections 3.1 and 3.2 are shown to be critical for DICE; (4) DICE demonstrates competitive performance relative to scalar reward models trained exclusively on the same seed data.
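The "implicit reward" that DICE bootstraps from is the standard DPO quantity r(x, y) = β · (log π_θ(y|x) − log π_ref(y|x)), i.e., the scaled log-probability ratio between the DPO-tuned policy and its reference model. A minimal sketch of scoring and ranking candidate responses with it (function names and the log-probability inputs are illustrative, not from the paper's codebase):

```python
def dpo_implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    policy_logp / ref_logp are the total log-probabilities of a response
    under the DPO-tuned policy and the frozen reference model.
    """
    return beta * (policy_logp - ref_logp)


def rank_responses(candidates, beta: float = 0.1):
    """Rank (text, policy_logp, ref_logp) candidates by implicit reward, best first."""
    scored = [(dpo_implicit_reward(p, r, beta), text) for text, p, r in candidates]
    return sorted(scored, reverse=True)
```

A response the tuned policy prefers more strongly than the reference does receives a higher implicit reward, which is what lets a DPO-trained model act as its own annotator.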
Researcher Affiliation Collaboration Changyu Chen 1, Zichen Liu 2,3, Chao Du 2, Tianyu Pang 2, Qian Liu 2, Arunesh Sinha 4, Pradeep Varakantham 1, Min Lin 2; 1 Singapore Management University, 2 Sea AI Lab, Singapore, 3 National University of Singapore, 4 Rutgers University
Pseudocode Yes Algorithm 1 Bootstrapping with DPO Implicit Rewards (DICE)
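Algorithm 1 iterates: sample candidate responses from the current policy, score them with the DPO implicit reward, build preference pairs, and run another round of DPO training. A minimal sketch of one round under hypothetical helper signatures (the actual procedure is defined by Algorithm 1 and the released code):

```python
def dice_round(policy, ref, prompts, sample_fn, score_fn, dpo_train_fn, k=16):
    """One DICE bootstrapping round (hypothetical helpers, illustrative only).

    sample_fn(policy, prompt, k)            -> list of k candidate responses
    score_fn(policy, ref, prompt, response) -> DPO implicit reward (a float)
    dpo_train_fn(policy, ref, pairs)        -> policy after one round of DPO
    """
    pairs = []
    for x in prompts:
        ys = sample_fn(policy, x, k)
        # Sort candidates by implicit reward; pair the best with the worst.
        ys_sorted = sorted(ys, key=lambda y: score_fn(policy, ref, x, y))
        pairs.append((x, ys_sorted[-1], ys_sorted[0]))  # (prompt, chosen, rejected)
    return dpo_train_fn(policy, ref, pairs)
```

Repeating this loop (two rounds in the paper's experiments) is what makes the method a bootstrap: each round's policy generates and labels the data for the next.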
Open Source Code Yes Our code is available at https://github.com/sail-sg/dice.
Open Datasets Yes Both models are trained following the pipeline of Zephyr (Tunstall et al., 2023) on the UltraFeedback (Cui et al., 2023) dataset.
Dataset Splits No We randomly sample a subset of around 10k preference pairs from UltraFeedback as the offline dataset D_offline for our fine-tuning experiments. ... In each round, we train the model for 300 steps on a preference dataset with 9.6k preference pairs (either a solely generated dataset, or a mixture of the generated dataset and the offline preference dataset). ... We evaluate our method by AlpacaEval 2 (Li et al., 2023b) and Arena-Hard (Li et al., 2024).
Hardware Specification Yes All experiments are conducted on 8 Nvidia A100 GPUs.
Software Dependencies No The scalar reward model is trained using the OpenRLHF (Hu et al., 2024) framework, adhering to the recommended training parameters.
Experiment Setup Yes Response Generation and Dataset Construction. At the start of each round, we sample responses from the current policy, with temperature T = 0.9, p = 1.0 for the Llama3 setting and T = 0.7, p = 0.9 for the Zephyr setting. We sample with different random seeds to get K = 16 diverse responses for each prompt. ... Training Details. All experiments are conducted on 8 Nvidia A100 GPUs. For DICE, we trained two rounds in total. In each round, we train the model for 300 steps on a preference dataset with 9.6k preference pairs (either a solely generated dataset, or a mixture of the generated dataset and the offline preference dataset). The global training batch size is set to 32 and the learning rate is 5e-7 with a constant schedule and a warm-up of 50 steps. We tuned β ∈ {0.01, 0.1} based on the model performance on AlpacaEval 2 for each method and model separately. For our approach, we additionally tuned the experience replay ratio γ using cross-validation to ensure fair assessment.
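The "mixture of the generated dataset and the offline preference dataset" is the experience-replay component controlled by γ. A minimal sketch of one plausible mixing scheme, assuming γ is the fraction of the fixed-size round dataset drawn from the offline pairs (the paper and repo define the exact semantics):

```python
import random


def mix_with_replay(generated_pairs, offline_pairs, gamma, seed=0):
    """Build a round's training set of len(generated_pairs) preference pairs.

    gamma (assumed semantics): fraction of the final dataset taken from the
    offline seed data; the remainder is freshly generated pairs.
    """
    n_total = len(generated_pairs)
    n_off = int(gamma * n_total)
    rng = random.Random(seed)
    offline_part = rng.sample(offline_pairs, min(n_off, len(offline_pairs)))
    generated_part = rng.sample(generated_pairs, n_total - len(offline_part))
    return generated_part + offline_part
```

With γ = 0 this reduces to the "solely generated dataset" case; larger γ anchors each round more strongly to the original human-labeled seed data.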