Bootstrapping Language Models with DPO Implicit Rewards

Authors: Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment. It achieves an increase of more than 8% in length-controlled win rate on AlpacaEval 2 for all the different base models that we tried, without relying on external feedback. Our code is available at https://github.com/sail-sg/dice. ... This section empirically investigates DICE. Our findings highlight several key points: (1) DICE significantly improves the model performance on the widely used leaderboard AlpacaEval 2 (Li et al., 2023b), increasing length-controlled win rate by more than 8% for all the different base models; ... (3) the two proposed techniques in Sections 3.1 and 3.2 are shown to be critical for DICE; (4) DICE demonstrates competitive performance relative to scalar reward models trained exclusively on the same seed data.
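The "implicit reward" that DICE bootstraps from is the standard DPO quantity r(x, y) = β · (log π_θ(y|x) − log π_ref(y|x)), i.e., the scaled log-probability ratio between the DPO-tuned policy and its reference model. A minimal sketch of scoring and ranking candidate responses with it (function names and the log-probability inputs are illustrative, not from the paper's codebase):

```python
def dpo_implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    policy_logp / ref_logp are the total log-probabilities of a response
    under the DPO-tuned policy and the frozen reference model.
    """
    return beta * (policy_logp - ref_logp)


def rank_responses(candidates, beta: float = 0.1):
    """Rank (text, policy_logp, ref_logp) candidates by implicit reward, best first."""
    scored = [(dpo_implicit_reward(p, r, beta), text) for text, p, r in candidates]
    return sorted(scored, reverse=True)
```

A response the tuned policy prefers more strongly than the reference does receives a higher implicit reward, which is what lets a DPO-trained model act as its own annotator.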
Researcher Affiliation Collaboration Changyu Chen 1, Zichen Liu 2,3, Chao Du 2, Tianyu Pang 2, Qian Liu 2, Arunesh Sinha 4, Pradeep Varakantham 1, Min Lin 2; 1 Singapore Management University, 2 Sea AI Lab, Singapore, 3 National University of Singapore, 4 Rutgers University
Pseudocode Yes Algorithm 1 Bootstrapping with DPO Implicit Rewards (DICE)
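Algorithm 1 iterates: sample candidate responses from the current policy, score them with the DPO implicit reward, build preference pairs, and run another round of DPO training. A minimal sketch of one round under hypothetical helper signatures (the actual procedure is defined by Algorithm 1 and the released code):

```python
def dice_round(policy, ref, prompts, sample_fn, score_fn, dpo_train_fn, k=16):
    """One DICE bootstrapping round (hypothetical helpers, illustrative only).

    sample_fn(policy, prompt, k)            -> list of k candidate responses
    score_fn(policy, ref, prompt, response) -> DPO implicit reward (a float)
    dpo_train_fn(policy, ref, pairs)        -> policy after one round of DPO
    """
    pairs = []
    for x in prompts:
        ys = sample_fn(policy, x, k)
        # Sort candidates by implicit reward; pair the best with the worst.
        ys_sorted = sorted(ys, key=lambda y: score_fn(policy, ref, x, y))
        pairs.append((x, ys_sorted[-1], ys_sorted[0]))  # (prompt, chosen, rejected)
    return dpo_train_fn(policy, ref, pairs)
```

Repeating this loop (two rounds in the paper's experiments) is what makes the method a bootstrap: each round's policy generates and labels the data for the next.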
Open Source Code Yes Our code is available at https://github.com/sail-sg/dice.
Open Datasets Yes Both models are trained following the pipeline of Zephyr (Tunstall et al., 2023) on the UltraFeedback (Cui et al., 2023) dataset.
Dataset Splits No We randomly sample a subset of around 10k preference pairs from UltraFeedback as the offline dataset D_offline for our fine-tuning experiments. ... In each round, we train the model for 300 steps on a preference dataset with 9.6k preference pairs (either a solely generated dataset, or a mixture of the generated dataset and the offline preference dataset). ... We evaluate our method by AlpacaEval 2 (Li et al., 2023b) and Arena-Hard (Li et al., 2024).
Hardware Specification Yes All experiments are conducted on 8 Nvidia A100 GPUs.
Software Dependencies No The scalar reward model is trained using the OpenRLHF (Hu et al., 2024) framework, adhering to the recommended training parameters.
Experiment Setup Yes Response Generation and Dataset Construction. At the start of each round, we sample responses from the current policy, with temperature T = 0.9, p = 1.0 for the Llama3 setting and T = 0.7, p = 0.9 for the Zephyr setting. We sample with different random seeds to get K = 16 diverse responses for each prompt. ... Training Details. All experiments are conducted on 8 Nvidia A100 GPUs. For DICE, we trained two rounds in total. In each round, we train the model for 300 steps on a preference dataset with 9.6k preference pairs (either a solely generated dataset, or a mixture of the generated dataset and the offline preference dataset). The global training batch size is set to 32 and the learning rate is 5e-7 with a constant schedule and a warm-up of 50 steps. We tuned β ∈ {0.01, 0.1} based on the model performance on AlpacaEval 2 for each method and model separately. For our approach, we additionally tuned the experience replay ratio γ using cross-validation to ensure fair assessment.
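The "mixture of the generated dataset and the offline preference dataset" is the experience-replay component controlled by γ. A minimal sketch of one plausible mixing scheme, assuming γ is the fraction of the fixed-size round dataset drawn from the offline pairs (the paper and repo define the exact semantics):

```python
import random


def mix_with_replay(generated_pairs, offline_pairs, gamma, seed=0):
    """Build a round's training set of len(generated_pairs) preference pairs.

    gamma (assumed semantics): fraction of the final dataset taken from the
    offline seed data; the remainder is freshly generated pairs.
    """
    n_total = len(generated_pairs)
    n_off = int(gamma * n_total)
    rng = random.Random(seed)
    offline_part = rng.sample(offline_pairs, min(n_off, len(offline_pairs)))
    generated_part = rng.sample(generated_pairs, n_total - len(offline_part))
    return generated_part + offline_part
```

With γ = 0 this reduces to the "solely generated dataset" case; larger γ anchors each round more strongly to the original human-labeled seed data.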