Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization

Authors: Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jiawei Chen, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations demonstrate that Dr. DPO substantially improves the quality of generated text and response accuracy on preference datasets, showing enhanced performance in both noisy and noise-free settings.
Researcher Affiliation | Collaboration | 1 University of Science and Technology of China; 2 Alibaba Group; 3 Zhejiang University; 4 MoE Key Lab of BIPC, University of Science and Technology of China
Pseudocode | Yes | Figure 9: Pseudocode for our proposed Dr. DPO, as well as the original DPO objective.
Open Source Code | Yes | The code is available at https://github.com/junkangwu/Dr_DPO.
Open Datasets | Yes | We conduct experiments on two datasets: IMDB (Maas et al., 2011) and Anthropic HH (Bai et al., 2022).
Dataset Splits | No | The paper mentions training and test sets and introduces noise into the training data, for example: "To test the model's resilience to noise, we introduced random inversions between selected and rejected responses in the training data at varying noise levels, specifically with probabilities of 10%, 20%, 30%, and 40%." and "The Win-Rate computation is specifically designed for the single-turn dialogue portion of the HH dataset's test subset." However, it does not explicitly provide specific percentages, sample counts, or citations to predefined train/test/validation splits for the datasets used.
Hardware Specification | Yes | We carried out all computational tasks on a suite of four 80GB A100 GPUs.
Software Dependencies | No | The paper mentions using the "Pythia 2.8B model", "GPT-2-large", and "SiEBERT", but does not provide specific version numbers for the underlying software libraries or dependencies such as PyTorch, TensorFlow, or Python.
Experiment Setup | Yes | Our training regimen was in line with the DPO-established protocol (Rafailov et al., 2023a). We built upon the Pythia 2.8B model, as described in (Biderman et al., 2023), to develop our Supervised Fine-Tuning (SFT) model. The SFT model was fine-tuned on the Anthropic HH dataset over the course of one epoch, employing a batch size of 64 and a learning rate of 5 × 10^-7. In addition, we further refined the model using the Anthropic HH dataset and the DPO loss function (or other baseline approaches) through an additional epoch of fine-tuning. To test the model's resilience to noise, we introduced random inversions between selected and rejected responses in the training data at varying noise levels, specifically with probabilities of 10%, 20%, 30%, and 40%. Throughout these experiments, we consistently set the β parameter to 0.1 and adopted the Kullback-Leibler (KL) divergence as the metric for ϕ-divergence.
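The two experimental ingredients described above — the per-pair DPO loss with β = 0.1 and the noise-injection scheme that flips chosen/rejected responses with a fixed probability — can be sketched as below. This is a minimal illustration, not the authors' implementation; the function names and the scalar (non-batched) form are my own simplification, and the Dr. DPO robustification itself is omitted.

```python
import math
import random

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Inputs are sequence log-probabilities under the policy and the frozen
    reference (SFT) model; beta=0.1 matches the paper's setting.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written out explicitly
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def inject_pairwise_noise(pairs, flip_prob, seed=0):
    """Flip (chosen, rejected) -> (rejected, chosen) with probability flip_prob.

    Mirrors the paper's noise protocol with flip_prob in {0.1, 0.2, 0.3, 0.4}.
    """
    rng = random.Random(seed)
    return [(r, c) if rng.random() < flip_prob else (c, r)
            for (c, r) in pairs]
```

When the policy and reference agree exactly, the margin is zero and each pair contributes `-log(0.5) = log 2` to the loss, which is the standard DPO starting point before the policy learns to widen the preference margin.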