Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Authors: Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental and Theoretical | Through theory and experiments, we characterize mechanisms driving likelihood displacement, demonstrate that it can lead to surprising failures in alignment, and provide preventative guidelines. ... Our experiments cover models of different families and scales, including OLMo-1B (Groeneveld et al., 2024), Gemma-2B (Team et al., 2024), and Llama-3-8B (Dubey et al., 2024). ... Section 3 demonstrates that likelihood displacement can occur and be catastrophic independently of these factors, even when training over just a single prompt whose responses contain a single token each. ... Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. |
| Researcher Affiliation | Academia | Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin. Princeton Language and Intelligence, Princeton University; Princeton ORFE. |
| Pseudocode | No | The paper describes theoretical analysis through theorems and proofs (e.g., Theorem 1, Theorem 4, Theorem 6), but does not contain explicitly labeled pseudocode or algorithm blocks. It outlines technical approaches and derivations rather than procedural steps formatted as code. |
| Open Source Code | Yes | Our code is available at https://github.com/princeton-nlp/unintentional-unalignment. |
| Open Datasets | Yes | The experiments are based on the Persona dataset (Perez et al., 2022)... We use the UltraFeedback and AlpacaFarm datasets and the OLMo-1B, Gemma-2B, and Llama-3-8B models... We used the base portion of SORRY-Bench (Xie et al., 2024b). |
| Dataset Splits | Yes | Lastly, we partition the datasets into training and test sets according to an 85%/15% split, and train the language models via DPO over their respective training sets. |
| Hardware Specification | Yes | Hardware. Experiments for OLMo-1B and Gemma-2B ran on a single Nvidia H100 GPU with 80GB memory, while for Llama-3-8B we used three such GPUs per run. (Section K.1)... Hardware. Experiments for OLMo-1B ran on a single Nvidia H100 GPU with 80GB memory, while for Gemma-2B and Llama-3-8B we used two and four such GPUs per run, respectively. (Section K.2)... Hardware. Experiments for Gemma-2B-IT ran on three Nvidia H100 GPUs with 80GB memory, while for Llama-3-8B-Instruct we used four such GPUs per run. (Section K.3) |
| Software Dependencies | No | Code for reproducing our results, based on the PyTorch (Paszke et al., 2017) and Hugging Face (Wolf et al., 2019) frameworks, can be found at https://github.com/princeton-nlp/unintentional-unalignment. While the frameworks are mentioned, specific version numbers for PyTorch and Hugging Face Transformers are not provided. |
| Experiment Setup | Yes | In the initial SFT phase, we minimized the cross entropy loss over all 1000 prompts for one epoch, using the RMSProp optimizer (Hinton et al., 2012) with a learning rate of 1e-7 and batch size of 32. For DPO, we performed 100 training steps using the RMSProp optimizer over a single prompt in each run, with a learning rate of 1e-7, and set the KL coefficient to 0.1, in line with Rafailov et al. (2023); Tajwar et al. (2024); Xu et al. (2024b); Dubey et al. (2024). |
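The 85%/15% train/test partition reported above can be reproduced with a few lines of standard-library Python; the seed and the use of index lists are illustrative choices, not taken from the authors' code:

```python
import random

def split_85_15(examples, seed=0):
    # Illustrative 85%/15% partition; seed=0 is an assumption, not the paper's.
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(0.85 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Example: partition 1000 example indices into train and test sets.
train, test = split_85_15(range(1000))
```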
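For readers checking the DPO setup, the per-example objective with the reported KL coefficient (beta = 0.1) can be sketched as follows; the function and variable names are illustrative, and the log-probabilities would come from the policy and frozen reference models:

```python
import math

BETA = 0.1  # KL coefficient used in the setup above

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=BETA):
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At initialization the policy equals the reference model, so every margin
# is zero and the loss starts at log(2).
initial_loss = dpo_loss(-5.0, -7.0, -5.0, -7.0)
```

Widening the preferred-over-dispreferred margin relative to the reference lowers the loss, which is the gradient signal the paper analyzes when characterizing likelihood displacement.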
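The CHES score mentioned in the Research Type row can be sketched as below, assuming (per our reading of the paper) it is the inner product between the summed hidden embeddings of the preferred and dispreferred response tokens, minus the squared norm of the preferred sum; the toy vectors stand in for a model's actual hidden states:

```python
def ches_score(preferred_hiddens, dispreferred_hiddens):
    """Sketch of CHES: <sum_pref, sum_disp> - ||sum_pref||^2,
    where each argument is a list of per-token embedding vectors."""
    dim = len(preferred_hiddens[0])
    sum_pref = [sum(h[d] for h in preferred_hiddens) for d in range(dim)]
    sum_disp = [sum(h[d] for h in dispreferred_hiddens) for d in range(dim)]
    dot = sum(p * q for p, q in zip(sum_pref, sum_disp))
    sq_norm = sum(p * p for p in sum_pref)
    return dot - sq_norm

# Identical responses yield a score of exactly zero; higher scores flag
# training samples more prone to likelihood displacement.
zero = ches_score([[1.0, 2.0]], [[1.0, 2.0]])
```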