Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Authors: Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental and Theoretical | Through theory and experiments, we characterize mechanisms driving likelihood displacement, demonstrate that it can lead to surprising failures in alignment, and provide preventative guidelines. ... Our experiments cover models of different families and scales, including OLMo-1B (Groeneveld et al., 2024), Gemma-2B (Team et al., 2024), and Llama-3-8B (Dubey et al., 2024). ... Section 3 demonstrates that likelihood displacement can occur and be catastrophic independently of these factors, even when training over just a single prompt whose responses contain a single token each. ... Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. |
| Researcher Affiliation | Academia | Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin. Princeton Language and Intelligence, Princeton University; Princeton ORFE. |
| Pseudocode | No | The paper describes theoretical analysis through theorems and proofs (e.g., Theorem 1, Theorem 4, Theorem 6), but does not contain explicitly labeled pseudocode or algorithm blocks. It outlines technical approaches and derivations rather than procedural steps formatted as code. |
| Open Source Code | Yes | Our code is available at https://github.com/princeton-nlp/unintentional-unalignment. |
| Open Datasets | Yes | The experiments are based on the Persona dataset (Perez et al., 2022)... We use the UltraFeedback and AlpacaFarm datasets and the OLMo-1B, Gemma-2B, and Llama-3-8B models... We used the base portion of SORRY-Bench (Xie et al., 2024b). |
| Dataset Splits | Yes | Lastly, we partition the datasets into training and test sets according to an 85%/15% split, and train the language models via DPO over their respective training sets. |
| Hardware Specification | Yes | Hardware. Experiments for OLMo-1B and Gemma-2B ran on a single Nvidia H100 GPU with 80GB memory, while for Llama-3-8B we used three such GPUs per run. (Section K.1)... Hardware. Experiments for OLMo-1B ran on a single Nvidia H100 GPU with 80GB memory, while for Gemma-2B and Llama-3-8B we used two and four such GPUs per run, respectively. (Section K.2)... Hardware. Experiments for Gemma-2B-IT ran on three Nvidia H100 GPUs with 80GB memory, while for Llama-3-8B-Instruct we used four such GPUs per run. (Section K.3) |
| Software Dependencies | No | Code for reproducing our results, based on the PyTorch (Paszke et al., 2017) and Hugging Face (Wolf et al., 2019) frameworks, can be found at https://github.com/princeton-nlp/unintentional-unalignment. While the frameworks are mentioned, specific version numbers for PyTorch and Hugging Face Transformers are not provided. |
| Experiment Setup | Yes | In the initial SFT phase, we minimized the cross entropy loss over all 1000 prompts for one epoch, using the RMSProp optimizer (Hinton et al., 2012) with a learning rate of 1e-7 and batch size of 32. For DPO, we performed 100 training steps using the RMSProp optimizer over a single prompt in each run, with a learning rate of 1e-7, and set the KL coefficient to 0.1, in line with Rafailov et al. (2023); Tajwar et al. (2024); Xu et al. (2024b); Dubey et al. (2024). |
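The 85%/15% train/test partition reported above can be reproduced with a few lines of standard-library Python; the seed and the use of index lists are illustrative choices, not taken from the authors' code:

```python
import random

def split_85_15(examples, seed=0):
    # Illustrative 85%/15% partition; seed=0 is an assumption, not the paper's.
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(0.85 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Example: partition 1000 example indices into train and test sets.
train, test = split_85_15(range(1000))
```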
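For readers checking the DPO setup, the per-example objective with the reported KL coefficient (beta = 0.1) can be sketched as follows; the function and variable names are illustrative, and the log-probabilities would come from the policy and frozen reference models:

```python
import math

BETA = 0.1  # KL coefficient used in the setup above

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=BETA):
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At initialization the policy equals the reference model, so every margin
# is zero and the loss starts at log(2).
initial_loss = dpo_loss(-5.0, -7.0, -5.0, -7.0)
```

Widening the preferred-over-dispreferred margin relative to the reference lowers the loss, which is the gradient signal the paper analyzes when characterizing likelihood displacement.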
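The CHES score mentioned in the Research Type row can be sketched as below, assuming (per our reading of the paper) it is the inner product between the summed hidden embeddings of the preferred and dispreferred response tokens, minus the squared norm of the preferred sum; the toy vectors stand in for a model's actual hidden states:

```python
def ches_score(preferred_hiddens, dispreferred_hiddens):
    """Sketch of CHES: <sum_pref, sum_disp> - ||sum_pref||^2,
    where each argument is a list of per-token embedding vectors."""
    dim = len(preferred_hiddens[0])
    sum_pref = [sum(h[d] for h in preferred_hiddens) for d in range(dim)]
    sum_disp = [sum(h[d] for h in dispreferred_hiddens) for d in range(dim)]
    dot = sum(p * q for p, q in zip(sum_pref, sum_disp))
    sq_norm = sum(p * p for p in sum_pref)
    return dot - sq_norm

# Identical responses yield a score of exactly zero; higher scores flag
# training samples more prone to likelihood displacement.
zero = ches_score([[1.0, 2.0]], [[1.0, 2.0]])
```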