A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement
Authors: Hui Yuan, Yifan Zeng, Yue Wu, Huazheng Wang, Mengdi Wang, Liu Leqi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical implications of our framework further extend to explaining important differences in the training dynamics of various preference optimization algorithms and suggesting future directions for improvement. ... We validate these theoretical insights empirically (Section 4.2). |
| Researcher Affiliation | Academia | Hui Yuan (1), Yifan Zeng (2), Yue Wu (3), Huazheng Wang (4), Mengdi Wang (5), Liu Leqi (6); (1,3,5) Princeton University; (2,4) Oregon State University; (6) The University of Texas at Austin |
| Pseudocode | No | The paper describes theoretical derivations and empirical validations of existing methods but does not present any new algorithms in pseudocode or algorithm block format. |
| Open Source Code | Yes | Code for the paper can be found at https://github.com/HumainLab/Understand_MarginPO. |
| Open Datasets | Yes | We conduct experiments on the TL;DR dataset (Stiennon et al., 2020) to showcase the widely-existing phenomenon that the chosen and rejected log-probabilities have synchronized changes during preference optimization. In addition, Figure 1 depicts how different margin-based preference optimization algorithms influence the log-probability of chosen and rejected responses. |
| Dataset Splits | No | The paper mentions using the TL;DR dataset and a specially curated sentiment dataset, and that log-probabilities are averaged on the evaluation set. However, it does not provide specific percentages, sample counts, or detailed methodologies for how the datasets were split into training, validation, and test sets, which is needed for reproduction. |
| Hardware Specification | Yes | The training was performed on a hardware setup consisting of two NVIDIA H100 GPUs, providing substantial computational power for the training process. |
| Software Dependencies | Yes | Our experiments were implemented using TRL version 0.11.0. |
| Experiment Setup | Yes | To optimize the training process, we applied Low-Rank Adaptation (LoRA) with a rank of 64 to both models. The learning rate was set at 5 × 10⁻⁶ for all RLHF training. |
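The "synchronized changes" in chosen and rejected log-probabilities that the report quotes can be illustrated with the gradient structure of a margin-based loss. For a DPO-style objective L = -log σ(β(m_c − m_r)), where m_c and m_r stand for the chosen and rejected (reference-adjusted) log-probability terms, the gradients with respect to the two margins are equal in magnitude and opposite in sign, so any parameters shared between the two responses receive entangled updates. The sketch below is a simplified pure-Python illustration, not the paper's derivation; the function name and the β value are chosen for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dpo_margin_grads(m_c, m_r, beta=0.1):
    """Gradients of L = -log(sigmoid(beta * (m_c - m_r)))
    with respect to the chosen margin m_c and rejected margin m_r.

    dL/dm_c = -beta * (1 - sigmoid(beta * (m_c - m_r)))
    dL/dm_r = +beta * (1 - sigmoid(beta * (m_c - m_r)))
    """
    s = sigmoid(beta * (m_c - m_r))
    dL_dmc = -beta * (1.0 - s)  # descending this pushes the chosen log-prob up
    dL_dmr = beta * (1.0 - s)   # descending this pushes the rejected log-prob down
    return dL_dmc, dL_dmr


g_c, g_r = dpo_margin_grads(m_c=0.5, m_r=0.2)
# The two gradients are exact opposites: g_c + g_r == 0, so the loss only
# "sees" the margin m_c - m_r, and shared parameters couple the two
# log-probabilities -- the entanglement the paper analyzes.
```

Because the loss depends on the margin alone, both log-probabilities can drift in the same direction while the margin still grows, which is consistent with the synchronized log-probability curves the report describes on the TL;DR experiments.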