A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

Authors: Hui Yuan, Yifan Zeng, Yue Wu, Huazheng Wang, Mengdi Wang, Liu Leqi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirical implications of our framework further extend to explaining important differences in the training dynamics of various preference optimization algorithms and suggesting future directions for improvement. ... We validate these theoretical insights empirically (Section 4.2)."
Researcher Affiliation | Academia | "Hui Yuan 1, Yifan Zeng 2, Yue Wu 3, Huazheng Wang 4, Mengdi Wang 5, Liu Leqi 6; 1,3,5 Princeton University; 2,4 Oregon State University; 6 The University of Texas at Austin"
Pseudocode | No | The paper describes theoretical derivations and empirical validations of existing methods but does not present any new algorithms in pseudocode or an algorithm-block format.
Open Source Code | Yes | "Code for the paper can be found at https://github.com/HumainLab/UnderstandMarginPO."
Open Datasets | Yes | "We conduct experiments on the TL;DR dataset (Stiennon et al., 2020) to showcase the widely-existing phenomenon that the chosen and rejected log-probabilities have synchronized changes during preference optimization." In addition, Figure 1 depicts how different margin-based preference optimization algorithms influence the log-probability of chosen and rejected responses.
Dataset Splits | No | The paper mentions using the TL;DR dataset and a specially curated sentiment dataset, and notes that log-probabilities are averaged on the evaluation set. However, it does not provide specific percentages, sample counts, or a detailed methodology for how the datasets were split into training, validation, and test sets, which would be needed for reproduction.
Hardware Specification | Yes | "The training was performed on a hardware setup consisting of two NVIDIA H100 GPUs, providing substantial computational power for the training process."
Software Dependencies | Yes | "Our experiments were implemented using TRL version 0.11.0."
Experiment Setup | Yes | "To optimize the training process, we applied Low-Rank Adaptation (LoRA) with a rank of 64 to both models. The learning rate was set at 5 × 10^-6 for all RLHF training."
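The setup row above pins down only a few hyperparameters (LoRA rank 64, learning rate 5e-6, TRL 0.11.0). A minimal sketch collecting them into one place follows; the class name `PrefOptConfig` is an assumption for illustration, since the paper's training script is not reproduced here, and no model or dataset paths are fixed.

```python
from dataclasses import dataclass


@dataclass
class PrefOptConfig:
    """Hyperparameters quoted in the reproducibility report.

    Only the values stated in the report are taken as given; the
    class itself is a hypothetical container, not the authors' code.
    """
    lora_rank: int = 64          # LoRA rank applied to both models
    learning_rate: float = 5e-6  # used for all RLHF training
    trl_version: str = "0.11.0"  # TRL release used in the experiments


cfg = PrefOptConfig()
print(cfg.lora_rank, cfg.learning_rate)
```

Collecting the reported values in a dataclass like this makes it straightforward to pass them into a TRL-style trainer configuration when attempting a reproduction.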