Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment

Authors: Mingzhi Wang, Chengdong Ma, Qizhi Chen, Linjian Meng, Yang Han, Jiancong Xiao, Zhaowei Zhang, Jing Huo, Weijie Su, Yaodong Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct comprehensive experiments to validate the effectiveness of our proposed MPO algorithm. We start by focusing on safety as the primary alignment metric. ... Empirical results demonstrate that MPO can significantly enhance the performance of LLMs, highlighting the potential of self-play methods in alignment. ... We present our cost model evaluation results in Table 1, where lower cost means safer outputs. The GPT-4o evaluation results, shown in Table 2, reveal that our method substantially improves the model’s win rate compared to the SFT model. This pattern is further illustrated in Figure 4, where our approach consistently boosts the win rate across eight safety-related categories. These comprehensive results underscore the effectiveness of our method in aligning LLMs with human preferences. Additionally, we also investigate the case where self-play is omitted, and the results show significantly poorer performance compared to the self-play setting. ... To further evaluate the effectiveness of each individual component of MPO, we conduct an ablation study shown in Figure 5.
Researcher Affiliation | Collaboration | 1 Institute for Artificial Intelligence, Peking University; 2 Beijing Academy of Artificial Intelligence; 3 National Key Laboratory for Novel Software Technology, Nanjing University; 4 China Telecom; 5 University of Pennsylvania
Pseudocode | Yes | The pseudocode of MPO is provided in Algorithm 1. ... The pseudocode of MPO-RT is provided in Algorithm 2.
Open Source Code | No | The paper mentions building on the 'official codebase of Safe-RLHF' (Dai et al., 2023), the 'official codebase' of Dong et al. (2024), and the 'implementation provided in OpenRLHF (Hu et al., 2024)'. These are third-party codebases used by the authors; the paper does not explicitly state that the authors release their own source code for MPO.
Open Datasets | Yes | Following the methodology of Dai et al. (2023), we first perform supervised fine-tuning on the Alpaca (Taori et al., 2023) dataset, ... train our preference model on a mixture of widely-used open-source preference datasets1, ... prompts sourced from the PKU-SafeRLHF (Ji et al., 2024) dataset and the HH-Harmless section of the Anthropic Helpful and Harmless dialogue (HH) dataset (Bai et al., 2022). ... SFT-OpenHermes-2.5-Standard2 dataset. The obtained SFT model serves as a good foundation for our experiments. Next, we train a preference model based on a mixture of open-source preference datasets3. We then fine-tune the SFT model using 30K prompts selected from a collection of the UltraFeedback (Cui et al., 2023), HelpSteer (Wang et al., 2023), OpenOrca (Lian et al., 2023), UltraInteract (Yuan et al., 2024), and Capybara (Daniele & Suphavadeepprasit, 2023) datasets for four rounds of self-play. ... 1 https://huggingface.co/datasets/weqweasdas/preference_dataset_mixture2_and_safe_pku ... 2 https://huggingface.co/datasets/RLHFlow/SFT-OpenHermes-2.5-Standard ... 3 https://huggingface.co/datasets/hendrydong/preference_700K ... 13 https://huggingface.co/datasets/OpenRLHF/prompt-collection-v0.1
Dataset Splits | No | The paper states: 'These prompts are equally divided and used over three rounds of self-play.' and 'These prompts are evenly split across four rounds of self-play.' This describes how prompts are distributed across self-play rounds, but it does not specify explicit training, validation, or test splits (e.g., percentages or exact counts for each split used to train or evaluate the models). The selection of '30K prompts' is mentioned, but not their division into standard data splits.
Hardware Specification | Yes | These experiments are conducted on an 8 A800-40GB GPU server. (Appendix B) ... These experiments are conducted on an 8 A800-80GB GPU server. (Appendix C)
Software Dependencies | No | The paper refers to the 'official codebase' of Safe-RLHF and the 'implementation provided in OpenRLHF', but it does not specify version numbers for any software libraries (e.g., PyTorch, Hugging Face Transformers) or programming languages (e.g., Python). Table 4 lists 'adam_torch_fused' as the optimizer, but no associated library version is given.
Experiment Setup | Yes | The hyper-parameters used during training are listed in Table 4. ... The hyper-parameters used during the SFT training process are presented in Table 6. ... The hyper-parameters used for training on both datasets are detailed in Table 7. ... For PPO, we follow the official default hyper-parameters, with a few adjustments to align the settings with those of MPO: specifically, we set the actor learning rate to 5e-7, the max length to 1024, the batch size to 64, the ptx coefficient to 0, and the critic learning rate to 9e-6. For Iterative DPO, we use the implementation provided in OpenRLHF (Hu et al., 2024). The default hyper-parameters are used, with the max length adjusted to 1024 to match the MPO setup.
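The PPO adjustments quoted in the Experiment Setup entry above can be collected into a small override-on-defaults config. This is a minimal sketch: the key names (`actor_learning_rate`, `ptx_coefficient`, etc.) and the `apply_overrides` helper are illustrative assumptions, not the flag names used by the Safe-RLHF or OpenRLHF codebases.

```python
# Hypothetical key names collecting the PPO overrides quoted above;
# the actual Safe-RLHF/OpenRLHF configs use their own flag names.
ppo_overrides = {
    "actor_learning_rate": 5e-7,   # actor LR from the quoted setup
    "critic_learning_rate": 9e-6,  # critic LR
    "max_length": 1024,            # matches the MPO max length
    "batch_size": 64,
    "ptx_coefficient": 0.0,        # disables the pretraining-loss mixing term
}

def apply_overrides(defaults, overrides):
    """Return a config that keeps official defaults except where overridden."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged
```

Keeping the overrides separate from the untouched defaults makes it easy to see exactly which settings were changed to match MPO.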
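The Dataset Splits entry above notes that prompts are "evenly split" across self-play rounds. That per-round division can be sketched as a simple chunking helper; the function name and the remainder-handling policy here are assumptions for illustration, not taken from the paper.

```python
def split_into_rounds(prompts, n_rounds):
    """Evenly divide prompts across self-play rounds; earlier rounds
    absorb any remainder so every prompt is used exactly once."""
    base, rem = divmod(len(prompts), n_rounds)
    rounds, start = [], 0
    for r in range(n_rounds):
        size = base + (1 if r < rem else 0)  # spread leftovers over early rounds
        rounds.append(prompts[start:start + size])
        start += size
    return rounds

# e.g. 30K prompts over four rounds of self-play -> 7500 prompts per round
sizes = [len(chunk) for chunk in split_into_rounds(list(range(30_000)), 4)]
print(sizes)  # [7500, 7500, 7500, 7500]
```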
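The GPT-4o evaluation quoted in the Research Type entry reports a win rate against the SFT model. As a hedged sketch of that metric, the tally below assumes a hypothetical list of per-prompt judge verdicts and a ties-count-half convention; the paper does not specify its tie handling or evaluation code.

```python
def win_rate(judgments):
    """Fraction of pairwise comparisons won; ties count as half a win.
    `judgments` is a hypothetical list of verdicts ("win"/"tie"/"lose")
    standing in for GPT-4o comparisons of MPO vs. SFT responses."""
    wins = sum(1 for j in judgments if j == "win")
    ties = sum(1 for j in judgments if j == "tie")
    return (wins + 0.5 * ties) / len(judgments)

print(win_rate(["win", "win", "tie", "lose"]))  # 0.625
```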