Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment

Authors: Mingzhi Wang, Chengdong Ma, Qizhi Chen, Linjian Meng, Yang Han, Jiancong Xiao, Zhaowei Zhang, Jing Huo, Weijie Su, Yaodong Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct comprehensive experiments to validate the effectiveness of our proposed MPO algorithm. We start by focusing on safety as the primary alignment metric. ... Empirical results demonstrate that MPO can significantly enhance the performance of LLMs, highlighting the potential of self-play methods in alignment. ... We present our cost model evaluation results in Table 1, where lower cost means safer outputs. The GPT-4o evaluation results, shown in Table 2, reveal that our method substantially improves the model’s win rate compared to the SFT model. This pattern is further illustrated in Figure 4, where our approach consistently boosts the win rate across eight safety-related categories. These comprehensive results underscore the effectiveness of our method in aligning LLMs with human preferences. Additionally, we also investigate the case where self-play is omitted, and the results show significantly poorer performance compared to the self-play setting. ... To further evaluate the effectiveness of each individual component of MPO, we conduct an ablation study shown in Figure 5.
Researcher Affiliation | Collaboration | 1 Institute for Artificial Intelligence, Peking University; 2 Beijing Academy of Artificial Intelligence; 3 National Key Laboratory for Novel Software Technology, Nanjing University; 4 China Telecom; 5 University of Pennsylvania
Pseudocode | Yes | The pseudocode of MPO is provided in Algorithm 1. ... The pseudocode of MPO-RT is provided in Algorithm 2.
Open Source Code | No | The paper mentions building on the 'official codebase of Safe-RLHF' (Dai et al., 2023), the 'official codebase' of Dong et al. (2024), and the 'implementation provided in OpenRLHF (Hu et al., 2024)'. These are third-party codebases used by the authors; the paper does not explicitly state that the authors release their own source code for MPO.
Open Datasets | Yes | Following the methodology of Dai et al. (2023), we first perform supervised fine-tuning on the Alpaca (Taori et al., 2023) dataset, ... train our preference model on a mixture of widely-used open-source preference datasets1, ... prompts sourced from the PKU-SafeRLHF (Ji et al., 2024) dataset and the HH-Harmless section of the Anthropic Helpful and Harmless dialogue (HH) dataset (Bai et al., 2022). ... SFT-OpenHermes-2.5-Standard2 dataset. The obtained SFT model serves as a good foundation for our experiments. Next, we train a preference model based on a mixture of open-source preference datasets3. We then fine-tune the SFT model using 30K prompts selected from a collection of the UltraFeedback (Cui et al., 2023), HelpSteer (Wang et al., 2023), OpenOrca (Lian et al., 2023), UltraInteract (Yuan et al., 2024), and Capybara (Daniele & Suphavadeepprasit, 2023) datasets for four rounds of self-play. ... 1 https://huggingface.co/datasets/weqweasdas/preference_dataset_mixture2_and_safe_pku ... 2 https://huggingface.co/datasets/RLHFlow/SFT-OpenHermes-2.5-Standard ... 3 https://huggingface.co/datasets/hendrydong/preference_700K ... 13 https://huggingface.co/datasets/OpenRLHF/prompt-collection-v0.1
Dataset Splits | No | The paper states: 'These prompts are equally divided and used over three rounds of self-play.' and 'These prompts are evenly split across four rounds of self-play.' This describes how prompts are distributed across self-play rounds, but it does not specify explicit training, validation, or test splits (e.g., percentages or exact counts for each split used to train or evaluate the models). The selection of '30K prompts' is mentioned, but not their division into standard data splits.
Hardware Specification | Yes | These experiments are conducted on an 8 A800-40GB GPU server. (Appendix B) ... These experiments are conducted on an 8 A800-80GB GPU server. (Appendix C)
Software Dependencies | No | The paper refers to the 'official codebase' of Safe-RLHF and the 'implementation provided in OpenRLHF', but it does not specify version numbers for any software libraries (e.g., PyTorch, Hugging Face Transformers) or programming languages (e.g., Python). Table 4 lists 'adam_torch_fused' as the optimizer, but no associated library version is given.
Experiment Setup | Yes | The hyper-parameters used during training are listed in Table 4. ... The hyper-parameters used during the SFT training process are presented in Table 6. ... The hyper-parameters used for training on both datasets are detailed in Table 7. ... For PPO, we follow the official default hyper-parameters, with a few adjustments to align the settings with those of MPO: specifically, we set the actor learning rate to 5e-7, the max length to 1024, the batch size to 64, the ptx coefficient to 0, and the critic learning rate to 9e-6. For Iterative DPO, we use the implementation provided in OpenRLHF (Hu et al., 2024). The default hyper-parameters are used, with the max length adjusted to 1024 to match the MPO setup.
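The PPO adjustments quoted in the Experiment Setup entry above can be collected into a small override-on-defaults config. This is a minimal sketch: the key names (`actor_learning_rate`, `ptx_coefficient`, etc.) and the `apply_overrides` helper are illustrative assumptions, not the flag names used by the Safe-RLHF or OpenRLHF codebases.

```python
# Hypothetical key names collecting the PPO overrides quoted above;
# the actual Safe-RLHF/OpenRLHF configs use their own flag names.
ppo_overrides = {
    "actor_learning_rate": 5e-7,   # actor LR from the quoted setup
    "critic_learning_rate": 9e-6,  # critic LR
    "max_length": 1024,            # matches the MPO max length
    "batch_size": 64,
    "ptx_coefficient": 0.0,        # disables the pretraining-loss mixing term
}

def apply_overrides(defaults, overrides):
    """Return a config that keeps official defaults except where overridden."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged
```

Keeping the overrides separate from the untouched defaults makes it easy to see exactly which settings were changed to match MPO.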
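The Dataset Splits entry above notes that prompts are "evenly split" across self-play rounds. That per-round division can be sketched as a simple chunking helper; the function name and the remainder-handling policy here are assumptions for illustration, not taken from the paper.

```python
def split_into_rounds(prompts, n_rounds):
    """Evenly divide prompts across self-play rounds; earlier rounds
    absorb any remainder so every prompt is used exactly once."""
    base, rem = divmod(len(prompts), n_rounds)
    rounds, start = [], 0
    for r in range(n_rounds):
        size = base + (1 if r < rem else 0)  # spread leftovers over early rounds
        rounds.append(prompts[start:start + size])
        start += size
    return rounds

# e.g. 30K prompts over four rounds of self-play -> 7500 prompts per round
sizes = [len(chunk) for chunk in split_into_rounds(list(range(30_000)), 4)]
print(sizes)  # [7500, 7500, 7500, 7500]
```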
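The GPT-4o evaluation quoted in the Research Type entry reports a win rate against the SFT model. As a hedged sketch of that metric, the tally below assumes a hypothetical list of per-prompt judge verdicts and a ties-count-half convention; the paper does not specify its tie handling or evaluation code.

```python
def win_rate(judgments):
    """Fraction of pairwise comparisons won; ties count as half a win.
    `judgments` is a hypothetical list of verdicts ("win"/"tie"/"lose")
    standing in for GPT-4o comparisons of MPO vs. SFT responses."""
    wins = sum(1 for j in judgments if j == "win")
    ties = sum(1 for j in judgments if j == "tie")
    return (wins + 0.5 * ties) / len(judgments)

print(win_rate(["win", "win", "tie", "lose"]))  # 0.625
```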