MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment

Authors: Tianze Wang, Dongnan Gui, Yifan Hu, Shuhang Lin, Linjun Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models at significantly reduced computational cost. As shown in Figure 3, we validate our approach by aligning sentiment and conciseness on LLaMA 3.2-3B (Dubey et al., 2024). To assess scalability and robustness, we extend MPO to optimize three objectives in the Helpful Assistant task (Bai et al., 2022) and conduct comparative evaluations against previous approaches using Qwen2.5-7B (Qwen Team, 2024). Experimental findings show that MPO achieves comparable, if not superior, performance to MaxMin-RLHF while significantly reducing computational overhead.
Researcher Affiliation | Academia | (1) Department of Statistics, Rutgers University, New Brunswick, United States; (2) College of Management of Technology, EPFL, Switzerland; (3) Department of Computer Science, ETH Zurich, Switzerland; (4) Department of Computer Science, Rutgers University, New Brunswick, United States. Correspondence to: Linjun Zhang <EMAIL>.
Pseudocode | Yes | Algorithm 1 (MPO: Post-processing Algorithm for Diverse Preference Alignment) and Algorithm 2 (Coefficient Optimization using Batch Stochastic Mirror Descent).
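The paper's Algorithm 2 itself is not reproduced in this report, but the core step of stochastic mirror descent over mixing coefficients on the probability simplex is commonly realized as an exponentiated-gradient update. The sketch below is a minimal illustration under that assumption; the toy gradient values, sign convention, and three-preference setup are ours, not the authors' code (the stepsize η = 0.02 matches the reported setup):

```python
import numpy as np

def mirror_descent_step(w, grad, eta=0.02):
    """One exponentiated-gradient (mirror descent) update on the simplex.

    w    : current mixing coefficients over preferences (sums to 1)
    grad : stochastic gradient estimate of the loss w.r.t. w
    eta  : stepsize (the paper reports eta = 0.02)
    """
    # Multiplicative update, then renormalize back onto the simplex.
    w_new = w * np.exp(-eta * grad)
    return w_new / w_new.sum()

# Toy illustration with three preference dimensions
# (e.g., helpful / harmless / humorous).
w = np.ones(3) / 3                 # uniform initialization
grad = np.array([0.5, -0.2, 0.1])  # placeholder stochastic gradient
w = mirror_descent_step(w, grad)
```

The multiplicative form keeps every coefficient strictly positive and the renormalization keeps the iterate on the simplex, which is why mirror descent with the entropy mirror map is the standard choice for optimizing simplex-constrained weights.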
Open Source Code | No | The paper mentions using open-source tools such as unsloth (Daniel Han & team, 2023) and trl (von Werra et al., 2020), and links to the TRL library at https://github.com/huggingface/trl. However, it does not state that the authors release their own implementation of the MPO method described in the paper.
Open Datasets | Yes | The controlled sentiment generation task on the IMDb dataset (Maas et al., 2011) aims to learn an optimal policy that balances positive sentiment and conciseness when generating movie reviews. The HH-RLHF dataset (Bai et al., 2022), which contains dialogues with human-annotated preference labels for AI-generated responses, is divided into three equal-sized subsets: D_helpful, D_harmless, and D_humorous.
Dataset Splits | No | The paper states, 'For this experiment, we split the dataset into two preference subsets: D1 prioritizes positive sentiment, while D2 favors conciseness (fewer tokens).' and 'The HH-RLHF dataset (Bai et al., 2022)... is divided into three equal-sized subsets: D_helpful, D_harmless, and D_humorous.' It also mentions an evaluation set X_eval. However, it does not provide explicit training/validation/test splits (e.g., percentages, sample counts, or references to standard benchmark splits) needed to reproduce the experiment partitioning.
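Since the paper describes equal-sized preference subsets without giving exact counts or a split procedure, one plausible way to produce such a three-way partition can be sketched as follows. Everything here is an assumption for illustration: the shuffling, seed, and data are placeholders, and the subset roles merely mirror the paper's D_helpful / D_harmless / D_humorous naming:

```python
import random

def three_way_split(examples, seed=0):
    """Partition a list of preference examples into three equal-sized,
    disjoint subsets (illustrative; not the authors' actual procedure)."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)  # randomize assignment before slicing
    k = len(examples) // 3
    return [[examples[i] for i in idx[j * k : (j + 1) * k]] for j in range(3)]

# Placeholder "dataset" of 99 items -> three subsets of 33 each.
d_helpful, d_harmless, d_humorous = three_way_split(list(range(99)))
```

Note that slicing after a shuffle guarantees the three subsets are disjoint, which a per-example random label would not.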
Hardware Specification | Yes | NVIDIA A100 40 GB
Software Dependencies | No | The paper mentions the pre-trained language models LLaMA 3.2-3B and Qwen2.5-7B, the unsloth implementation (Daniel Han & team, 2023), the RL algorithm PPO (Schulman et al., 2017), and the trl implementation (von Werra et al., 2020). However, it does not give version numbers for general software dependencies such as Python, PyTorch, CUDA, or the unsloth and trl frameworks.
Experiment Setup | Yes | Learning rate: 1e-5; Optimizer: Adam; Inference tokens for evaluation: 128; Temperature: 0.5; β: 0.1 (Sentiment and Conciseness), 0.1 or 0.5 (Helpful Assistant); DPO inner epochs: 2 (Sentiment and Conciseness), 4 (Helpful Assistant); PPO inner epochs: 4; Discount γ: 1; GAE parameter λ: 0.95; Clip range: 0.2; Batch Stochastic Mirror Descent: stepsize η = 0.02, batch size m = 40.
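For reference, the reported hyperparameters can be collected into a single configuration sketch. The key names and grouping below are our own convention (the paper does not publish a config schema); only the values come from the reported setup:

```python
# Hyperparameters as reported in the paper's experiment-setup table.
# Key names and structure are illustrative, not the authors' schema.
MPO_CONFIG = {
    "learning_rate": 1e-5,
    "optimizer": "Adam",
    "eval_max_new_tokens": 128,
    "temperature": 0.5,
    # beta differs by task: 0.1 for Sentiment and Conciseness;
    # 0.1 or 0.5 for Helpful Assistant.
    "beta": {"sentiment_conciseness": 0.1, "helpful_assistant": (0.1, 0.5)},
    "dpo_inner_epochs": {"sentiment_conciseness": 2, "helpful_assistant": 4},
    "ppo_inner_epochs": 4,
    "discount_gamma": 1.0,
    "gae_lambda": 0.95,
    "clip_range": 0.2,
    # Batch Stochastic Mirror Descent settings.
    "bsmd": {"stepsize_eta": 0.02, "batch_size_m": 40},
}
```

Keeping per-task values nested (rather than duplicating the whole config per task) makes the shared defaults and the two task-specific deviations easy to see at a glance.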