MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment

Authors: Tianze Wang, Dongnan Gui, Yifan Hu, Shuhang Lin, Linjun Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models at significantly reduced computational cost. As shown in Figure 3, we validate our approach by aligning sentiment and conciseness on LLaMA 3.2-3B (Dubey et al., 2024). To assess scalability and robustness, we extend MPO to optimize three objectives in the Helpful Assistant task (Bai et al., 2022) and conduct comparative evaluations against previous approaches using Qwen2.5-7B (Qwen Team, 2024). Experimental findings show that MPO achieves comparable, if not superior, performance to MaxMin-RLHF while significantly reducing computational overhead.
Researcher Affiliation | Academia | (1) Department of Statistics, Rutgers University, New Brunswick, United States; (2) College of Management of Technology, EPFL, Switzerland; (3) Department of Computer Science, ETH Zurich, Switzerland; (4) Department of Computer Science, Rutgers University, New Brunswick, United States. Correspondence to: Linjun Zhang <EMAIL>.
Pseudocode | Yes | Algorithm 1 (MPO: Post-processing Algorithm for Diverse Preference Alignment) and Algorithm 2 (Coefficient Optimization using Batch Stochastic Mirror Descent).
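The paper's Algorithm 2 itself is not reproduced in this report, but the core step of stochastic mirror descent over mixing coefficients on the probability simplex is commonly realized as an exponentiated-gradient update. The sketch below is a minimal illustration under that assumption; the toy gradient values, sign convention, and three-preference setup are ours, not the authors' code (the stepsize η = 0.02 matches the reported setup):

```python
import numpy as np

def mirror_descent_step(w, grad, eta=0.02):
    """One exponentiated-gradient (mirror descent) update on the simplex.

    w    : current mixing coefficients over preferences (sums to 1)
    grad : stochastic gradient estimate of the loss w.r.t. w
    eta  : stepsize (the paper reports eta = 0.02)
    """
    # Multiplicative update, then renormalize back onto the simplex.
    w_new = w * np.exp(-eta * grad)
    return w_new / w_new.sum()

# Toy illustration with three preference dimensions
# (e.g., helpful / harmless / humorous).
w = np.ones(3) / 3                 # uniform initialization
grad = np.array([0.5, -0.2, 0.1])  # placeholder stochastic gradient
w = mirror_descent_step(w, grad)
```

The multiplicative form keeps every coefficient strictly positive and the renormalization keeps the iterate on the simplex, which is why mirror descent with the entropy mirror map is the standard choice for optimizing simplex-constrained weights.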
Open Source Code | No | The paper mentions using open-source tools such as unsloth (Daniel Han & team, 2023) and trl (von Werra et al., 2020), and links to the TRL library at https://github.com/huggingface/trl. However, it does not state that the authors release their own implementation of the MPO method described in the paper.
Open Datasets | Yes | The controlled sentiment generation task on the IMDb dataset (Maas et al., 2011) aims to learn an optimal policy that balances positive sentiment and conciseness when generating movie reviews. The HH-RLHF dataset (Bai et al., 2022), which contains dialogues with human-annotated preference labels for AI-generated responses, is divided into three equal-sized subsets: D_helpful, D_harmless, and D_humorous.
Dataset Splits | No | The paper states, 'For this experiment, we split the dataset into two preference subsets: D1 prioritizes positive sentiment, while D2 favors conciseness (fewer tokens).' and 'The HH-RLHF dataset (Bai et al., 2022)... is divided into three equal-sized subsets: D_helpful, D_harmless, and D_humorous.' It also mentions an evaluation set X_eval. However, it does not provide explicit training/validation/test splits (e.g., percentages, sample counts, or references to standard benchmark splits) needed to reproduce the experiment partitioning.
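Since the paper describes equal-sized preference subsets without giving exact counts or a split procedure, one plausible way to produce such a three-way partition can be sketched as follows. Everything here is an assumption for illustration: the shuffling, seed, and data are placeholders, and the subset roles merely mirror the paper's D_helpful / D_harmless / D_humorous naming:

```python
import random

def three_way_split(examples, seed=0):
    """Partition a list of preference examples into three equal-sized,
    disjoint subsets (illustrative; not the authors' actual procedure)."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)  # randomize assignment before slicing
    k = len(examples) // 3
    return [[examples[i] for i in idx[j * k : (j + 1) * k]] for j in range(3)]

# Placeholder "dataset" of 99 items -> three subsets of 33 each.
d_helpful, d_harmless, d_humorous = three_way_split(list(range(99)))
```

Note that slicing after a shuffle guarantees the three subsets are disjoint, which a per-example random label would not.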
Hardware Specification | Yes | NVIDIA A100 40 GB
Software Dependencies | No | The paper mentions the pre-trained language models LLaMA 3.2-3B and Qwen2.5-7B, the unsloth implementation (Daniel Han & team, 2023), the RL algorithm PPO (Schulman et al., 2017), and the trl implementation (von Werra et al., 2020). However, it does not give version numbers for general software dependencies such as Python, PyTorch, CUDA, or the unsloth and trl frameworks.
Experiment Setup | Yes | Learning rate: 1e-5; Optimizer: Adam; Inference tokens for evaluation: 128; Temperature: 0.5; β: 0.1 (Sentiment and Conciseness), 0.1 or 0.5 (Helpful Assistant); DPO inner epochs: 2 (Sentiment and Conciseness), 4 (Helpful Assistant); PPO inner epochs: 4; Discount γ: 1; GAE parameter λ: 0.95; Clip range: 0.2; Batch Stochastic Mirror Descent: stepsize η = 0.02, batch size m = 40.
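For reference, the reported hyperparameters can be collected into a single configuration sketch. The key names and grouping below are our own convention (the paper does not publish a config schema); only the values come from the reported setup:

```python
# Hyperparameters as reported in the paper's experiment-setup table.
# Key names and structure are illustrative, not the authors' schema.
MPO_CONFIG = {
    "learning_rate": 1e-5,
    "optimizer": "Adam",
    "eval_max_new_tokens": 128,
    "temperature": 0.5,
    # beta differs by task: 0.1 for Sentiment and Conciseness;
    # 0.1 or 0.5 for Helpful Assistant.
    "beta": {"sentiment_conciseness": 0.1, "helpful_assistant": (0.1, 0.5)},
    "dpo_inner_epochs": {"sentiment_conciseness": 2, "helpful_assistant": 4},
    "ppo_inner_epochs": 4,
    "discount_gamma": 1.0,
    "gae_lambda": 0.95,
    "clip_range": 0.2,
    # Batch Stochastic Mirror Descent settings.
    "bsmd": {"stepsize_eta": 0.02, "batch_size_m": 40},
}
```

Keeping per-task values nested (rather than duplicating the whole config per task) makes the shared defaults and the two task-specific deviations easy to see at a glance.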