AMPO: Active Multi-Preference Optimization for Self-play Preference Selection

Authors: Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan

ICML 2025

Reproducibility assessment (Variable / Result / LLM Response):
Variable: Research Type
Result: Experimental
LLM Response: The paper states: "Empirically, AMPO achieves state-of-the-art results on Alpaca Eval with Llama 8B and Mistral 7B. We release our datasets here." (Section 1) Empirically, AMPO sets a new benchmark on Alpaca Eval with Llama 8B, surpassing strong baselines such as SimPO by focusing on a small but strategically chosen set of responses each iteration (see Section 8).
Variable: Researcher Affiliation
Result: Collaboration
LLM Response: 1) Microsoft; 2) IISc, Bangalore. Correspondence to: Taneesh Gupta <EMAIL>, Rahul Madhavan <EMAIL>.
Variable: Pseudocode
Result: Yes
LLM Response: Algorithm 1 (AMPO: One-Positive vs. K-Active Negatives), Algorithm 2 (AMPO-CORESET via k-means), Algorithm 3 (AMPO-OPTSELECT via solving a MIP).
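Algorithm 2's coreset step is not reproduced in this excerpt. The following is a minimal sketch, assuming responses have already been embedded as vectors, of a k-means-based representative selection (the function names and the nearest-to-centroid rule are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster empties.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

def coreset_select(embeddings, k):
    """Pick one representative response per cluster: the member
    closest to its cluster centroid (illustrative choice)."""
    centroids, labels = kmeans(embeddings, k)
    chosen = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if len(members) == 0:
            continue
        d = np.linalg.norm(embeddings[members] - centroids[j], axis=1)
        chosen.append(int(members[d.argmin()]))
    return sorted(chosen)
```

For example, with 32 response embeddings per prompt (matching the sampling setup described later), `coreset_select(emb, 5)` would return up to 5 representative response indices.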
Variable: Open Source Code
Result: No
LLM Response: The paper states: "Dataset Releases: We publicly release our AMPO-Coreset-Selection and AMPO-Opt-Selection datasets on Hugging Face. These contain curated response subsets for each prompt, facilitating research on multi-preference alignment." The paper states that datasets are released, but does not explicitly mention the release of source code for the methodology.
Variable: Open Datasets
Result: Yes
LLM Response: "Dataset Releases: We publicly release our AMPO-Coreset-Selection and AMPO-Opt-Selection datasets on Hugging Face. These contain curated response subsets for each prompt, facilitating research on multi-preference alignment." "We utilize prompts from the Ultra Feedback dataset (Cui et al., 2023)."
Variable: Dataset Splits
Result: No
LLM Response: The paper describes generating responses and selecting subsets for training (e.g., "For each prompt x, we produce 32 responses by sampling from the SFT model...") and mentions using prompts from the Ultra Feedback dataset. However, it does not provide training, validation, or test splits for its own experimental setup, whether as percentages, absolute counts, or references to predefined splits.
Variable: Hardware Specification
Result: No
LLM Response: The paper does not explicitly mention any specific hardware components, such as GPU models, CPU models, or memory specifications, used for running the experiments.
Variable: Software Dependencies
Result: No
LLM Response: The paper mentions using specific models such as "meta-llama/Meta-Llama-3-8B-Instruct" and "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2", but it does not specify software dependencies such as programming languages, libraries, or frameworks with their version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Variable: Experiment Setup
Result: Yes
LLM Response: "For each prompt x, we produce 32 responses by sampling from the SFT model with a sampling temperature of 0.8." "We found that setting the β (inverse temperature) parameter in the range of 5.0 to 10.0 consistently yields strong performance, while tuning the γ parameter within the range of 2 to 4 further improved performance." "To examine how the number of negative comparisons affects performance, we evaluate AMPO-Opt-Select with increasing values of K in the 1-vs-K selection strategy, specifically K = 3, 5, 7."
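The 1-vs-K structure described above can be sketched as follows. Given per-response scalar rewards (e.g., from a reward model) for the 32 sampled responses, the top-reward response serves as the single positive and K others as negatives. The negative-selection heuristic here (evenly spaced picks over the sorted reward range) is a hypothetical stand-in for the paper's coreset/MIP selection, shown only to illustrate the 1-positive-vs-K-negatives split:

```python
import numpy as np

def one_vs_k(rewards, k):
    """Illustrative 1-vs-K split: the top-reward response is the single
    positive; K negatives are spread evenly across the remaining,
    reward-sorted responses (a stand-in heuristic, not the paper's
    coreset/MIP selection)."""
    order = np.argsort(rewards)[::-1]  # indices, best reward first
    positive = int(order[0])
    rest = order[1:]
    # Evenly spaced positions over the sorted remainder.
    picks = np.linspace(0, len(rest) - 1, num=k).round().astype(int)
    negatives = [int(rest[i]) for i in sorted(set(picks.tolist()))]
    return positive, negatives
```

With 32 responses and K = 3, 5, or 7 (the values evaluated above), this yields one positive and K distinct negatives per prompt.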