AMPO: Active Multi-Preference Optimization for Self-play Preference Selection

Authors: Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan

ICML 2025

Reproducibility assessment (Variable / Result / LLM Response):
Variable: Research Type
Result: Experimental
LLM Response: The paper states: "Empirically, AMPO achieves state-of-the-art results on Alpaca Eval with Llama 8B and Mistral 7B. We release our datasets here." (Section 1) Empirically, AMPO sets a new benchmark on Alpaca Eval with Llama 8B, surpassing strong baselines such as SimPO by focusing on a small but strategically chosen set of responses each iteration (see Section 8).
Variable: Researcher Affiliation
Result: Collaboration
LLM Response: 1) Microsoft; 2) IISc, Bangalore. Correspondence to: Taneesh Gupta <EMAIL>, Rahul Madhavan <EMAIL>.
Variable: Pseudocode
Result: Yes
LLM Response: Algorithm 1 (AMPO: One-Positive vs. K-Active Negatives), Algorithm 2 (AMPO-CORESET via k-means), Algorithm 3 (AMPO-OPTSELECT via solving a MIP).
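Algorithm 2's coreset step is not reproduced in this excerpt. The following is a minimal sketch, assuming responses have already been embedded as vectors, of a k-means-based representative selection (the function names and the nearest-to-centroid rule are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster empties.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

def coreset_select(embeddings, k):
    """Pick one representative response per cluster: the member
    closest to its cluster centroid (illustrative choice)."""
    centroids, labels = kmeans(embeddings, k)
    chosen = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if len(members) == 0:
            continue
        d = np.linalg.norm(embeddings[members] - centroids[j], axis=1)
        chosen.append(int(members[d.argmin()]))
    return sorted(chosen)
```

For example, with 32 response embeddings per prompt (matching the sampling setup described later), `coreset_select(emb, 5)` would return up to 5 representative response indices.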
Variable: Open Source Code
Result: No
LLM Response: The paper states: "Dataset Releases: We publicly release our AMPO-Coreset-Selection and AMPO-Opt-Selection datasets on Hugging Face. These contain curated response subsets for each prompt, facilitating research on multi-preference alignment." The paper states that datasets are released, but does not explicitly mention the release of source code for the methodology.
Variable: Open Datasets
Result: Yes
LLM Response: "Dataset Releases: We publicly release our AMPO-Coreset-Selection and AMPO-Opt-Selection datasets on Hugging Face. These contain curated response subsets for each prompt, facilitating research on multi-preference alignment." "We utilize prompts from the Ultra Feedback dataset (Cui et al., 2023)."
Variable: Dataset Splits
Result: No
LLM Response: The paper describes generating responses and selecting subsets for training (e.g., "For each prompt x, we produce 32 responses by sampling from the SFT model...") and mentions using prompts from the Ultra Feedback dataset. However, it does not provide training, validation, or test splits for its own experimental setup, whether as percentages, absolute counts, or references to predefined splits.
Variable: Hardware Specification
Result: No
LLM Response: The paper does not explicitly mention any specific hardware components, such as GPU models, CPU models, or memory specifications, used for running the experiments.
Variable: Software Dependencies
Result: No
LLM Response: The paper mentions using specific models such as "meta-llama/Meta-Llama-3-8B-Instruct" and "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2", but it does not specify software dependencies such as programming languages, libraries, or frameworks with their version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Variable: Experiment Setup
Result: Yes
LLM Response: "For each prompt x, we produce 32 responses by sampling from the SFT model with a sampling temperature of 0.8." "We found that setting the β (inverse temperature) parameter in the range of 5.0 to 10.0 consistently yields strong performance, while tuning the γ parameter within the range of 2 to 4 further improved performance." "To examine how the number of negative comparisons affects performance, we evaluate AMPO-Opt-Select with increasing values of K in the 1-vs-K selection strategy, specifically K = 3, 5, 7."
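The 1-vs-K structure described above can be sketched as follows. Given per-response scalar rewards (e.g., from a reward model) for the 32 sampled responses, the top-reward response serves as the single positive and K others as negatives. The negative-selection heuristic here (evenly spaced picks over the sorted reward range) is a hypothetical stand-in for the paper's coreset/MIP selection, shown only to illustrate the 1-positive-vs-K-negatives split:

```python
import numpy as np

def one_vs_k(rewards, k):
    """Illustrative 1-vs-K split: the top-reward response is the single
    positive; K negatives are spread evenly across the remaining,
    reward-sorted responses (a stand-in heuristic, not the paper's
    coreset/MIP selection)."""
    order = np.argsort(rewards)[::-1]  # indices, best reward first
    positive = int(order[0])
    rest = order[1:]
    # Evenly spaced positions over the sorted remainder.
    picks = np.linspace(0, len(rest) - 1, num=k).round().astype(int)
    negatives = [int(rest[i]) for i in sorted(set(picks.tolist()))]
    return positive, negatives
```

With 32 responses and K = 3, 5, or 7 (the values evaluated above), this yields one positive and K distinct negatives per prompt.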