Self-Play Preference Optimization for Language Model Alignment

Authors: Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset and without any prompt augmentation, by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-controlled win rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms the (iterative) DPO and IPO on MT-Bench, Arena-Hard, and the Open LLM Leaderboard.
Researcher Affiliation | Academia | ¹Department of Computer Science, University of California, Los Angeles; ²Language Technologies Institute, Carnegie Mellon University; EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 Self-Play Preference Optimization (SPPO)
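The per-example update described by the SPPO algorithm can be read as a squared-error regression of the log-density ratio onto a scaled, centered win probability. The sketch below is a minimal illustration under that reading, with η = 1e3 as reported in the experiment setup; the function name and signature are hypothetical, not the authors' implementation, and `p_win` is assumed to be an estimated win probability from a preference model such as PairRM:

```python
def sppo_loss(logp_theta: float, logp_ref: float, p_win: float, eta: float = 1e3) -> float:
    """Squared loss pulling the log-probability ratio log(pi_theta / pi_ref)
    toward eta * (p_win - 0.5), where p_win estimates the probability that
    this response beats the current policy's average response."""
    log_ratio = logp_theta - logp_ref
    target = eta * (p_win - 0.5)
    return (log_ratio - target) ** 2
```

A response judged no better than average (`p_win = 0.5`) has target 0, so the loss leaves its probability ratio unchanged; a clear winner is pushed toward a large positive log ratio.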
Open Source Code | No | The paper references third-party models and tools like Snorkel-Mistral-PairRM-DPO (https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO), Mistral-7B-Instruct-v0.2 (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), and PairRM (https://huggingface.co/llm-blender/PairRM). However, there is no explicit statement or link provided for the source code of the Self-Play Preference Optimization (SPPO) methodology developed in this paper.
Open Datasets | Yes | We also adopt UltraFeedback (Cui et al., 2023) as our source of prompts, which includes around 60k prompts from diverse resources. ... We use AlpacaEval 2.0 (Dubois et al., 2024a), Arena-Hard (Li et al., 2024), MT-Bench (Zheng et al., 2024), and the Open LLM Leaderboard (Beeching et al., 2023b) as our evaluation benchmarks.
Dataset Splits | Yes | We split the dataset into three portions to avoid overfitting and ensure fair comparison with Snorkel. We follow the splitting in Snorkel for a fair comparison. ... In each iteration, we selected the model trained on the first epoch of the 20k prompts from UltraFeedback to proceed to the next iteration.
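The described three-way split of the ~60k UltraFeedback prompts, one disjoint 20k portion per self-play iteration, can be sketched as below; `split_prompts` is a hypothetical helper, not code from the paper:

```python
def split_prompts(prompts: list, n_iters: int = 3) -> list:
    """Partition the prompt list into n_iters equal, disjoint portions,
    one portion per self-play training iteration."""
    per_iter = len(prompts) // n_iters
    return [prompts[i * per_iter:(i + 1) * per_iter] for i in range(n_iters)]
```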
Hardware Specification | Yes | The experiments are conducted on 8 Nvidia A100 GPUs.
Software Dependencies | No | The paper mentions base models like Mistral-7B-Instruct-v0.2 and Llama-3-8B-Instruct, and the preference model PairRM based on DeBERTa-v3. However, it does not provide specific version numbers for any software libraries, frameworks (such as PyTorch or TensorFlow), or programming languages used to implement the methodology.
Experiment Setup | Yes | For SPPO, we trained three iterations in total. In each iteration, we selected the model trained on the first epoch of the 20k prompts from UltraFeedback to proceed to the next iteration. For both Mistral-7B-Instruct-v0.2 and Llama-3-8B-Instruct, the global training batch size is set to 64, and η is set to 1e3. The learning rate schedule is determined by the following hyperparameters: learning rate = 5.0e-7, number of total training epochs = 18, warmup ratio = 0.1, linear schedule. In practice, early stopping after the first epoch yields the best test performance.
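The reported schedule (linear warmup over the first 10% of steps, then linear decay, peak learning rate 5.0e-7) can be sketched as the following function; this is one common reading of a "linear schedule" with warmup ratio 0.1, not the authors' training code:

```python
def linear_schedule_lr(step: int, total_steps: int,
                       base_lr: float = 5.0e-7, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr over warmup_ratio * total_steps,
    then linear decay to zero over the remaining steps."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)
```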