Self-Play Preference Optimization for Language Model Alignment

Authors: Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset and without any prompt augmentation, by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-controlled win rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms the (iterative) DPO and IPO on MT-Bench, Arena-Hard, and the Open LLM Leaderboard.
Researcher Affiliation | Academia | ¹Department of Computer Science, University of California, Los Angeles; ²Language Technologies Institute, Carnegie Mellon University; EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 Self-Play Preference Optimization (SPPO)
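The per-example update described by the SPPO algorithm can be read as a squared-error regression of the log-density ratio onto a scaled, centered win probability. The sketch below is a minimal illustration under that reading, with η = 1e3 as reported in the experiment setup; the function name and signature are hypothetical, not the authors' implementation, and `p_win` is assumed to be an estimated win probability from a preference model such as PairRM:

```python
def sppo_loss(logp_theta: float, logp_ref: float, p_win: float, eta: float = 1e3) -> float:
    """Squared loss pulling the log-probability ratio log(pi_theta / pi_ref)
    toward eta * (p_win - 0.5), where p_win estimates the probability that
    this response beats the current policy's average response."""
    log_ratio = logp_theta - logp_ref
    target = eta * (p_win - 0.5)
    return (log_ratio - target) ** 2
```

A response judged no better than average (`p_win = 0.5`) has target 0, so the loss leaves its probability ratio unchanged; a clear winner is pushed toward a large positive log ratio.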
Open Source Code | No | The paper references third-party models and tools like Snorkel-Mistral-PairRM-DPO (https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO), Mistral-7B-Instruct-v0.2 (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), and PairRM (https://huggingface.co/llm-blender/PairRM). However, there is no explicit statement or link provided for the source code of the Self-Play Preference Optimization (SPPO) methodology developed in this paper.
Open Datasets | Yes | We also adopt UltraFeedback (Cui et al., 2023) as our source of prompts, which includes around 60k prompts from diverse resources. ... We use AlpacaEval 2.0 (Dubois et al., 2024a), Arena-Hard (Li et al., 2024), MT-Bench (Zheng et al., 2024), and the Open LLM Leaderboard (Beeching et al., 2023b) as our evaluation benchmarks.
Dataset Splits | Yes | We split the dataset into three portions to avoid overfitting and ensure fair comparison with Snorkel. We follow the splitting in Snorkel for a fair comparison. ... In each iteration, we selected the model trained on the first epoch of the 20k prompts from UltraFeedback to proceed to the next iteration.
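The described three-way split of the ~60k UltraFeedback prompts, one disjoint 20k portion per self-play iteration, can be sketched as below; `split_prompts` is a hypothetical helper, not code from the paper:

```python
def split_prompts(prompts: list, n_iters: int = 3) -> list:
    """Partition the prompt list into n_iters equal, disjoint portions,
    one portion per self-play training iteration."""
    per_iter = len(prompts) // n_iters
    return [prompts[i * per_iter:(i + 1) * per_iter] for i in range(n_iters)]
```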
Hardware Specification | Yes | The experiments are conducted on 8 Nvidia A100 GPUs.
Software Dependencies | No | The paper mentions base models like Mistral-7B-Instruct-v0.2 and Llama-3-8B-Instruct, and the preference model PairRM based on DeBERTa-v3. However, it does not provide specific version numbers for any software libraries, frameworks (such as PyTorch or TensorFlow), or programming languages used to implement the methodology.
Experiment Setup | Yes | For SPPO, we trained three iterations in total. In each iteration, we selected the model trained on the first epoch of the 20k prompts from UltraFeedback to proceed to the next iteration. For both Mistral-7B-Instruct-v0.2 and Llama-3-8B-Instruct, the global training batch size is set to 64, and η is set to 1e3. The learning rate schedule is determined by the following hyperparameters: learning rate = 5.0e-7, number of total training epochs = 18, warmup ratio = 0.1, linear schedule. In practice, early stopping after the first epoch yields the best test performance.
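The reported schedule (linear warmup over the first 10% of steps, then linear decay, peak learning rate 5.0e-7) can be sketched as the following function; this is one common reading of a "linear schedule" with warmup ratio 0.1, not the authors' training code:

```python
def linear_schedule_lr(step: int, total_steps: int,
                       base_lr: float = 5.0e-7, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr over warmup_ratio * total_steps,
    then linear decay to zero over the remaining steps."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)
```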