Self-Play Preference Optimization for Language Model Alignment
Authors: Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset and without any prompt augmentation, by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-controlled win-rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms the (iterative) DPO and IPO on MT-Bench, Arena-Hard, and the Open LLM Leaderboard. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of California, Los Angeles 2Language Technologies Institute, Carnegie Mellon University EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Self-Play Preference Optimization (SPPO) |
| Open Source Code | No | The paper references third-party models and tools like Snorkel-Mistral-PairRM-DPO (https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO), Mistral-7B-Instruct-v0.2 (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), and PairRM (https://huggingface.co/llm-blender/PairRM). However, there is no explicit statement or link provided for the source code of the Self-Play Preference Optimization (SPPO) methodology developed in this paper. |
| Open Datasets | Yes | We also adopt UltraFeedback (Cui et al., 2023) as our source of prompts which includes around 60k prompts from diverse resources. ... We use AlpacaEval 2.0 (Dubois et al., 2024a), Arena-Hard (Li et al., 2024), MT-Bench (Zheng et al., 2024), and Open LLM Leaderboard (Beeching et al., 2023b) as our evaluation benchmarks. |
| Dataset Splits | Yes | We split the dataset into three portions to avoid overfitting and ensure fair comparison with Snorkel. We follow the splitting in Snorkel for a fair comparison. ... In each iteration, we selected the model trained on the first epoch of the 20k prompts from UltraFeedback to proceed to the next iteration. |
| Hardware Specification | Yes | The experiments are conducted on 8 Nvidia A100 GPUs. |
| Software Dependencies | No | The paper mentions base models like Mistral-7B-Instruct-v0.2 and Llama-3-8B-Instruct, and a preference model PairRM based on DeBERTa-V3. However, it does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages used to implement the methodology. |
| Experiment Setup | Yes | For SPPO, we trained three iterations in total. In each iteration, we selected the model trained on the first epoch of the 20k prompts from UltraFeedback to proceed to the next iteration. For both Mistral-7B-Instruct-v0.2 and Llama-3-8B-Instruct, the global training batch size is set to 64, and η is set to 1e3. The learning rate schedule is determined by the following hyperparameters: learning rate=5.0e-7, number of total training epochs=18, warmup ratio=0.1, linear schedule. In practice, early stopping after the first epoch yields the best test performance. |
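The Pseudocode row cites Algorithm 1 (SPPO), whose per-iteration update regresses each response's policy-to-reference log-probability ratio toward the scaled, centered win probability estimated by the preference model. The sketch below is a hedged, toy illustration of that square loss for a single response; the function name, variable names, and numeric example are ours, not from the paper or its (unreleased) code.

```python
def sppo_square_loss(logp_theta, logp_ref, win_prob, eta=1e3):
    """Simplified per-response SPPO square loss (sketch, not the authors' code):
    push log(pi_theta(y|x) / pi_ref(y|x)) toward eta * (P_hat(y wins) - 1/2),
    where P_hat comes from a preference model such as PairRM."""
    log_ratio = logp_theta - logp_ref
    target = eta * (win_prob - 0.5)
    return (log_ratio - target) ** 2

# Toy numbers (illustrative only): a response judged to win 70% of the time
# gets a positive target; the loss penalizes the gap to the current log-ratio.
loss = sppo_square_loss(logp_theta=-10.0, logp_ref=-10.5, win_prob=0.7, eta=10.0)
# log_ratio = 0.5, target = 2.0, loss = (0.5 - 2.0)**2 = 2.25
```

A response with estimated win probability exactly 1/2 yields a zero target, so the policy is pulled back toward the reference for such responses.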
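For anyone attempting a reproduction, the hyperparameters scattered across the Experiment Setup, Dataset Splits, and Hardware rows can be gathered into a single config. This is a plain-Python sketch of the values stated above; the dictionary keys are our naming, since the paper does not release code.

```python
# Values quoted from the report's table rows; key names are illustrative
# and do not come from the authors' (unreleased) implementation.
sppo_training_config = {
    "base_models": ["Mistral-7B-Instruct-v0.2", "Llama-3-8B-Instruct"],
    "preference_model": "PairRM (0.4B, DeBERTa-V3-based)",
    "prompt_source": "UltraFeedback (~60k prompts, split into 3 portions of 20k)",
    "num_iterations": 3,
    "global_batch_size": 64,
    "eta": 1e3,
    "learning_rate": 5.0e-7,
    "num_train_epochs": 18,        # early stopping after epoch 1 works best
    "warmup_ratio": 0.1,
    "lr_scheduler_type": "linear",
    "hardware": "8x NVIDIA A100",
}
```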