Self-Improving Robust Preference Optimization
Authors: Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, Mohammad Gheshlaghi Azar
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate SRPO's effectiveness, we evaluate it using AI Win-Rate (WR) against human (GOLD) completions. When tested on the XSum dataset, SRPO outperforms DPO by a margin of 15% after 5 self-revisions, achieving an impressive 90% WR. Moreover, on the challenging Arena-Hard prompts, SRPO outperforms both DPO and IPO (by 4% without revision and 6% after a single revision), reaching a 56% WR against Llama-3.1-8B-Instruct. |
| Researcher Affiliation | Industry | Eugene Choi Cohere Arash Ahmadian Cohere for AI Matthieu Geist Cohere Olivier Pietquin Cohere Mohammad Gheshlaghi Azar Cohere. Corresponding Authors: EMAIL. Research was done at Cohere. |
| Pseudocode | Yes | Algorithm 1 Sampled SRPO |
| Open Source Code | No | The paper mentions using and fine-tuning models from third-party libraries (e.g., TRL library, OPTAX and FLAX libraries of JAX) and provides links to their GitHub repositories. However, it does not explicitly state that the authors' own implementation code for SRPO is open-source or provide a link to their specific repository. |
| Open Datasets | Yes | We use the Reddit TL;DR Summarization dataset (Stiennon et al., 2020) as the main dataset for our experiments. We also use the XSum dataset test split (Narayan et al., 2018). We use the ShareGPT-Vicuna dataset for SFT. For our preference training, we use the binarized version of the UltraFeedback dataset (Cui et al., 2023). For the win-rate evaluation, we use the Arena-Hard dataset. |
| Dataset Splits | Yes | For training, there are 116k human-written instruction following examples with reference completions (SFT split) while there are 93k human-annotated preference pairs (Preference split). We also use the XSum dataset test split... which contains 11.5k total test examples. In both settings, we use the first 1,024 samples from each of the test sets. To estimate the win rate more accurately with confidence intervals, we bootstrap 20 times with replacement from the 1,024 samples, each time using a sample size of 512. We use the ShareGPT-Vicuna dataset for SFT, a filtered version of the original dataset containing 53k prompt and completion pairs. For our preference training, we use the binarized version of the UltraFeedback dataset (Cui et al., 2023), which contains 64k pairwise preference data. For the win-rate evaluation, we use the Arena-Hard dataset, which contains 500 prompts. |
| Hardware Specification | Yes | We use LLaMA-7B as base model (Touvron et al., 2023) and a single node of 8 NVIDIA H100 GPUs to conduct all LLaMA-based experiments. We use Llama-3.1-8B Base as base model (Dubey et al., 2024) and a Google Cloud v5litepod-256 TPU pod to conduct all Llama-based training and evaluation. |
| Software Dependencies | No | The paper mentions using specific software components like the AdamW optimizer, PEFT settings in the TRL library, and the OPTAX and FLAX libraries of JAX, but it does not provide explicit version numbers for these software dependencies. |
| Experiment Setup | Yes | In the SFT stage, we train for 2 epochs, using the AdamW optimizer (Loshchilov & Hutter, 2019), with β1 = 0.9 and β2 = 0.999, and 0.1 weight-decay. We use a cosine decay learning rate (Loshchilov & Hutter, 2017) with a peak value of 2 × 10⁻⁵ and 3% of all steps being warm-up steps. We use an effective batch-size of 64. All models were trained for 5 epochs on the TL;DR preference split using the same optimization setting of the AdamW optimizer as in the SFT stage with 150 warmup steps, and an effective batch-size of 128. For SRPO and IPO, we used β = 0.01 with a learning rate of 2 × 10⁻⁶. For DPO following Rafailov et al. (2023), we used the common β = 0.1 with a learning rate of 1 × 10⁻⁶ and a constant learning rate schedule. |
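The SFT learning-rate schedule quoted above (cosine decay, peak 2 × 10⁻⁵, 3% linear warm-up) can be sketched as a plain function. This is a minimal illustration, not the authors' code; `total_steps` and the exact warm-up/decay shape are assumptions based only on the quoted description.

```python
import math

def lr_schedule(step, total_steps, peak=2e-5, warmup_frac=0.03):
    """Cosine-decay learning rate with linear warm-up.

    Sketch of the SFT schedule described in the paper: 3% of all steps
    are warm-up, then the rate decays from `peak` to 0 along a cosine.
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak value.
        return peak * step / warmup_steps
    # Cosine decay from peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak * (1 + math.cos(math.pi * progress))
```

The schedule starts at 0, reaches the peak value exactly at the end of warm-up, and decays back to 0 at the final step, which matches the usual `warmup_cosine_decay_schedule` found in libraries such as OPTAX (mentioned in the paper's dependencies).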
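The win-rate evaluation protocol quoted in the Dataset Splits row (bootstrap 20 times with replacement from 1,024 samples, sample size 512) is simple enough to sketch. The function below is a hypothetical illustration of that procedure; the paper does not publish its evaluation code, and the input here is assumed to be a list of binary judge decisions (1 = model win over the GOLD completion).

```python
import random

def bootstrap_win_rate(outcomes, n_rounds=20, sample_size=512, seed=0):
    """Estimate a win rate and its spread by bootstrap resampling.

    `outcomes` holds 0/1 judge decisions over the evaluated prompts
    (the paper uses the first 1,024 test samples). Each round draws
    `sample_size` outcomes with replacement and records the win rate;
    returns the mean and sample standard deviation across rounds.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_rounds):
        resample = [rng.choice(outcomes) for _ in range(sample_size)]
        rates.append(sum(resample) / sample_size)
    mean = sum(rates) / n_rounds
    var = sum((r - mean) ** 2 for r in rates) / (n_rounds - 1)
    return mean, var ** 0.5
```

The standard deviation across bootstrap rounds gives the confidence-interval width the paper refers to; with only 20 rounds the interval estimate is coarse, which is presumably why a large per-round sample (512) is used.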