Self-Improving Robust Preference Optimization
Authors: Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, Mohammad Gheshlaghi Azar
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate SRPO's effectiveness, we evaluate it using AI Win-Rate (WR) against human (GOLD) completions. When tested on the XSum dataset, SRPO outperforms DPO by a margin of 15% after 5 self-revisions, achieving an impressive 90% WR. Moreover, on the challenging Arena-Hard prompts, SRPO outperforms both DPO and IPO (by 4% without revision and 6% after a single revision), reaching a 56% WR against Llama-3.1-8B-Instruct. |
| Researcher Affiliation | Industry | Eugene Choi Cohere Arash Ahmadian Cohere for AI Matthieu Geist Cohere Olivier Pietquin Cohere Mohammad Gheshlaghi Azar Cohere. Corresponding Authors: EMAIL. Research was done at Cohere. |
| Pseudocode | Yes | Algorithm 1 Sampled SRPO |
| Open Source Code | No | The paper mentions using and fine-tuning models from third-party libraries (e.g., TRL library, OPTAX and FLAX libraries of JAX) and provides links to their GitHub repositories. However, it does not explicitly state that the authors' own implementation code for SRPO is open-source or provide a link to their specific repository. |
| Open Datasets | Yes | We use the Reddit TL;DR Summarization dataset (Stiennon et al., 2020) as the main dataset for our experiments. We also use the XSum dataset test split (Narayan et al., 2018). We use the ShareGPT-Vicuna dataset for SFT. For our preference training, we use the binarized version of the UltraFeedback dataset (Cui et al., 2023). For the win-rate evaluation, we use the Arena-Hard dataset. |
| Dataset Splits | Yes | For training, there are 116k human-written instruction following examples with reference completions (SFT split) while there are 93k human-annotated preference pairs (Preference split). We also use the XSum dataset test split... which contains 11.5k total test examples. In both settings, we use the first 1,024 samples from each of the test sets. To estimate the win rate more accurately with confidence intervals, we bootstrap 20 times with replacement from the 1,024 samples, each time using a sample size of 512. We use the ShareGPT-Vicuna dataset for SFT, a filtered version of the original dataset containing 53k prompt and completion pairs. For our preference training, we use the binarized version of the UltraFeedback dataset (Cui et al., 2023), which contains 64k pairwise preference data. For the win-rate evaluation, we use the Arena-Hard dataset, which contains 500 prompts. |
| Hardware Specification | Yes | We use LLaMA-7B as base model (Touvron et al., 2023) and a single node of 8 NVIDIA H100 GPUs to conduct all LLaMA-based experiments. We use Llama-3.1-8B Base as base model (Dubey et al., 2024) and a Google Cloud v5litepod-256 TPU pod to conduct all Llama-based training and evaluation. |
| Software Dependencies | No | The paper mentions using specific software components like the AdamW optimizer, PEFT settings in the TRL library, and the OPTAX and FLAX libraries of JAX, but it does not provide explicit version numbers for these software dependencies. |
| Experiment Setup | Yes | In the SFT stage, we train for 2 epochs, using the AdamW optimizer (Loshchilov & Hutter, 2019), with β1 = 0.9 and β2 = 0.999, and 0.1 weight-decay. We use a cosine decay learning rate (Loshchilov & Hutter, 2017) with a peak value of 2 × 10⁻⁵ and 3% of all steps being warm-up steps. We use an effective batch-size of 64. All models were trained for 5 epochs on the TL;DR preference split using the same optimization setting of the AdamW optimizer as in the SFT stage with 150 warmup steps, and an effective batch-size of 128. For SRPO and IPO, we used β = 0.01 with a learning rate of 2 × 10⁻⁶. For DPO following Rafailov et al. (2023), we used the common β = 0.1 with a learning rate of 1 × 10⁻⁶ and a constant learning rate schedule. |
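The SFT learning-rate schedule quoted above (cosine decay, peak 2 × 10⁻⁵, 3% linear warm-up) can be sketched as a plain function. This is a minimal illustration, not the authors' code; `total_steps` and the exact warm-up/decay shape are assumptions based only on the quoted description.

```python
import math

def lr_schedule(step, total_steps, peak=2e-5, warmup_frac=0.03):
    """Cosine-decay learning rate with linear warm-up.

    Sketch of the SFT schedule described in the paper: 3% of all steps
    are warm-up, then the rate decays from `peak` to 0 along a cosine.
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak value.
        return peak * step / warmup_steps
    # Cosine decay from peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak * (1 + math.cos(math.pi * progress))
```

The schedule starts at 0, reaches the peak value exactly at the end of warm-up, and decays back to 0 at the final step, which matches the usual `warmup_cosine_decay_schedule` found in libraries such as OPTAX (mentioned in the paper's dependencies).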
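The win-rate evaluation protocol quoted in the Dataset Splits row (bootstrap 20 times with replacement from 1,024 samples, sample size 512) is simple enough to sketch. The function below is a hypothetical illustration of that procedure; the paper does not publish its evaluation code, and the input here is assumed to be a list of binary judge decisions (1 = model win over the GOLD completion).

```python
import random

def bootstrap_win_rate(outcomes, n_rounds=20, sample_size=512, seed=0):
    """Estimate a win rate and its spread by bootstrap resampling.

    `outcomes` holds 0/1 judge decisions over the evaluated prompts
    (the paper uses the first 1,024 test samples). Each round draws
    `sample_size` outcomes with replacement and records the win rate;
    returns the mean and sample standard deviation across rounds.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_rounds):
        resample = [rng.choice(outcomes) for _ in range(sample_size)]
        rates.append(sum(resample) / sample_size)
    mean = sum(rates) / n_rounds
    var = sum((r - mean) ** 2 for r in rates) / (n_rounds - 1)
    return mean, var ** 0.5
```

The standard deviation across bootstrap rounds gives the confidence-interval width the paper refers to; with only 20 rounds the interval estimate is coarse, which is presumably why a large per-round sample (512) is used.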