R.I.P.: Better Models by Survival of the Fittest Prompts

Authors: Ping Yu, Weizhe Yuan, Olga Golovneva, Tianhao Wu, Sainbayar Sukhbaatar, Jason E Weston, Jing Xu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using Llama 3.1-8B-Instruct, RIP improves the AlpacaEval 2 LC win rate by 9.4%, Arena-Hard by 8.7%, and WildBench by 9.9%.
Researcher Affiliation | Collaboration | Meta, New York University, and UC Berkeley. Correspondence to: Jing Xu <EMAIL>.
Pseudocode | No | The paper describes the 'Rejecting Instruction Preferences (RIP)' method and the 'Self-RIP' approach in Section 3, outlining steps for filtering and synthetic prompt generation, but it does not present these as structured pseudocode or algorithm blocks.
Open Source Code | No | "We release our filtered datasets on Hugging Face." For the Llama-3.1-8B-Instruct filtered dataset, see https://huggingface.co/datasets/facebook/Wildchat-RIP-Filteredby-8b-Llama; for the Llama-3.3-70B-Instruct filtered dataset, see https://huggingface.co/datasets/facebook/Wildchat-RIP-Filteredby-70b-Llama. The paper provides links to the filtered datasets, not to source code for the methodology.
Open Datasets | Yes | "We release our filtered datasets on Hugging Face." For the Llama-3.1-8B-Instruct filtered dataset, see https://huggingface.co/datasets/facebook/Wildchat-RIP-Filteredby-8b-Llama; for the Llama-3.3-70B-Instruct filtered dataset, see https://huggingface.co/datasets/facebook/Wildchat-RIP-Filteredby-70b-Llama. The paper additionally uses and cites the WildChat (Zhao et al., 2024b) and HelpSteer2 (Wang et al., 2024c) datasets.
Dataset Splits | Yes | Early stopping uses a validation set of 470 examples: 253 from the valid set of Li et al. (2024c) and 218 from the evol-test set of Xu et al. (2023a), with prompts that overlap with AlpacaEval 2 removed. Early stopping is also performed on the HelpSteer2 validation split, selecting checkpoints with the highest average response reward as determined by ArmoRM.
Hardware Specification | No | The paper mentions using Llama 3.1-8B-Instruct and Llama 3.3-70B-Instruct models for experiments but does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud resources) used for training or inference.
Software Dependencies | Yes | The authors use DPO training with the off-the-shelf Llama 3.1-8B-Instruct and Llama 3.3-70B-Instruct models, leveraging the fairseq2 library (Balioglu, 2023).
Experiment Setup | Yes | Training uses a batch size of 64 with a sweep over learning rates of 5e-7 and 1e-6 for the Llama 3.1-8B-Instruct model, and a learning rate of 1e-6 with a batch size of 256 for the Llama 3.3-70B-Instruct model. Both models are trained with a dropout rate of 0.0 and a β value of 0.1 throughout the experiments.
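The paper reports DPO training with β = 0.1. As a minimal sketch (not the paper's fairseq2 implementation), the standard DPO objective from Rafailov et al. (2023) for a single preference pair can be written as follows; the function name and example log-probability values are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (Rafailov et al., 2023).

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the policy and the frozen reference model.
    beta=0.1 matches the value reported in the experiment setup.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen over the rejected response, relative to the reference.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: the loss shrinks as the
    # policy favors the chosen response more than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative values: the policy has shifted toward the chosen response.
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-12.0,
                ref_logp_chosen=-11.0, ref_logp_rejected=-11.5)
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; gradient descent on this objective pushes the margin positive.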