R.I.P.: Better Models by Survival of the Fittest Prompts

Authors: Ping Yu, Weizhe Yuan, Olga Golovneva, Tianhao Wu, Sainbayar Sukhbaatar, Jason E Weston, Jing Xu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using Llama 3.1-8B-Instruct, RIP improves the AlpacaEval 2 LC win rate by 9.4%, Arena-Hard by 8.7%, and WildBench by 9.9%.
Researcher Affiliation | Collaboration | Meta, New York University, and UC Berkeley. Correspondence to: Jing Xu <EMAIL>.
Pseudocode | No | The paper describes the 'Rejecting Instruction Preferences (RIP)' method and the 'Self-RIP' approach in Section 3, outlining steps for filtering and synthetic prompt generation, but it does not present these as structured pseudocode or algorithm blocks.
Open Source Code | No | "We release our filtered datasets on Hugging Face." For the Llama-3.1-8B-Instruct filtered dataset, see https://huggingface.co/datasets/facebook/Wildchat-RIP-Filteredby-8b-Llama; for the Llama-3.3-70B-Instruct filtered dataset, see https://huggingface.co/datasets/facebook/Wildchat-RIP-Filteredby-70b-Llama. The paper provides links to the filtered datasets, not to source code for the methodology.
Open Datasets | Yes | "We release our filtered datasets on Hugging Face." For the Llama-3.1-8B-Instruct filtered dataset, see https://huggingface.co/datasets/facebook/Wildchat-RIP-Filteredby-8b-Llama; for the Llama-3.3-70B-Instruct filtered dataset, see https://huggingface.co/datasets/facebook/Wildchat-RIP-Filteredby-70b-Llama. The paper additionally uses and cites the WildChat (Zhao et al., 2024b) and HelpSteer2 (Wang et al., 2024c) datasets.
Dataset Splits | Yes | Early stopping uses a validation set of 470 examples: 253 from the valid set of Li et al. (2024c) and 218 from the evol-test set of Xu et al. (2023a), with prompts that overlap with AlpacaEval 2 removed. Early stopping is also performed on the HelpSteer2 validation split, selecting checkpoints with the highest average response reward as determined by ArmoRM.
Hardware Specification | No | The paper mentions using Llama 3.1-8B-Instruct and Llama 3.3-70B-Instruct models for experiments but does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud resources) used for training or inference.
Software Dependencies | Yes | The authors use DPO training with the off-the-shelf Llama 3.1-8B-Instruct and Llama 3.3-70B-Instruct models, leveraging the fairseq2 library (Balioglu, 2023).
Experiment Setup | Yes | Training uses a batch size of 64 with a sweep over learning rates of 5e-7 and 1e-6 for the Llama 3.1-8B-Instruct model, and a learning rate of 1e-6 with a batch size of 256 for the Llama 3.3-70B-Instruct model. Both models are trained with a dropout rate of 0.0 and a β value of 0.1 throughout the experiments.
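The paper reports DPO training with β = 0.1. As a minimal sketch (not the paper's fairseq2 implementation), the standard DPO objective from Rafailov et al. (2023) for a single preference pair can be written as follows; the function name and example log-probability values are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (Rafailov et al., 2023).

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the policy and the frozen reference model.
    beta=0.1 matches the value reported in the experiment setup.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen over the rejected response, relative to the reference.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: the loss shrinks as the
    # policy favors the chosen response more than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative values: the policy has shifted toward the chosen response.
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-12.0,
                ref_logp_chosen=-11.0, ref_logp_rejected=-11.5)
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; gradient descent on this objective pushes the margin positive.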