Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales
Authors: Zhepei Wei, Wei-Lin Chen, Yu Meng
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show INSTRUCTRAG consistently outperforms existing RAG methods in both training-free and trainable scenarios, achieving a relative improvement of 8.3% over the best baseline method on average across five knowledge-intensive benchmarks. Extensive analysis indicates that INSTRUCTRAG scales well with increased numbers of retrieved documents and consistently exhibits robust denoising ability even in out-of-domain datasets, demonstrating strong generalizability. |
| Researcher Affiliation | Academia | Zhepei Wei Wei-Lin Chen Yu Meng Department of Computer Science University of Virginia EMAIL |
| Pseudocode | Yes | Algorithm 1 INSTRUCTRAG |
| Open Source Code | Yes | Code is available at https://github.com/weizhepei/InstructRAG. |
| Open Datasets | Yes | We extensively validate the effectiveness of INSTRUCTRAG on five knowledge-intensive benchmarks, including PopQA (Mallen et al., 2023), TriviaQA (Joshi et al., 2017), Natural Questions (Kwiatkowski et al., 2019), ASQA (Stelmakh et al., 2022), and 2WikiMultiHopQA (Ho et al., 2020). We use Wikipedia corpus as the retrieval source, and test our method with both sparse and dense off-the-shelf retrievers, including BM25 (Robertson & Walker, 1994), DPR (Karpukhin et al., 2020), GTR (Ni et al., 2022) and Contriever (Izacard et al., 2021). |
| Dataset Splits | Yes | Table 2 (dataset statistics and retrieval setting), listed as dataset: train / test, retriever, top-K, Recall@K. PopQA: 12,868 / 1,399, Contriever, 5, 68.7. TriviaQA: 78,785 / 11,313, Contriever, 5, 73.5. Natural Questions: 79,168 / 3,610, DPR, 5, 68.8. ASQA: 4,353 / 948, GTR, 5, 82.2. 2WikiMultiHopQA: 167,454 / 12,576, BM25, 10, 40.7. |
| Hardware Specification | Yes | Our models are trained on 4 Nvidia H100 GPUs with 80GB memory via full-parameter fine-tuning. |
| Software Dependencies | No | The paper mentions 'FlashAttention (Dao, 2023)', 'Adam optimizer (Kingma & Ba, 2014)', 'Pyserini (Lin et al., 2021)', and 'vLLM (Kwon et al., 2023)'. While these are specific tools/techniques with citations, the paper does not provide version numbers for software dependencies such as Python, PyTorch, or CUDA, which are typically required for a reproducible software environment. |
| Experiment Setup | Yes | By default, all models are trained using the Adam optimizer (Kingma & Ba, 2014) for 2 epochs, with a batch size of 128, a learning rate of 2.5e-5, and a cosine learning rate schedule with 3% warmup steps. For the trainable baseline vanilla SFT, we use a slightly different learning rate of 2e-5 based on our hyper-parameter search results. To fairly compare with Self-RAG and RetRobust, we re-implement them using Llama-3-Instruct-8B. We also optimize their performance through an extensive hyper-parameter search with learning rates in [8e-6, 1e-5, 2e-5] and training epochs in [1, 2, 3]. For Self-RAG, we use a learning rate of 1e-5 with a single training epoch. For RetRobust, we use a learning rate of 2e-5 with two training epochs. The only exception is the training for RetRobust on 2WikiMultiHopQA, where we train the model for 5 epochs on the augmented training set released by the original authors. The maximum token length for all models is fixed at 4096. By default, the number of demonstrations used in INSTRUCTRAG-ICL and the baseline method few-shot demonstration with instruction is set to 2. |
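The Recall@K figures reported in the Dataset Splits row measure, for each query, whether any of the top-K retrieved documents contains the answer. A minimal sketch of that metric (the function name and document-ID representation are illustrative, not taken from the authors' code):

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of queries whose top-k retrieved documents include
    at least one gold (answer-bearing) document.

    retrieved_ids: list of ranked doc-ID lists, one per query.
    gold_ids: list of sets of answer-bearing doc IDs, one per query.
    """
    hits = 0
    for retrieved, gold in zip(retrieved_ids, gold_ids):
        # A query counts as a hit if any top-k document is gold.
        if any(doc in gold for doc in retrieved[:k]):
            hits += 1
    return hits / len(retrieved_ids)


# Toy example: query 1 succeeds at k=2, query 2 does not.
score = recall_at_k([["d1", "d2"], ["d3", "d4"]], [{"d2"}, {"d9"}], k=2)
```

For the table above, K is 5 for the Contriever/DPR/GTR settings and 10 for BM25 on 2WikiMultiHopQA.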
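The default training hyper-parameters quoted in the Experiment Setup row can be collected into a single configuration for reference. This is a hedged sketch: the dict keys are illustrative names, not identifiers from the authors' repository; only the values come from the paper.

```python
# Default INSTRUCTRAG training setup as stated in the paper
# (key names are hypothetical; values are from the quoted text).
instructrag_default_config = {
    "optimizer": "adam",
    "num_epochs": 2,
    "batch_size": 128,
    "learning_rate": 2.5e-5,
    "lr_schedule": "cosine",
    "warmup_ratio": 0.03,       # 3% warmup steps
    "max_token_length": 4096,
    "num_icl_demonstrations": 2,
}

# Baseline-specific overrides from the hyper-parameter search.
baseline_overrides = {
    "vanilla_sft": {"learning_rate": 2e-5},
    "self_rag": {"learning_rate": 1e-5, "num_epochs": 1},
    "retrobust": {"learning_rate": 2e-5, "num_epochs": 2},
}
```

Per the paper, the one exception is RetRobust on 2WikiMultiHopQA, which is trained for 5 epochs on the authors' augmented training set.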