Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales
Authors: Zhepei Wei, Wei-Lin Chen, Yu Meng
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show INSTRUCTRAG consistently outperforms existing RAG methods in both training-free and trainable scenarios, achieving a relative improvement of 8.3% over the best baseline method on average across five knowledge-intensive benchmarks. Extensive analysis indicates that INSTRUCTRAG scales well with increased numbers of retrieved documents and consistently exhibits robust denoising ability even in out-of-domain datasets, demonstrating strong generalizability. |
| Researcher Affiliation | Academia | Zhepei Wei Wei-Lin Chen Yu Meng Department of Computer Science University of Virginia EMAIL |
| Pseudocode | Yes | Algorithm 1 INSTRUCTRAG |
| Open Source Code | Yes | Code is available at https://github.com/weizhepei/InstructRAG. |
| Open Datasets | Yes | We extensively validate the effectiveness of INSTRUCTRAG on five knowledge-intensive benchmarks, including PopQA (Mallen et al., 2023), TriviaQA (Joshi et al., 2017), Natural Questions (Kwiatkowski et al., 2019), ASQA (Stelmakh et al., 2022), and 2WikiMultiHopQA (Ho et al., 2020). We use Wikipedia corpus as the retrieval source, and test our method with both sparse and dense off-the-shelf retrievers, including BM25 (Robertson & Walker, 1994), DPR (Karpukhin et al., 2020), GTR (Ni et al., 2022) and Contriever (Izacard et al., 2021). |
| Dataset Splits | Yes | Table 2 (dataset statistics and retrieval setting), listed as dataset: train / test, retriever, top-K, Recall@K. PopQA: 12,868 / 1,399, Contriever, 5, 68.7. TriviaQA: 78,785 / 11,313, Contriever, 5, 73.5. Natural Questions: 79,168 / 3,610, DPR, 5, 68.8. ASQA: 4,353 / 948, GTR, 5, 82.2. 2WikiMultiHopQA: 167,454 / 12,576, BM25, 10, 40.7. |
| Hardware Specification | Yes | Our models are trained on 4 Nvidia H100 GPUs with 80GB memory via full-parameter fine-tuning. |
| Software Dependencies | No | The paper mentions 'FlashAttention (Dao, 2023)', 'Adam optimizer (Kingma & Ba, 2014)', 'Pyserini (Lin et al., 2021)', and 'vLLM (Kwon et al., 2023)'. While these are specific tools/techniques with citations, the paper does not provide version numbers for software dependencies such as Python, PyTorch, or CUDA, which are typically required for a reproducible software environment. |
| Experiment Setup | Yes | By default, all models are trained using the Adam optimizer (Kingma & Ba, 2014) for 2 epochs, with a batch size of 128, a learning rate of 2.5e-5, and a cosine learning rate schedule with 3% warmup steps. For the trainable baseline vanilla SFT, we use a slightly different learning rate of 2e-5 based on our hyper-parameter search results. To fairly compare with Self-RAG and RetRobust, we re-implement them using Llama-3-Instruct-8B. We also optimize their performance through an extensive hyper-parameter search with learning rates in [8e-6, 1e-5, 2e-5] and training epochs in [1, 2, 3]. For Self-RAG, we use a learning rate of 1e-5 with a single training epoch. For RetRobust, we use a learning rate of 2e-5 with two training epochs. The only exception is the training for RetRobust on 2WikiMultiHopQA, where we train the model for 5 epochs on the augmented training set released by the original authors. The maximum token length for all models is fixed at 4096. By default, the number of demonstrations used in INSTRUCTRAG-ICL and the baseline method few-shot demonstration with instruction is set to 2. |
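The Recall@K figures reported in the Dataset Splits row measure, for each query, whether any of the top-K retrieved documents contains the answer. A minimal sketch of that metric (the function name and document-ID representation are illustrative, not taken from the authors' code):

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of queries whose top-k retrieved documents include
    at least one gold (answer-bearing) document.

    retrieved_ids: list of ranked doc-ID lists, one per query.
    gold_ids: list of sets of answer-bearing doc IDs, one per query.
    """
    hits = 0
    for retrieved, gold in zip(retrieved_ids, gold_ids):
        # A query counts as a hit if any top-k document is gold.
        if any(doc in gold for doc in retrieved[:k]):
            hits += 1
    return hits / len(retrieved_ids)


# Toy example: query 1 succeeds at k=2, query 2 does not.
score = recall_at_k([["d1", "d2"], ["d3", "d4"]], [{"d2"}, {"d9"}], k=2)
```

For the table above, K is 5 for the Contriever/DPR/GTR settings and 10 for BM25 on 2WikiMultiHopQA.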
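The default training hyper-parameters quoted in the Experiment Setup row can be collected into a single configuration for reference. This is a hedged sketch: the dict keys are illustrative names, not identifiers from the authors' repository; only the values come from the paper.

```python
# Default INSTRUCTRAG training setup as stated in the paper
# (key names are hypothetical; values are from the quoted text).
instructrag_default_config = {
    "optimizer": "adam",
    "num_epochs": 2,
    "batch_size": 128,
    "learning_rate": 2.5e-5,
    "lr_schedule": "cosine",
    "warmup_ratio": 0.03,       # 3% warmup steps
    "max_token_length": 4096,
    "num_icl_demonstrations": 2,
}

# Baseline-specific overrides from the hyper-parameter search.
baseline_overrides = {
    "vanilla_sft": {"learning_rate": 2e-5},
    "self_rag": {"learning_rate": 1e-5, "num_epochs": 1},
    "retrobust": {"learning_rate": 2e-5, "num_epochs": 2},
}
```

Per the paper, the one exception is RetRobust on 2WikiMultiHopQA, which is trained for 5 epochs on the authors' augmented training set.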