Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Authors: Zilong (Ryan) Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, Chen-Yu Lee, Tomas Pfister

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that SPECULATIVE RAG achieves state-of-the-art performance with reduced latency on the TriviaQA, MuSiQue, PopQA, PubHealth, and ARC-Challenge benchmarks. It notably improves accuracy by up to 12.97% while reducing latency by 50.83% compared with conventional RAG systems on PubHealth.
Researcher Affiliation | Collaboration | 1) University of California, San Diego; 2) Google Cloud AI Research; 3) Google DeepMind; 4) Google Cloud AI
Pseudocode | Yes | Algorithm 1: SPECULATIVE RAG
Open Source Code | No | The paper mentions using the Hugging Face Transformers library and DeepSpeed, but it provides no explicit statement or link indicating that the authors' own implementation of the described method is open-sourced.
Open Datasets | Yes | We evaluate our proposed SPECULATIVE RAG on five public retrieval-augmented generation benchmarks: TriviaQA (unfiltered) (Joshi et al., 2017), MuSiQue (Trivedi et al., 2022), PopQA (Mallen et al., 2023), PubHealth (Zhang et al., 2023b), and ARC-Challenge (Clark et al., 2018).
Dataset Splits | No | The paper gives specific retrieval and drafting details (e.g., "retrieve top 10 documents and generate 5 drafts per query") and, for HotpotQA, states: "We randomly sample 500 examples from the validation set of HotpotQA as the test set in our experiment." However, it does not provide complete training/validation/test splits (e.g., percentages or exact counts) for all benchmarks, which would be needed to reproduce the full set of experiments.
Hardware Specification | Yes | All experiments are conducted on a Linux server equipped with 16 NVIDIA A100-SXM4-40GB GPUs.
Software Dependencies | No | The paper states: "We implement the training scripts using the Transformers library from Hugging Face (Wolf et al., 2019). We employ DeepSpeed (Rasley et al., 2020) to accelerate the training process." While it names these libraries, it does not provide version numbers for them.
Experiment Setup | Yes | In our experiments, we utilize Mistral-7B (v0.1) as our base LM for the RAG drafter. For the RAG verifier, we employ either Mistral-7B (v0.1) or Mixtral-8x7B (v0.1) without any fine-tuning, denoted as MVerifier-7B or MVerifier-8x7B. [...] Inference is conducted using the vLLM framework (Kwon et al., 2023) with greedy decoding (temperature = 0). [...] On TriviaQA, PopQA, PubHealth, and ARC-Challenge, we retrieve the top 10 documents and generate 5 drafts per query (m = 5), each draft based on a subset of 2 documents (k = 2). For MuSiQue, we retrieve the top 15 documents and generate 10 drafts per query (m = 10), each using a subset of 6 documents, due to the more complex reasoning required.
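The draft-then-verify loop described above (Algorithm 1) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `draft_fn` stands in for the small fine-tuned RAG drafter and `verify_fn` for the larger non-fine-tuned verifier, and both signatures are hypothetical.

```python
import random

def speculative_rag(query, documents, draft_fn, verify_fn, m=5, k=2, seed=0):
    """Sketch of the SPECULATIVE RAG draft-then-verify loop.

    Hypothetical signatures assumed here:
      draft_fn(query, docs)               -> (answer_draft, rationale)
      verify_fn(query, answer, rationale) -> float score
    """
    rng = random.Random(seed)
    drafts = []
    for _ in range(m):
        # Each draft is conditioned on its own k-document subset of the
        # retrieved pool, so the m drafts can be produced in parallel.
        subset = rng.sample(documents, k)
        drafts.append(draft_fn(query, subset))
    # The verifier scores every (answer, rationale) pair; the answer of
    # the highest-scoring draft is returned as the final response.
    best_score, best_answer = max(
        (verify_fn(query, ans, rat), ans) for ans, rat in drafts
    )
    return best_answer
```

Because each draft sees only k documents rather than the full retrieved set, the drafter's context stays short, which is where the reported latency reduction comes from.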
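The per-benchmark retrieval and drafting hyperparameters quoted in the setup row can be collected into a small configuration table. The dictionary and key names below are hypothetical; only the numbers come from the paper.

```python
# Hypothetical config mirroring the reported per-benchmark settings:
# retrieve N documents per query, generate m drafts, each drafted
# from a k-document subset of the retrieved pool.
DRAFTING_CONFIG = {
    "TriviaQA":      {"retrieved_docs": 10, "num_drafts_m": 5,  "subset_size_k": 2},
    "PopQA":         {"retrieved_docs": 10, "num_drafts_m": 5,  "subset_size_k": 2},
    "PubHealth":     {"retrieved_docs": 10, "num_drafts_m": 5,  "subset_size_k": 2},
    "ARC-Challenge": {"retrieved_docs": 10, "num_drafts_m": 5,  "subset_size_k": 2},
    # MuSiQue uses larger subsets (k = 6) because its multi-hop
    # questions require reasoning over more documents at once.
    "MuSiQue":       {"retrieved_docs": 15, "num_drafts_m": 10, "subset_size_k": 6},
}

def sanity_check(cfg):
    """A draft's subset can never exceed the retrieved document pool."""
    return all(v["subset_size_k"] <= v["retrieved_docs"] for v in cfg.values())
```

Note that for MuSiQue the m drafts jointly cover up to m * k = 60 document slots against only 15 retrieved documents, so subsets necessarily overlap across drafts.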