Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Authors: Zilong (Ryan) Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, Chen-Yu Lee, Tomas Pfister

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that SPECULATIVE RAG achieves state-of-the-art performance with reduced latency on the TriviaQA, MuSiQue, PopQA, PubHealth, and ARC-Challenge benchmarks. It notably improves accuracy by up to 12.97% while reducing latency by 50.83% compared with conventional RAG systems on PubHealth.
Researcher Affiliation | Collaboration | 1) University of California, San Diego; 2) Google Cloud AI Research; 3) Google DeepMind; 4) Google Cloud AI
Pseudocode | Yes | Algorithm 1: SPECULATIVE RAG
Open Source Code | No | The paper mentions using the Hugging Face Transformers library and DeepSpeed, but it provides no explicit statement or link indicating that the authors' own implementation of the described method is open-sourced.
Open Datasets | Yes | We evaluate our proposed SPECULATIVE RAG on five public retrieval-augmented generation benchmarks: TriviaQA (unfiltered) (Joshi et al., 2017), MuSiQue (Trivedi et al., 2022), PopQA (Mallen et al., 2023), PubHealth (Zhang et al., 2023b), and ARC-Challenge (Clark et al., 2018).
Dataset Splits | No | The paper gives specific retrieval and drafting details (e.g., "retrieve top 10 documents and generate 5 drafts per query") and, for HotpotQA, states: "We randomly sample 500 examples from the validation set of HotpotQA as the test set in our experiment." However, it does not provide complete training/validation/test splits (e.g., percentages or exact counts) for all benchmarks, which would be needed to reproduce the full set of experiments.
Hardware Specification | Yes | All experiments are conducted on a Linux server equipped with 16 NVIDIA A100-SXM4-40GB GPUs.
Software Dependencies | No | The paper states: "We implement the training scripts using the Transformers library from Hugging Face (Wolf et al., 2019). We employ DeepSpeed (Rasley et al., 2020) to accelerate the training process." While it names these libraries, it does not provide version numbers for them.
Experiment Setup | Yes | In our experiments, we utilize Mistral-7B (v0.1) as our base LM for the RAG drafter. For the RAG verifier, we employ either Mistral-7B (v0.1) or Mixtral-8x7B (v0.1) without any fine-tuning, denoted as MVerifier-7B or MVerifier-8x7B. [...] Inference is conducted using the vLLM framework (Kwon et al., 2023) with greedy decoding (temperature = 0). [...] On TriviaQA, PopQA, PubHealth, and ARC-Challenge, we retrieve the top 10 documents and generate 5 drafts per query (m = 5), each draft based on a subset of 2 documents (k = 2). For MuSiQue, we retrieve the top 15 documents and generate 10 drafts per query (m = 10), each using a subset of 6 documents, due to the more complex reasoning required.
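The draft-then-verify loop described above (Algorithm 1) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `draft_fn` stands in for the small fine-tuned RAG drafter and `verify_fn` for the larger non-fine-tuned verifier, and both signatures are hypothetical.

```python
import random

def speculative_rag(query, documents, draft_fn, verify_fn, m=5, k=2, seed=0):
    """Sketch of the SPECULATIVE RAG draft-then-verify loop.

    Hypothetical signatures assumed here:
      draft_fn(query, docs)               -> (answer_draft, rationale)
      verify_fn(query, answer, rationale) -> float score
    """
    rng = random.Random(seed)
    drafts = []
    for _ in range(m):
        # Each draft is conditioned on its own k-document subset of the
        # retrieved pool, so the m drafts can be produced in parallel.
        subset = rng.sample(documents, k)
        drafts.append(draft_fn(query, subset))
    # The verifier scores every (answer, rationale) pair; the answer of
    # the highest-scoring draft is returned as the final response.
    best_score, best_answer = max(
        (verify_fn(query, ans, rat), ans) for ans, rat in drafts
    )
    return best_answer
```

Because each draft sees only k documents rather than the full retrieved set, the drafter's context stays short, which is where the reported latency reduction comes from.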
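The per-benchmark retrieval and drafting hyperparameters quoted in the setup row can be collected into a small configuration table. The dictionary and key names below are hypothetical; only the numbers come from the paper.

```python
# Hypothetical config mirroring the reported per-benchmark settings:
# retrieve N documents per query, generate m drafts, each drafted
# from a k-document subset of the retrieved pool.
DRAFTING_CONFIG = {
    "TriviaQA":      {"retrieved_docs": 10, "num_drafts_m": 5,  "subset_size_k": 2},
    "PopQA":         {"retrieved_docs": 10, "num_drafts_m": 5,  "subset_size_k": 2},
    "PubHealth":     {"retrieved_docs": 10, "num_drafts_m": 5,  "subset_size_k": 2},
    "ARC-Challenge": {"retrieved_docs": 10, "num_drafts_m": 5,  "subset_size_k": 2},
    # MuSiQue uses larger subsets (k = 6) because its multi-hop
    # questions require reasoning over more documents at once.
    "MuSiQue":       {"retrieved_docs": 15, "num_drafts_m": 10, "subset_size_k": 6},
}

def sanity_check(cfg):
    """A draft's subset can never exceed the retrieved document pool."""
    return all(v["subset_size_k"] <= v["retrieved_docs"] for v in cfg.values())
```

Note that for MuSiQue the m drafts jointly cover up to m * k = 60 document slots against only 15 retrieved documents, so subsets necessarily overlap across drafts.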