Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection

Authors: Yun Zhu, Jia-Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo, Lei Meng, Bang Liu, Jindong Chen

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Evaluation results on four datasets show that Sparse RAG can strike an optimal balance between generation quality and computational efficiency, demonstrating its generalizability across tasks. |
| Researcher Affiliation | Collaboration | 1 Google DeepMind, 2 Google, 3 University of California, Los Angeles, 4 Université de Montréal & Mila |
| Pseudocode | No | The paper describes methods and processes through narrative text and diagrams (Figure 1), but does not include any distinct pseudocode blocks or algorithms formatted as structured steps. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | "We evaluate on four datasets: PopQA (Mallen et al., 2023), QMSum (Min et al., 2023), TriviaQA (Joshi et al., 2017), and HotpotQA (Yang et al., 2018)." |
| Dataset Splits | Yes | PopQA: the dataset is split into training, validation, and test sets with an 8:1:1 ratio. QMSum: 250 training examples (one per meeting), 70 validation examples, and 77 test examples. TriviaQA: 8k randomly selected training examples and 500 examples each for validation and test. HotpotQA: 6k sampled training examples, 600 validation examples, and 600 test examples. |
| Hardware Specification | Yes | "During training, we use 64 Tensor Processing Unit (TPU) v3 chips for PopQA and 128 chips for the other datasets. Evaluation of Sparse RAG was conducted on a Samsung S21 Ultra, utilizing the device's CPU to assess real-world performance on a relatively mid-tier smartphone compared to the latest flagship models." |
| Software Dependencies | No | The paper mentions software components such as Gemini (Team et al., 2023), LoRA tuning (Hu et al., 2022), and the Adafactor optimizer (Shazeer & Stern, 2018), but does not provide specific version numbers for underlying libraries or programming environments (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | "In all our experiments, we apply LoRA in self-attention and use the default rank of 4. By default, we use the XXS size of Gemini, which can run on-device. The batch size is 64. We use the Adafactor optimizer (Shazeer & Stern, 2018) with a learning rate of 0.003. The training dropout rate is 0.05. During inference, the temperature is set to 0.5. Unless specifically noted, we use sampling decoding with a sample number of 1 for our experiments." |
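The 8:1:1 train/validation/test split reported for PopQA can be sketched as follows. This is a minimal illustration, not the authors' released code (none is available); the function name, seed, and use of a shuffled random split are assumptions.

```python
import random

def split_8_1_1(examples, seed=0):
    """Shuffle a list of examples and split it into train/validation/test
    sets with an 8:1:1 ratio, as reported for PopQA. Exact shuffling and
    seeding details are assumptions, since no code was released."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Example: 1000 examples -> 800 / 100 / 100
train, val, test = split_8_1_1(range(1000))
```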
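The reported experiment setup can be collected into a single configuration sketch. The field names below are hypothetical (the paper's training code is not released); only the values are taken from the quoted setup.

```python
# Hypothetical configuration mirroring the reported hyperparameters.
# Field names are assumptions; values come from the paper's setup section.
TRAIN_CONFIG = {
    "lora_rank": 4,            # LoRA applied in self-attention, default rank 4
    "model_size": "XXS",       # Gemini XXS, small enough to run on-device
    "batch_size": 64,
    "optimizer": "adafactor",  # Shazeer & Stern (2018)
    "learning_rate": 0.003,
    "dropout": 0.05,
}

INFERENCE_CONFIG = {
    "temperature": 0.5,
    "decoding": "sampling",    # sampling decoding unless otherwise noted
    "num_samples": 1,
}
```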