Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection
Authors: Yun Zhu, Jia-Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo, Lei Meng, Bang Liu, Jindong Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation results on four datasets show that Sparse RAG can be used to strike an optimal balance between generation quality and computational efficiency, demonstrating its generalizability across tasks. |
| Researcher Affiliation | Collaboration | 1Google DeepMind 2Google 3University of California, Los Angeles 4Université de Montréal & Mila |
| Pseudocode | No | The paper describes methods and processes through narrative text and diagrams (Figure 1), but does not include any distinct pseudocode blocks or algorithms formatted as structured steps. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We evaluate on four datasets: PopQA (Mallen et al., 2023), QMSum (Min et al., 2023), TriviaQA (Joshi et al., 2017), and HotpotQA (Yang et al., 2018). |
| Dataset Splits | Yes | PopQA: We split the dataset into training, validation and test sets with 8:1:1 ratio. QMSum: We use 250 training examples (one per meeting), 70 validation examples and 77 test examples. TriviaQA: We randomly selected 8k training examples and 500 validation and test examples each. HotpotQA: We sample 6k training examples and 600 validation and 600 test examples. |
| Hardware Specification | Yes | During training, we use 64 Tensor Processing Unit (TPU) V3 chips for PopQA and 128 chips for the other datasets. Evaluation of Sparse RAG was conducted on a Samsung S21 Ultra, utilizing the device's CPU to assess real-world performance on a relatively mid-tier smartphone compared to the latest flagship models. |
| Software Dependencies | No | The paper mentions software components like "Gemini (Team et al., 2023)", "LoRA tuning (Hu et al., 2022)", and "Adafactor optimizer (Shazeer & Stern, 2018)" but does not provide specific version numbers for underlying libraries or programming environments (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | In all our experiments, we apply LoRA in self-attention and use the default rank as 4. By default, we use the XXS size of Gemini which can run on-device. The batch size is 64. We use the Adafactor optimizer (Shazeer & Stern, 2018) with a learning rate of 0.003. The training dropout rate is 0.05. During inference, the temperature is set to 0.5. Unless specifically noted, we use sampling decoding with sample number 1 for our experiments. |
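The reported experiment setup can be collected into a minimal configuration sketch for anyone attempting a reproduction. The class and field names below are illustrative (not from the paper); only the values come from the quoted setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparseRAGTrainConfig:
    """Hyperparameters quoted in the paper's experiment setup.

    Field names are our own; values are as reported.
    """
    lora_rank: int = 4            # LoRA applied in self-attention, default rank 4
    model_size: str = "XXS"       # Gemini XXS, small enough to run on-device
    batch_size: int = 64
    optimizer: str = "Adafactor"  # Shazeer & Stern, 2018
    learning_rate: float = 0.003
    dropout: float = 0.05
    temperature: float = 0.5      # inference-time sampling temperature
    num_samples: int = 1          # sampling decoding with a single sample


cfg = SparseRAGTrainConfig()
print(cfg.optimizer, cfg.learning_rate)  # Adafactor 0.003
```

A frozen dataclass keeps the reproduction settings immutable and hashable, so the same config object can be logged alongside results.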