Inference Scaling for Long-Context Retrieval Augmented Generation
Authors: Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on benchmark QA datasets, we demonstrate an almost linear relationship between RAG performance and the scale of effective context length by combining both RAG strategies, as shown in Figure 1 (right). |
| Researcher Affiliation | Collaboration | Zhenrui Yue 1, Honglei Zhuang 2, Aijun Bai 2, Kai Hui 2, Rolf Jagerman 2, Hansi Zeng 3, Zhen Qin 2, Dong Wang 1, Xuanhui Wang 2, Michael Bendersky 2. 1University of Illinois Urbana-Champaign, 2Google DeepMind, 3UMass Amherst. EMAIL, EMAIL |
| Pseudocode | No | The paper describes its methods (DRAG, IterDRAG) in text and shows example prompts in Figure 16, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its own source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We evaluate the performance of Gemini 1.5 Flash with context length window up to 1M tokens on knowledge-intensive question answering, including the multi-hop datasets Bamboogle, HotpotQA, MuSiQue and 2WikiMultiHopQA (Press et al., 2023; Yang et al., 2018; Trivedi et al., 2022; Ho et al., 2020). The paper also uses Wikipedia passages from the KILT benchmark as the document source (Petroni et al., 2020). |
| Dataset Splits | No | To manage the computational costs of extensive experiments, we follow Wu et al. (2024); Gutiérrez et al. (2024) and sample 1.2k examples from each dataset for evaluation. The paper mentions sampling for evaluation but does not provide specific train/test/validation splits. |
| Hardware Specification | No | For generation, we utilize Gemini 1.5 Flash for more efficient experiments. The paper mentions the language model used but does not provide specific hardware details (e.g., GPU/CPU models, memory) on which the experiments were conducted. |
| Software Dependencies | No | The paper mentions using specific models like 'Gecko-1B (en) embedding model' and 'Gemini 1.5 Flash', but does not list any specific software dependencies with version numbers used for implementing their methodology. |
| Experiment Setup | Yes | For the parameter space Θ of DRAG, we consider the number of documents k ∈ {0, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}, and the number of in-context examples m ∈ {0, 2^0, 2^1, ..., 2^8}. For IterDRAG, we further experiment with the number of iterations n up to 5. We allow up to five iterations, after which the model is forced to produce the final answer. Each document is then truncated on the right side to a maximum of 1024 tokens. |
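The DRAG parameter grid quoted above can be sketched as a simple enumeration. This is a minimal illustration of the reported configuration space only; the variable names (`k_values`, `m_values`) are ours, not from the paper.

```python
from itertools import product

# Number of retrieved documents k, as reported in the paper
k_values = [0, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]

# Number of in-context examples m: 0 plus powers of two 2^0 .. 2^8
m_values = [0] + [2 ** i for i in range(9)]

# Cartesian product gives every (k, m) configuration swept for DRAG
drag_grid = list(product(k_values, m_values))
print(len(drag_grid))  # 11 * 10 = 110 configurations
```

IterDRAG would additionally sweep the iteration count n (up to 5) over this grid, multiplying the number of configurations accordingly.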