Inference Scaling for Long-Context Retrieval Augmented Generation

Authors: Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through extensive experiments on benchmark QA datasets, we demonstrate an almost linear relationship between RAG performance and the scale of effective context length by combining both RAG strategies, as shown in Figure 1 (right).
Researcher Affiliation Collaboration Zhenrui Yue1, Honglei Zhuang2, Aijun Bai2, Kai Hui2, Rolf Jagerman2, Hansi Zeng3, Zhen Qin2, Dong Wang1, Xuanhui Wang2, Michael Bendersky2. 1University of Illinois Urbana-Champaign, 2Google DeepMind, 3UMass Amherst. EMAIL, EMAIL
Pseudocode No The paper describes its methods (DRAG, IterDRAG) in text and shows example prompts in Figure 16, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code No The paper does not contain an explicit statement about releasing its own source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets Yes We evaluate the performance of Gemini 1.5 Flash with a context window of up to 1M tokens on knowledge-intensive question answering, including the multi-hop datasets Bamboogle, HotpotQA, MuSiQue and 2WikiMultiHopQA (Press et al., 2023; Yang et al., 2018; Trivedi et al., 2022; Ho et al., 2020). We also use Wikipedia passages from the KILT benchmark as the document source (Petroni et al., 2020).
Dataset Splits No To manage the computational costs of extensive experiments, we follow Wu et al. (2024); Gutiérrez et al. (2024) and sample 1.2k examples from each dataset for evaluation. The paper mentions sampling for evaluation but does not provide specific train/test/validation splits.
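The per-dataset subsampling described above can be sketched as follows. This is a minimal illustration, not the authors' code (which is not released); the sampling seed and exact procedure are assumptions, since the paper does not specify them.

```python
import random

def sample_eval_set(examples, n=1200, seed=0):
    """Draw a fixed-size evaluation subset from one dataset.

    The paper samples 1.2k examples per dataset to limit compute;
    the seed here is a placeholder, not specified in the paper.
    """
    rng = random.Random(seed)
    return rng.sample(examples, min(n, len(examples)))

# Toy usage: subsample a mock dataset of 5,000 QA examples.
mock_dataset = [{"id": i} for i in range(5000)]
subset = sample_eval_set(mock_dataset)
print(len(subset))  # 1200
```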
Hardware Specification No For generation, we utilize Gemini 1.5 Flash for more efficient experiments. The paper mentions the language model used but does not provide specific hardware details (e.g., GPU/CPU models, memory) on which the experiments were conducted.
Software Dependencies No The paper mentions using specific models like 'Gecko-1B (en) embedding model' and 'Gemini 1.5 Flash', but does not list any specific software dependencies with version numbers used for implementing their methodology.
Experiment Setup Yes For the parameter space Θ of DRAG, we consider the number of documents k ∈ {0, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}, and the number of in-context examples m ranging over 0, 2^0, 2^1, ..., 2^8. For IterDRAG, we further experiment with the number of iterations n up to 5. We allow up to five iterations, after which the model is forced to produce the final answer. Each document is then truncated on the right side to a maximum of 1024 tokens.
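The parameter space above can be enumerated as a simple grid. The sketch below is a hypothetical reconstruction under the stated values (the function and variable names are assumptions, as no source code accompanies the paper): k ranges over the listed document counts, m over 0 plus the powers of two from 2^0 to 2^8, and n over 1 to 5 iterations for IterDRAG.

```python
from itertools import product

# Values taken from the experiment setup described in the paper.
NUM_DOCS = [0, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]  # k: retrieved documents
NUM_SHOTS = [0] + [2 ** i for i in range(9)]               # m: 0, 2^0, ..., 2^8
MAX_ITERS = 5                                              # n: iteration cap for IterDRAG

def drag_configs():
    """Enumerate all (k, m) settings in the DRAG parameter space."""
    return list(product(NUM_DOCS, NUM_SHOTS))

def iterdrag_configs():
    """Enumerate (k, m, n) settings for IterDRAG, with up to 5 iterations."""
    return [(k, m, n) for (k, m) in drag_configs()
            for n in range(1, MAX_ITERS + 1)]

print(len(drag_configs()))     # 11 document counts x 10 shot counts = 110
print(len(iterdrag_configs())) # 110 x 5 iteration settings = 550
```

Enumerating the grid this way makes the scale of the sweep explicit: 110 DRAG settings, expanded fivefold by the iteration budget for IterDRAG.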