Inference Scaling for Long-Context Retrieval Augmented Generation

Authors: Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through extensive experiments on benchmark QA datasets, we demonstrate an almost linear relationship between RAG performance and the scale of effective context length by combining both RAG strategies, as shown in Figure 1 (right).
Researcher Affiliation Collaboration Zhenrui Yue1, Honglei Zhuang2, Aijun Bai2, Kai Hui2, Rolf Jagerman2, Hansi Zeng3, Zhen Qin2, Dong Wang1, Xuanhui Wang2, Michael Bendersky2. 1University of Illinois Urbana-Champaign, 2Google DeepMind, 3UMass Amherst. EMAIL, EMAIL
Pseudocode No The paper describes its methods (DRAG, IterDRAG) in text and shows example prompts in Figure 16, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code No The paper does not contain an explicit statement about releasing its own source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets Yes We evaluate the performance of Gemini 1.5 Flash with a context window of up to 1M tokens on knowledge-intensive question answering, including the multi-hop datasets Bamboogle, HotpotQA, MuSiQue and 2WikiMultiHopQA (Press et al., 2023; Yang et al., 2018; Trivedi et al., 2022; Ho et al., 2020). We also use Wikipedia passages from the KILT benchmark as the document source (Petroni et al., 2020).
Dataset Splits No To manage the computational costs of extensive experiments, we follow Wu et al. (2024); Gutiérrez et al. (2024) and sample 1.2k examples from each dataset for evaluation. The paper mentions sampling for evaluation but does not provide specific train/test/validation splits.
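The per-dataset subsampling described above can be sketched as follows. This is a minimal illustration, not the authors' code (which is not released); the sampling seed and exact procedure are assumptions, since the paper does not specify them.

```python
import random

def sample_eval_set(examples, n=1200, seed=0):
    """Draw a fixed-size evaluation subset from one dataset.

    The paper samples 1.2k examples per dataset to limit compute;
    the seed here is a placeholder, not specified in the paper.
    """
    rng = random.Random(seed)
    return rng.sample(examples, min(n, len(examples)))

# Toy usage: subsample a mock dataset of 5,000 QA examples.
mock_dataset = [{"id": i} for i in range(5000)]
subset = sample_eval_set(mock_dataset)
print(len(subset))  # 1200
```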
Hardware Specification No For generation, we utilize Gemini 1.5 Flash for more efficient experiments. The paper mentions the language model used but does not provide specific hardware details (e.g., GPU/CPU models, memory) on which the experiments were conducted.
Software Dependencies No The paper mentions using specific models like 'Gecko-1B (en) embedding model' and 'Gemini 1.5 Flash', but does not list any specific software dependencies with version numbers used for implementing their methodology.
Experiment Setup Yes For the parameter space Θ of DRAG, we consider the number of documents k ∈ {0, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}, and the number of in-context examples m ranging over 0, 2^0, 2^1, ..., 2^8. For IterDRAG, we further experiment with the number of iterations n up to 5. We allow up to five iterations, after which the model is forced to produce the final answer. Each document is then truncated on the right side to a maximum of 1024 tokens.
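The parameter space above can be enumerated as a simple grid. The sketch below is a hypothetical reconstruction under the stated values (the function and variable names are assumptions, as no source code accompanies the paper): k ranges over the listed document counts, m over 0 plus the powers of two from 2^0 to 2^8, and n over 1 to 5 iterations for IterDRAG.

```python
from itertools import product

# Values taken from the experiment setup described in the paper.
NUM_DOCS = [0, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]  # k: retrieved documents
NUM_SHOTS = [0] + [2 ** i for i in range(9)]               # m: 0, 2^0, ..., 2^8
MAX_ITERS = 5                                              # n: iteration cap for IterDRAG

def drag_configs():
    """Enumerate all (k, m) settings in the DRAG parameter space."""
    return list(product(NUM_DOCS, NUM_SHOTS))

def iterdrag_configs():
    """Enumerate (k, m, n) settings for IterDRAG, with up to 5 iterations."""
    return [(k, m, n) for (k, m) in drag_configs()
            for n in range(1, MAX_ITERS + 1)]

print(len(drag_configs()))     # 11 document counts x 10 shot counts = 110
print(len(iterdrag_configs())) # 110 x 5 iteration settings = 550
```

Enumerating the grid this way makes the scale of the sweep explicit: 110 DRAG settings, expanded fivefold by the iteration budget for IterDRAG.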