RAGGED: Towards Informed Design of Scalable and Stable RAG Systems
Authors: Jennifer Hsia, Afreen Shaikh, Zora Zhiruo Wang, Graham Neubig
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through large-scale experiments on open-domain, multi-hop, and specialized-domain datasets, we show that retrievers, rerankers, and prompts influence performance but do not fundamentally alter these reader-driven trends. |
| Researcher Affiliation | Academia | 1Machine Learning Department, Carnegie Mellon University, Pittsburgh, USA 2The Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA. Correspondence to: Jennifer Hsia <EMAIL>. |
| Pseudocode | No | The paper defines RAG Stability Score (RSS) and RAG Scalability Coefficient (RSC) using mathematical formulas and describes methodologies in prose, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data for the RAGGED framework are available at https://github.com/neulab/ragged |
| Open Datasets | Yes | Natural Questions (NQ) (Kwiatkowski et al., 2019): Wikipedia-based, single-hop QA with real user queries. HotpotQA (Yang et al., 2018): Wikipedia-based, multi-hop QA requiring reasoning over multiple passages. BioASQ (Task 11B) (Krithara et al., 2023): PubMed-based biomedical QA for specialized domains. For NQ and HotpotQA datasets in the open domain, we use the Wikipedia paragraphs corpus provided by the KILT benchmark (Petroni et al., 2021). For BioASQ, we use the PubMed Annual Baseline Repository for 2023 (National Library of Medicine, 2023). |
| Dataset Splits | No | For NQ and HotpotQA, we use KILT’s dev set versions of the datasets, allowed under the MIT License (Petroni et al., 2021). For BioASQ (Krithara et al., 2023), we use Task 11B, distributed under the CC BY 2.5 license. While the paper specifies the dataset versions used, it does not explicitly provide training, validation, or test splits (e.g., percentages, sample counts, or predefined split names) beyond using the development sets of established benchmarks. |
| Hardware Specification | Yes | The experiments were conducted on NVIDIA A6000 GPUs, supported by an environment with 60GB RAM. |
| Software Dependencies | No | When using FLAN-T5 and FLAN-UL2 readers, we use T5Tokenizer to truncate sequences to up to 2k tokens; when using LLaMA models, we apply the LlamaTokenizer and truncate sequences to 4k tokens for LLaMA-2 and 8k for LLaMA-3. The paper names the tokenizers used but does not provide version numbers for these or for other crucial software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For our reader decoding strategy, we used greedy decoding with a beam size of 1 and a temperature of 1, selecting the most probable next word at each step without sampling. The output generation was configured to produce responses of up to 10 tokens. For all experiments, we use the following prompt: Instruction: Give simple short one phrase answers for the questions based on the context Context: [passage1, passage2, …, passagek] Question: [the question of the current example] Answer: |
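The per-reader truncation limits quoted in the Software Dependencies row can be sketched as follows. The 2k/4k/8k limits come from the paper; the exact power-of-two values (2048/4096/8192), the dictionary keys, and the helper name are illustrative assumptions.

```python
# Context-window limits per reader family, as reported in the paper.
# The paper says "2k", "4k", and "8k" tokens; mapping those to
# 2048/4096/8192 is an assumption.
CONTEXT_LIMITS = {
    "flan-t5": 2048,   # truncated with T5Tokenizer
    "flan-ul2": 2048,  # truncated with T5Tokenizer
    "llama-2": 4096,   # truncated with LlamaTokenizer
    "llama-3": 8192,   # truncated with LlamaTokenizer
}

def truncate_ids(token_ids: list[int], model_family: str) -> list[int]:
    """Keep at most the reader's context-window worth of token ids."""
    return token_ids[: CONTEXT_LIMITS[model_family]]
```

With HuggingFace tokenizers, the same effect is typically achieved at encoding time via `tokenizer(text, truncation=True, max_length=CONTEXT_LIMITS[family])` rather than by slicing ids afterwards.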
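The decoding settings and prompt template in the Experiment Setup row can be sketched as a HuggingFace-style configuration. The numeric settings (greedy decoding, beam size 1, temperature 1, up to 10 output tokens) and the prompt wording are from the paper; the keyword-argument names and how the bracketed placeholders are filled (comma-joined passages) are assumptions.

```python
# Greedy decoding settings reported in the paper, expressed as
# HuggingFace-style `generate()` keyword arguments (kwarg names assumed).
GENERATION_KWARGS = {
    "do_sample": False,    # greedy: take the most probable next token
    "num_beams": 1,        # beam size of 1
    "temperature": 1.0,
    "max_new_tokens": 10,  # short one-phrase answers
}

def build_prompt(question: str, passages: list[str]) -> str:
    """Fill the paper's prompt template with retrieved passages and a question."""
    # How the passage placeholders are joined is an assumption.
    context = ", ".join(passages)
    return (
        "Instruction: Give simple short one phrase answers for the questions "
        "based on the context\n"
        f"Context: [{context}]\n"
        f"Question: [{question}]\n"
        "Answer:"
    )
```

In a full pipeline, the prompt would be tokenized (and truncated to the reader's context limit) and passed to the reader as, e.g., `model.generate(**inputs, **GENERATION_KWARGS)`.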