RAGGED: Towards Informed Design of Scalable and Stable RAG Systems
Authors: Jennifer Hsia, Afreen Shaikh, Zora Zhiruo Wang, Graham Neubig
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through large-scale experiments on open-domain, multi-hop, and specialized-domain datasets, we show that retrievers, rerankers, and prompts influence performance but do not fundamentally alter these reader-driven trends. |
| Researcher Affiliation | Academia | 1Machine Learning Department, Carnegie Mellon University, Pittsburgh, USA 2The Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA. Correspondence to: Jennifer Hsia <EMAIL>. |
| Pseudocode | No | The paper defines RAG Stability Score (RSS) and RAG Scalability Coefficient (RSC) using mathematical formulas and describes methodologies in prose, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data for the RAGGED framework are available at https://github.com/neulab/ragged |
| Open Datasets | Yes | Natural Questions (NQ) (Kwiatkowski et al., 2019): Wikipedia-based, single-hop QA with real user queries. HotpotQA (Yang et al., 2018): Wikipedia-based, multi-hop QA requiring reasoning over multiple passages. BioASQ (Task 11B) (Krithara et al., 2023): PubMed-based biomedical QA for specialized domains. For NQ and HotpotQA datasets in the open domain, we use the Wikipedia paragraphs corpus provided by the KILT benchmark (Petroni et al., 2021). For BioASQ, we use the PubMed Annual Baseline Repository for 2023 (National Library of Medicine, 2023). |
| Dataset Splits | No | For NQ and HotpotQA, we use KILT’s dev set versions of the datasets, allowed under the MIT License (Petroni et al., 2021). For BioASQ (Krithara et al., 2023), we use Task 11B, distributed under the CC BY 2.5 license. While the paper specifies the dataset versions used, it does not explicitly provide training, validation, or test splits (e.g., percentages, sample counts, or predefined split names) beyond using the development sets of established benchmarks. |
| Hardware Specification | Yes | The experiments were conducted on NVIDIA A6000 GPUs, supported by an environment with 60GB RAM. |
| Software Dependencies | No | When using FLAN-T5 and FLAN-UL2 readers, we use T5Tokenizer to truncate sequences to up to 2k tokens; when using LLaMA models, we apply the LlamaTokenizer and truncate sequences to 4k tokens for LLaMA-2 and 8k for LLaMA-3. The paper names the tokenizers used but does not provide version numbers for these or for other crucial software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For our reader decoding strategy, we used greedy decoding with a beam size of 1 and a temperature of 1, selecting the most probable next word at each step without sampling. The output generation was configured to produce responses of up to 10 tokens. For all experiments, we use the following prompt: Instruction: Give simple short one phrase answers for the questions based on the context Context: [passage1, passage2, …, passagek] Question: [the question of the current example] Answer: |
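The per-reader truncation limits quoted in the Software Dependencies row can be sketched as follows. The 2k/4k/8k limits come from the paper; the exact power-of-two values (2048/4096/8192), the dictionary keys, and the helper name are illustrative assumptions.

```python
# Context-window limits per reader family, as reported in the paper.
# The paper says "2k", "4k", and "8k" tokens; mapping those to
# 2048/4096/8192 is an assumption.
CONTEXT_LIMITS = {
    "flan-t5": 2048,   # truncated with T5Tokenizer
    "flan-ul2": 2048,  # truncated with T5Tokenizer
    "llama-2": 4096,   # truncated with LlamaTokenizer
    "llama-3": 8192,   # truncated with LlamaTokenizer
}

def truncate_ids(token_ids: list[int], model_family: str) -> list[int]:
    """Keep at most the reader's context-window worth of token ids."""
    return token_ids[: CONTEXT_LIMITS[model_family]]
```

With HuggingFace tokenizers, the same effect is typically achieved at encoding time via `tokenizer(text, truncation=True, max_length=CONTEXT_LIMITS[family])` rather than by slicing ids afterwards.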
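The decoding settings and prompt template in the Experiment Setup row can be sketched as a HuggingFace-style configuration. The numeric settings (greedy decoding, beam size 1, temperature 1, up to 10 output tokens) and the prompt wording are from the paper; the keyword-argument names and how the bracketed placeholders are filled (comma-joined passages) are assumptions.

```python
# Greedy decoding settings reported in the paper, expressed as
# HuggingFace-style `generate()` keyword arguments (kwarg names assumed).
GENERATION_KWARGS = {
    "do_sample": False,    # greedy: take the most probable next token
    "num_beams": 1,        # beam size of 1
    "temperature": 1.0,
    "max_new_tokens": 10,  # short one-phrase answers
}

def build_prompt(question: str, passages: list[str]) -> str:
    """Fill the paper's prompt template with retrieved passages and a question."""
    # How the passage placeholders are joined is an assumption.
    context = ", ".join(passages)
    return (
        "Instruction: Give simple short one phrase answers for the questions "
        "based on the context\n"
        f"Context: [{context}]\n"
        f"Question: [{question}]\n"
        "Answer:"
    )
```

In a full pipeline, the prompt would be tokenized (and truncated to the reader's context limit) and passed to the reader as, e.g., `model.generate(**inputs, **GENERATION_KWARGS)`.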