SiReRAG: Indexing Similar and Related Information for Multihop Reasoning
Authors: Nan Zhang, Prafulla Kumar Choubey, Alexander Fabbri, Gabriel Bernadett-Shapiro, Rui Zhang, Prasenjit Mitra, Caiming Xiong, Chien-Sheng Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that SIRERAG consistently outperforms state-of-the-art indexing methods on three multihop datasets (MuSiQue, 2WikiMultiHopQA, and HotpotQA), with an average 1.9% improvement in F1 scores. As a reasonably efficient solution, SIRERAG enhances existing reranking methods significantly, with up to 7.8% improvement in average F1 scores. |
| Researcher Affiliation | Collaboration | Nan Zhang, Prafulla Kumar Choubey, Alexander Fabbri, Gabriel Bernadett-Shapiro, Rui Zhang, Prasenjit Mitra, Caiming Xiong, Chien-Sheng Wu (The Pennsylvania State University; Salesforce AI Research) |
| Pseudocode | No | The paper describes methodologies in detail (e.g., Section 4 METHODOLOGY) and illustrates them with diagrams (Figure 2), but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/SalesforceAIResearch/SiReRAG. |
| Open Datasets | Yes | To demonstrate the effectiveness of SIRERAG, we select three representative multihop QA datasets: MuSiQue (Trivedi et al., 2022), 2WikiMultiHopQA (Ho et al., 2020), and HotpotQA (Yang et al., 2018). |
| Dataset Splits | Yes | Using the same corpus as HippoRAG (Gutiérrez et al., 2024), we obtain 1000 questions from each validation set of these three datasets. |
| Hardware Specification | No | The paper mentions using various LLMs (e.g., GPT-4o, GPT-3.5-Turbo, Meta-Llama-3-70B-Instruct, Mistral-7B-Instruct-v0.3) and embedding models (OpenAI text-embedding-3-small) but does not provide specific details about the hardware used to run their experiments or access these models (e.g., specific GPU/CPU models, memory configurations). |
| Software Dependencies | Yes | To generate final answer, we use GPT-4o and the same prompt ("answer this question in as fewer number of words as possible.") to answer queries for all methods, since we aim to control the instruction-following capabilities across all methods. We use either GPT-3.5-Turbo or GPT-4o as the choice of LLM if any methods involve LLM calls. We use OpenAI's text-embedding-3-small as the embedding model for all methods. Extracting Propositions and Entities from Documents: We define a proposition as "a factual statement describing important information (preferably about some entities) from a paragraph." We extract entities and propositions using the Distill-SynthKG pipeline (Choubey et al., 2024), adapting its SynthKG workflow. First, we rewrite chunks of 10K documents from the BAAI/IndustryCorpus to resolve entity references, using Meta-Llama-3-70B-Instruct (AI@Meta, 2024) with the rewriting prompt shown in Figure 6. Next, we prompt the same LLM to extract entities from these rewritten chunks (prompt is shown in Figure 7). After obtaining these entities, we again prompt the LLM to identify all relevant propositions and their associated entities (prompt is shown in Figure 8). We then consolidate the resulting propositions and entities to fine-tune Mistral-7B-Instruct-v0.3 (Jiang et al., 2023). |
| Experiment Setup | Yes | To generate final answer, we use GPT-4o and the same prompt ("answer this question in as fewer number of words as possible.") to answer queries for all methods, since we aim to control the instruction-following capabilities across all methods. We use either GPT-3.5-Turbo or GPT-4o as the choice of LLM if any methods involve LLM calls. We use OpenAI's text-embedding-3-small as the embedding model for all methods. During retrieval, we select top 20 candidates that match the provided query for all methods, because there is a large number of text chunks in our datasets and SIRERAG is expected to perform better when retrieving more due to the incorporation of proposition aggregates and their recursive summaries. The prompt we use to perform summarization on a cluster of nodes is "summarize the provided text, including as many key details as needed". This prompt is the same as RAPTOR. In Section 3, we use "identify the high-level topic of this paragraph as concise as possible" to extract the topic of each passage. As mentioned in Section 4.1, the prompt used for identifying a two-level hierarchy for all chunks is shown in Figure 5. As mentioned in Section 4.2, the LLM prompt for rewriting chunks is shown in Figure 6, the prompt for extracting named entities from rewritten chunks is shown in Figure 7, and the prompt for extracting propositions is shown in Figure 8. |
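The experiment setup retrieves the top 20 candidates per query using OpenAI's text-embedding-3-small embeddings. A minimal sketch of such cosine-similarity top-k selection over precomputed node embeddings; `top_k_nodes` is a hypothetical helper for illustration, not from the released code:

```python
import numpy as np

def top_k_nodes(query_emb: np.ndarray, node_embs: np.ndarray, k: int = 20):
    """Rank indexed nodes (chunks, propositions, recursive summaries)
    by cosine similarity to the query and keep the top-k candidates."""
    q = query_emb / np.linalg.norm(query_emb)
    n = node_embs / np.linalg.norm(node_embs, axis=1, keepdims=True)
    scores = n @ q                      # cosine similarity per node
    order = np.argsort(-scores)[:k]     # indices of the k best matches
    return order, scores[order]

# Toy example: node 0 matches the query exactly, node 2 nearly so.
order, scores = top_k_nodes(
    np.array([1.0, 0.0]),
    np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]),
    k=2,
)
```

In a SiReRAG-style index the `node_embs` matrix would mix similarity-tree and relatedness-tree nodes, so a single top-k pass retrieves across both views.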
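The reported gains are averaged F1 improvements on the three QA datasets. For context, a minimal sketch of the standard SQuAD-style token-level F1 used by multihop QA benchmarks; the lowercasing-only normalization here is simplified, not the paper's exact scorer:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Full evaluation scripts typically also strip punctuation and articles before comparing tokens.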