SiReRAG: Indexing Similar and Related Information for Multihop Reasoning
Authors: Nan Zhang, Prafulla Kumar Choubey, Alexander Fabbri, Gabriel Bernadett-Shapiro, Rui Zhang, Prasenjit Mitra, Caiming Xiong, Chien-Sheng Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that SIRERAG consistently outperforms state-of-the-art indexing methods on three multihop datasets (MuSiQue, 2WikiMultiHopQA, and HotpotQA), with an average 1.9% improvement in F1 scores. As a reasonably efficient solution, SIRERAG enhances existing reranking methods significantly, with up to 7.8% improvement in average F1 scores. |
| Researcher Affiliation | Collaboration | Nan Zhang, Prafulla Kumar Choubey, Alexander Fabbri, Gabriel Bernadett-Shapiro, Rui Zhang, Prasenjit Mitra, Caiming Xiong, Chien-Sheng Wu (The Pennsylvania State University; Salesforce AI Research) |
| Pseudocode | No | The paper describes methodologies in detail (e.g., Section 4 METHODOLOGY) and illustrates them with diagrams (Figure 2), but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/SalesforceAIResearch/SiReRAG. |
| Open Datasets | Yes | To demonstrate the effectiveness of SIRERAG, we select three representative multihop QA datasets: MuSiQue (Trivedi et al., 2022), 2WikiMultiHopQA (Ho et al., 2020), and HotpotQA (Yang et al., 2018). |
| Dataset Splits | Yes | Using the same corpus as HippoRAG (Gutiérrez et al., 2024), we obtain 1000 questions from each validation set of these three datasets. |
| Hardware Specification | No | The paper mentions using various LLMs (e.g., GPT-4o, GPT-3.5-Turbo, Meta-Llama-3-70B-Instruct, Mistral-7B-Instruct-v0.3) and embedding models (OpenAI text-embedding-3-small) but does not provide specific details about the hardware used to run their experiments or access these models (e.g., specific GPU/CPU models, memory configurations). |
| Software Dependencies | Yes | To generate final answer, we use GPT-4o and the same prompt ("answer this question in as fewer number of words as possible.") to answer queries for all methods, since we aim to control the instruction-following capabilities across all methods. We use either GPT-3.5-Turbo or GPT-4o as the choice of LLM if any methods involve LLM calls. We use OpenAI's text-embedding-3-small as the embedding model for all methods. Extracting Propositions and Entities from Documents: We define a proposition as "a factual statement describing important information (preferably about some entities) from a paragraph." We extract entities and propositions using the Distill-SynthKG pipeline (Choubey et al., 2024), adapting its SynthKG workflow. First, we rewrite chunks of 10K documents from the BAAI/IndustryCorpus to resolve entity references, using Meta-Llama-3-70B-Instruct (AI@Meta, 2024) with the rewriting prompt shown in Figure 6. Next, we prompt the same LLM to extract entities from these rewritten chunks (prompt is shown in Figure 7). After obtaining these entities, we again prompt the LLM to identify all relevant propositions and their associated entities (prompt is shown in Figure 8). We then consolidate the resulting propositions and entities to fine-tune Mistral-7B-Instruct-v0.3 (Jiang et al., 2023). |
| Experiment Setup | Yes | To generate final answer, we use GPT-4o and the same prompt ("answer this question in as fewer number of words as possible.") to answer queries for all methods, since we aim to control the instruction-following capabilities across all methods. We use either GPT-3.5-Turbo or GPT-4o as the choice of LLM if any methods involve LLM calls. We use OpenAI's text-embedding-3-small as the embedding model for all methods. During retrieval, we select top 20 candidates that match the provided query for all methods, because there is a large number of text chunks in our datasets and SIRERAG is expected to perform better when retrieving more due to the incorporation of proposition aggregates and their recursive summaries. The prompt we use to perform summarization on a cluster of nodes is "summarize the provided text, including as many key details as needed". This prompt is the same as RAPTOR. In Section 3, we use "identify the high-level topic of this paragraph as concise as possible" to extract the topic of each passage. As mentioned in Section 4.1, the prompt used for identifying a two-level hierarchy for all chunks is shown in Figure 5. As mentioned in Section 4.2, the LLM prompt for rewriting chunks is shown in Figure 6, the prompt for extracting named entities from rewritten chunks is shown in Figure 7, and the prompt for extracting propositions is shown in Figure 8. |
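The experiment setup retrieves the top 20 candidates per query using OpenAI's text-embedding-3-small embeddings. A minimal sketch of such cosine-similarity top-k selection over precomputed node embeddings; `top_k_nodes` is a hypothetical helper for illustration, not from the released code:

```python
import numpy as np

def top_k_nodes(query_emb: np.ndarray, node_embs: np.ndarray, k: int = 20):
    """Rank indexed nodes (chunks, propositions, recursive summaries)
    by cosine similarity to the query and keep the top-k candidates."""
    q = query_emb / np.linalg.norm(query_emb)
    n = node_embs / np.linalg.norm(node_embs, axis=1, keepdims=True)
    scores = n @ q                      # cosine similarity per node
    order = np.argsort(-scores)[:k]     # indices of the k best matches
    return order, scores[order]

# Toy example: node 0 matches the query exactly, node 2 nearly so.
order, scores = top_k_nodes(
    np.array([1.0, 0.0]),
    np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]),
    k=2,
)
```

In a SiReRAG-style index the `node_embs` matrix would mix similarity-tree and relatedness-tree nodes, so a single top-k pass retrieves across both views.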
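The reported gains are averaged F1 improvements on the three QA datasets. For context, a minimal sketch of the standard SQuAD-style token-level F1 used by multihop QA benchmarks; the lowercasing-only normalization here is simplified, not the paper's exact scorer:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Full evaluation scripts typically also strip punctuation and articles before comparing tokens.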