Provence: efficient and robust context pruning for retrieval-augmented generation

Authors: Nadezhda Chirkova, Thibault Formal, Vassilina Nikoulina, Stéphane Clinchant

ICLR 2025

Reproducibility assessment — each entry lists the variable, the assessed result, and the supporting LLM response:
Research Type — Experimental. "Our experimental results show that Provence enables context pruning with negligible to no drop in performance, in various domains and settings, at almost no cost in a standard RAG pipeline. We also conduct a deeper analysis alongside various ablations to provide insights into training context pruners for future work."
Researcher Affiliation — Industry. Nadezhda Chirkova, Thibault Formal, Vassilina Nikoulina, Stéphane Clinchant; Naver Labs Europe, Grenoble, France.
Pseudocode — No. The paper does not contain any clearly labeled pseudocode or algorithm blocks; it describes its methodology in narrative text and mathematical equations, with no structured pseudocode.
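Although the paper gives no pseudocode, its narrative describes a sentence-level pruning procedure: split the retrieved context into sentences, score each sentence for relevance to the question, and keep only sentences above a threshold. Below is a minimal sketch of that loop; the function names, the stand-in scorer, and the threshold value are assumptions for illustration, not the paper's API (Provence's actual scores come from a trained DeBERTa-based pruning head).

```python
def prune_context(question, sentences, score_fn, threshold=0.5):
    """Keep only sentences whose relevance score clears the threshold.

    score_fn(question, sentence) -> float in [0, 1]. In Provence this role
    is played by the trained pruning head; here it is supplied by the caller.
    """
    # Preserve original sentence order so the pruned context stays coherent.
    kept = [s for s in sentences if score_fn(question, s) >= threshold]
    return " ".join(kept)


# Toy usage with a keyword-overlap stand-in scorer (NOT the real model).
def toy_score(question, sentence):
    q = set(question.lower().strip("?").split())
    s = set(sentence.lower().split())
    return len(q & s) / max(len(q), 1)


context = [
    "Grenoble is a city in southeastern France.",
    "The local football club was founded in 1892.",
]
print(prune_context("Where is Grenoble?", context, toy_score))
```

The real system additionally batches sentence scoring inside a single cross-encoder pass, which is why pruning adds almost no cost on top of reranking.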
Open Source Code — Yes. "Our model and code are publicly available." Model: https://huggingface.co/naver/provence-reranker-debertav3-v1; code: https://github.com/naver/bergen/tree/main/scripts/provence
Open Datasets — Yes. "Our approach requires a set of training questions and a retrieval datastore. Specifically, we rely on the train set of the MS MARCO document ranking collection, which includes 370k queries (Nguyen et al., 2016). In ablations, we also consider the train set of Natural Questions, which contains 87k queries (Kwiatkowski et al., 2019). We test Provence on a diverse set of QA datasets. First, we consider commonly used datasets relying on a Wikipedia datastore: Natural Questions (Kwiatkowski et al., 2019), TyDi QA (Clark et al., 2020), PopQA (Mallen et al., 2023b) (all three include single-hop questions), and HotpotQA (Yang et al., 2018) (multi-hop questions). Second, we consider datasets with datastores from various domains: BioASQ (Nentidis et al., 2023) (biomedical questions with PubMed as a datastore), SyllabusQA (Fernandez et al., 2024) (questions about educational course logistics, with course syllabi as a datastore), and RGB (Chen et al., 2024b) (questions about news, with Google-searched news articles as contexts)."
Dataset Splits — Yes. Training uses the MS MARCO document ranking train set (370k queries; Nguyen et al., 2016), with the Natural Questions train set (87k queries; Kwiatkowski et al., 2019) used in ablations; evaluation covers the QA test sets listed under Open Datasets. "... We use a test set of 2.8k questions, distributed as a part of the KILT collection (https://huggingface.co/datasets/facebook/kilt_tasks); HotpotQA (Yang et al., 2018). We use a test set of 5.6k questions, distributed as a part of the KILT collection (https://huggingface.co/datasets/facebook/kilt_tasks); PopQA (Mallen et al., 2023b). We use a test set of 14k questions distributed by the dataset authors."
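For quick reference, the split sizes quoted above can be collected into a small summary structure. The dictionary below simply restates the numbers from the excerpt; the dataset owning the 2.8k test set is truncated out of the quote, so it is deliberately left unlabeled here rather than guessed.

```python
# Split sizes as quoted in the excerpt (counts are queries/questions).
TRAIN_SETS = {
    "MS MARCO (document ranking)": 370_000,  # main training questions
    "Natural Questions": 87_000,             # used in ablations only
}
TEST_SETS = {
    "KILT test set (name truncated in quote)": 2_800,
    "HotpotQA (KILT)": 5_600,
    "PopQA": 14_000,
}

total_test = sum(TEST_SETS.values())
print(f"{total_test:,} test questions across the quoted splits")
```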
Hardware Specification — Yes. "All runs were performed on a single Tesla V100-SXM2-32GB GPU with vLLM (Kwon et al., 2023)."
Software Dependencies — No. The paper mentions PyTorch (Paszke et al., 2019), Hugging Face transformers (Wolf et al., 2020), vLLM (Kwon et al., 2023), and the nltk.sent_tokenize function, but does not specify version numbers for these software components.
Experiment Setup — Yes. "After preliminary experiments, we set the learning rate to 3e-6, the batch size to 48, and train models for one epoch. For joint training, there is a slight trade-off between pruning and reranking. We set the reranking regularization coefficient λ to 0.05, chosen as the minimal value that does not substantially degrade reranking performance on the MS MARCO development set."
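The joint objective implied by the quoted setup combines a pruning loss with a reranking loss weighted by λ = 0.05. A minimal sketch of that weighted combination is below; the function name and the scalar stand-in losses are assumptions for illustration (the paper's actual losses operate on model outputs, with λ multiplying the reranking term as a regularizer).

```python
def joint_loss(pruning_loss: float, reranking_loss: float, lam: float = 0.05) -> float:
    """Combine the two training objectives.

    lam trades pruning quality against reranking quality; lam = 0.05 is the
    value the paper selects as the smallest coefficient that does not
    substantially hurt reranking on the MS MARCO dev set.
    """
    return pruning_loss + lam * reranking_loss


# Stand-in scalar losses, just to show the weighting.
print(joint_loss(1.0, 2.0))
```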