Multivariate Dense Retrieval: A Reproducibility Study under a Memory-limited Setup
Authors: Georgios Sidiropoulos, Samarth Bhargav, Panagiotis Eustratiadis, Evangelos Kanoulas
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we attempt to reproduce MRL under memory constraints (e.g., an academic computational budget). In particular, we focus on a memory-limited, single GPU setup. ... Additionally, we expand on the results from the original paper with a thorough ablation study which provides more insight into the impact of the framework's different components. While we confirm that MRL can have state-of-the-art performance, we could not reproduce the results reported in the original paper or uncover the reported trends against the baselines under a memory-limited setup that facilitates fair comparisons of MRL against its baselines. Our analysis offers insights as to why that is the case. Most importantly, our empirical results suggest that the variance definition in MRL does not consistently capture uncertainty. |
| Researcher Affiliation | Academia | Georgios Sidiropoulos, IRLab, University of Amsterdam, Amsterdam, The Netherlands; Samarth Bhargav, IRLab, University of Amsterdam, Amsterdam, The Netherlands; Panagiotis Eustratiadis, IRLab, University of Amsterdam, Amsterdam, The Netherlands; Evangelos Kanoulas, IRLab, University of Amsterdam, Amsterdam, The Netherlands |
| Pseudocode | No | The paper describes methodologies using mathematical formulations and descriptive text, for example, in Section 3 'Methodology' and its subsections. However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps. |
| Open Source Code | Yes | The source code for our reproducibility study is available at: https://github.com/samarthbhargav/multivariate_ir/ |
| Open Datasets | Yes | Our evaluation is performed on both in-domain (ID) and out-of-domain (OOD) data. ... In-Domain (ID). We train all models on the MS-MARCO (Nguyen et al., 2016) training set. ... There are three in-domain evaluation sets, all of which are based on the MS-MARCO corpus. This includes the MS-MARCO Dev set, the TREC-DL 2019 (Craswell et al., 2020) and TREC-DL 2020 (Craswell et al., 2021) datasets. ... Out-of-Domain (OOD). ... SciFact (Wadden et al., 2020): a scientific claim verification dataset... FiQA (Maia et al., 2018): a dataset that involves retrieval of documents in the financial domain... TREC-COVID (Voorhees et al., 2021): a biomedical dataset... CQADupStack (Hoogeveen et al., 2015): a community question answering (CQA) dataset... We experiment with the DL-Typo (Zhuang & Zuccon, 2022) dataset... |
| Dataset Splits | Yes | Since MS-MARCO does not include a validation set, we split the training set into a validation (6890 queries) and a training set. |
| Hardware Specification | Yes | With that setup, we use a batch size of 15 queries, the maximum that can fit in a 40GB A100 GPU, given the size of our model. |
| Software Dependencies | No | We use the Tevatron toolkit (Gao et al., 2023) to train the models and the pytrec_eval library (Van Gysel & de Rijke, 2018) to evaluate the retrieval performance. Finally, our QPP baselines are based on an existing implementation by Meng et al. (2023). The paper mentions software tools like Tevatron and pytrec_eval, and a cross-encoder model available on Hugging Face, but it does not specify explicit version numbers for these software dependencies, which is required for a reproducible description. |
| Experiment Setup | Yes | We train MRL for 200K steps. In each step, we optimize the distillation loss (Eq. 12) using a batch of queries, one positive passage per query, and 30 negative passages per query; 5 of the negative passages are mined with BM25, and 25 are mined with the student model. With that setup, we use a batch size of 15 queries... We set the maximum length for queries and passages to 32 and 256 tokens, respectively. We initialize the dense retriever student model with the official TAS-B checkpoint... We use an Adam optimizer with a learning rate of 5 × 10^-6, and linear learning rate scheduling with warm-up for 10% of the training steps. The β parameter for softplus is set to 2.5. ... The MRL models reported use means and variances projected down to 383 (= 768/2 - 1). ... Refer to Appendix E for the full set of hyperparameters. |
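The Dataset Splits row notes that a 6890-query validation set is carved out of the MS-MARCO training queries. A minimal sketch of such a hold-out split; the function name, seed, and uniform random sampling are illustrative assumptions, not the authors' exact procedure:

```python
import random

def split_validation(query_ids, val_size=6890, seed=42):
    """Hold out a fixed-size validation set from a pool of training queries.

    The 6890-query validation size comes from the paper; the seed and
    the random-sampling strategy are assumptions for illustration.
    Returns (train_ids, val_ids).
    """
    rng = random.Random(seed)
    ids = list(query_ids)
    rng.shuffle(ids)
    return ids[val_size:], ids[:val_size]

# Toy usage with synthetic query IDs (MS-MARCO has far more queries):
train_ids, val_ids = split_validation([f"q{i}" for i in range(10_000)])
print(len(train_ids), len(val_ids))  # 3110 6890
```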
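The setup fixes the softplus β at 2.5; since the study's central finding concerns MRL's variance definition, it may help to see what that activation computes. A sketch of the β-parameterized softplus (the standard definition, as in e.g. PyTorch's `Softplus`); how exactly it is wired into MRL's encoder is not shown here:

```python
import math

def softplus(x, beta=2.5):
    """Softplus with inverse-temperature beta: (1/beta) * log(1 + exp(beta*x)).

    The output is strictly positive, which is why it is a common way to
    parameterize a variance. For large inputs it approaches the identity,
    so we short-circuit to avoid exp() overflow.
    """
    z = beta * x
    if z > 20.0:
        return x
    return math.log1p(math.exp(z)) / beta

print(softplus(-3.0))  # small but strictly positive
print(softplus(10.0))  # close to 10.0
```

Because the output is always positive, a network head passed through this function can never emit a negative "variance"; whether that value behaves like a calibrated uncertainty is exactly what the study probes.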
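The training schedule (200K steps, peak learning rate 5e-6, linear warm-up over the first 10% of steps) can be sketched as a step-to-lr function. Decaying linearly to zero after warm-up is an assumption here, matching the common "linear schedule" convention in toolkits such as Hugging Face transformers; the paper only states "linear learning rate scheduling with warm-up":

```python
def linear_warmup_lr(step, total_steps=200_000, base_lr=5e-6, warmup_frac=0.10):
    """Linear warm-up to base_lr, then linear decay to zero.

    Mirrors the reported setup: 200K steps, peak lr 5e-6, 10% warm-up.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # ramp from 0 at step 0 to base_lr at warmup_steps
        return base_lr * step / warmup_steps
    # decay from base_lr at warmup_steps to 0 at total_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

print(linear_warmup_lr(10_000))   # halfway through warm-up: 2.5e-06
print(linear_warmup_lr(20_000))   # peak: 5e-06
print(linear_warmup_lr(200_000))  # end of training: 0.0
```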