Semi-Parametric Retrieval via Binary Bag-of-Tokens Index
Authors: Jiawei Zhou, Li Dong, Furu Wei, Lei Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive evaluation across 16 retrieval benchmarks demonstrates that SiDR outperforms both neural and term-based retrieval baselines under the same indexing workload: (i) when using a parametric embedding-based index, SiDR exceeds the performance of conventional neural retrievers while maintaining similar training complexity; (ii) when using a non-parametric tokenization-based index, SiDR matches the complexity of traditional term-based retrieval (BM25) while consistently outperforming it on in-domain datasets; (iii) additionally, we introduce a late parametric mechanism that matches BM25 index preparation time for search while outperforming both BM25 and other neural retrieval baselines in effectiveness. |
| Researcher Affiliation | Collaboration | Jiawei Zhou (1,3), Li Dong (2), Furu Wei (2), Lei Chen (1,3). 1: The Hong Kong University of Science and Technology; 2: Microsoft Research; 3: The Hong Kong University of Science and Technology (Guangzhou). |
| Pseudocode | No | The paper describes its methods with equations, such as those for Vθ(x) and V_BoT(x), but does not present them in a structured pseudocode block or algorithm format. |
| Open Source Code | Yes | Code is available at https://github.com/jzhoubu/sidr. |
| Open Datasets | Yes | Wiki21m benchmark. Following established benchmarks in the retrieval literature (Chen et al., 2017; Karpukhin et al., 2020), we train our model on the training splits of the Natural Questions (NQ; Kwiatkowski et al., 2019), TriviaQA (TQA; Joshi et al., 2017), and WebQuestions (WQ; Berant et al., 2013) datasets, and evaluate it on their respective test splits. The retrieval corpus used is Wikipedia, which contains over 21 million 100-word passages. BEIR benchmark. We train our model on the MS MARCO passage ranking dataset (Bajaj et al., 2016), which consists of approximately 8.8 million passages with around 500 thousand queries. The performance is assessed both in-domain on MS MARCO and in a zero-shot setting across 12 diverse datasets within the BEIR benchmark (Thakur et al., 2021). |
| Dataset Splits | Yes | Wiki21m benchmark. Following established benchmarks in the retrieval literature (Chen et al., 2017; Karpukhin et al., 2020), we train our model on the training splits of the Natural Questions (NQ; Kwiatkowski et al., 2019), TriviaQA (TQA; Joshi et al., 2017), and WebQuestions (WQ; Berant et al., 2013) datasets, and evaluate it on their respective test splits. |
| Hardware Specification | Yes | For computational devices, our systems are equipped with 4 NVIDIA A100 GPUs and Intel Xeon Platinum 8358 CPUs. |
| Software Dependencies | No | The paper mentions using Python, PyTorch's sparse module, Pyserini, Java, and Lucene, but does not specify version numbers for any of these software components. For example, it states: 'Our implementation is in Python, leveraging PyTorch's sparse module' and 'For BM25, we utilize Pyserini (Lin et al., 2021), a library based on a Java implementation developed around Lucene.' |
| Experiment Setup | Yes | For the NQ, TQA, and WQ datasets, our model is trained for 80 epochs, utilizing in-training retrieval for negative sampling. For the MS MARCO dataset, the training duration is set to 40 epochs. We utilize a batch size of 128 and an AdamW optimizer (Loshchilov & Hutter, 2018) with a learning rate of 2 × 10⁻⁵. Our model uses top-k sparsification with k = 768, matching the dimensionality of conventional dense retrieval embeddings. |
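The non-parametric tokenization-based index referenced above corresponds to a binary bag-of-tokens vector V_BoT(x): a vocabulary-sized vector with a 1 at each token id that occurs in the passage. The following is a minimal illustrative sketch, not the paper's implementation; the function name and plain-list representation are assumptions for clarity (the actual code reportedly uses PyTorch's sparse module).

```python
def v_bot(token_ids, vocab_size):
    """Binary bag-of-tokens vector: 1 at each token id present in the input.

    token_ids: token ids produced by some tokenizer (duplicates are ignored,
    since the representation is binary presence, not counts).
    """
    vec = [0] * vocab_size
    for t in set(token_ids):
        if 0 <= t < vocab_size:
            vec[t] = 1
    return vec
```

Because such a vector needs only tokenization (no model forward pass), building this index has roughly the same preparation cost as a term-based index like BM25, which is the point of comparison in the table above.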
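The top-k sparsification mentioned in the experiment setup (k = 768) keeps only the k largest entries of a vocabulary-sized score vector and zeroes the rest, so that the sparse representation stores no more values than a conventional 768-dimensional dense embedding. A minimal sketch, assuming plain Python lists rather than the paper's PyTorch sparse tensors; the function name is illustrative:

```python
def topk_sparsify(scores, k):
    """Zero out all but the k largest entries of a score vector."""
    if k >= len(scores):
        return list(scores)
    # Indices of the k largest values.
    keep = set(sorted(range(len(scores)),
                      key=lambda i: scores[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(scores)]
```

For example, `topk_sparsify([0.1, 0.9, 0.3, 0.5], 2)` keeps only the two largest scores and returns `[0.0, 0.9, 0.0, 0.5]`. In a PyTorch implementation the same effect is typically achieved with `torch.topk` followed by a scatter into a sparse tensor.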