Semi-Parametric Retrieval via Binary Bag-of-Tokens Index
Authors: Jiawei Zhou, Li Dong, Furu Wei, Lei Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive evaluation across 16 retrieval benchmarks demonstrates that SiDR outperforms both neural and term-based retrieval baselines under the same indexing workload: (i) when using a parametric embedding-based index, SiDR exceeds the performance of conventional neural retrievers while maintaining similar training complexity; (ii) when using a non-parametric tokenization-based index, SiDR matches the complexity of traditional term-based retrieval (BM25) while consistently outperforming it on in-domain datasets; (iii) additionally, we introduce a late parametric mechanism that matches BM25 index preparation time for search while outperforming both BM25 and other neural retrieval baselines in effectiveness. |
| Researcher Affiliation | Collaboration | Jiawei Zhou (1,3), Li Dong (2), Furu Wei (2), Lei Chen (1,3). 1: The Hong Kong University of Science and Technology; 2: Microsoft Research; 3: The Hong Kong University of Science and Technology (Guangzhou). |
| Pseudocode | No | The paper describes its methods with equations, such as those for Vθ(x) and V_BoT(x), but does not present them in a structured pseudocode block or algorithm format. |
| Open Source Code | Yes | Code is available at https://github.com/jzhoubu/sidr. |
| Open Datasets | Yes | Wiki21m benchmark. Following established benchmarks in the retrieval literature (Chen et al., 2017; Karpukhin et al., 2020), we train our model on the training splits of the Natural Questions (NQ; Kwiatkowski et al., 2019), TriviaQA (TQA; Joshi et al., 2017), and WebQuestions (WQ; Berant et al., 2013) datasets, and evaluate it on their respective test splits. The retrieval corpus used is Wikipedia, which contains over 21 million 100-word passages. BEIR benchmark. We train our model on the MS MARCO passage ranking dataset (Bajaj et al., 2016), which consists of approximately 8.8 million passages with around 500 thousand queries. The performance is assessed both in-domain on MS MARCO and in a zero-shot setting across 12 diverse datasets within the BEIR benchmark (Thakur et al., 2021). |
| Dataset Splits | Yes | Wiki21m benchmark. Following established benchmarks in the retrieval literature (Chen et al., 2017; Karpukhin et al., 2020), we train our model on the training splits of the Natural Questions (NQ; Kwiatkowski et al., 2019), TriviaQA (TQA; Joshi et al., 2017), and WebQuestions (WQ; Berant et al., 2013) datasets, and evaluate it on their respective test splits. |
| Hardware Specification | Yes | For computational devices, our systems are equipped with 4 NVIDIA A100 GPUs and Intel Xeon Platinum 8358 CPUs. |
| Software Dependencies | No | The paper mentions using Python, PyTorch's sparse module, Pyserini, Java, and Lucene, but does not specify version numbers for any of these software components. For example, it states: 'Our implementation is in Python, leveraging PyTorch's sparse module' and 'For BM25, we utilize Pyserini (Lin et al., 2021), a library based on a Java implementation developed around Lucene.' |
| Experiment Setup | Yes | For the NQ, TQA, and WQ datasets, our model is trained for 80 epochs, utilizing in-training retrieval for negative sampling. For the MS MARCO dataset, the training duration is set to 40 epochs. We utilize a batch size of 128 and an AdamW optimizer (Loshchilov & Hutter, 2018) with a learning rate of 2 × 10⁻⁵. Our model uses top-k sparsification with k = 768, matching the dimensionality of conventional dense retrieval embeddings. |
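The non-parametric tokenization-based index referenced above corresponds to a binary bag-of-tokens vector V_BoT(x): a vocabulary-sized vector with a 1 at each token id that occurs in the passage. The following is a minimal illustrative sketch, not the paper's implementation; the function name and plain-list representation are assumptions for clarity (the actual code reportedly uses PyTorch's sparse module).

```python
def v_bot(token_ids, vocab_size):
    """Binary bag-of-tokens vector: 1 at each token id present in the input.

    token_ids: token ids produced by some tokenizer (duplicates are ignored,
    since the representation is binary presence, not counts).
    """
    vec = [0] * vocab_size
    for t in set(token_ids):
        if 0 <= t < vocab_size:
            vec[t] = 1
    return vec
```

Because such a vector needs only tokenization (no model forward pass), building this index has roughly the same preparation cost as a term-based index like BM25, which is the point of comparison in the table above.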
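The top-k sparsification mentioned in the experiment setup (k = 768) keeps only the k largest entries of a vocabulary-sized score vector and zeroes the rest, so that the sparse representation stores no more values than a conventional 768-dimensional dense embedding. A minimal sketch, assuming plain Python lists rather than the paper's PyTorch sparse tensors; the function name is illustrative:

```python
def topk_sparsify(scores, k):
    """Zero out all but the k largest entries of a score vector."""
    if k >= len(scores):
        return list(scores)
    # Indices of the k largest values.
    keep = set(sorted(range(len(scores)),
                      key=lambda i: scores[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(scores)]
```

For example, `topk_sparsify([0.1, 0.9, 0.3, 0.5], 2)` keeps only the two largest scores and returns `[0.0, 0.9, 0.0, 0.5]`. In a PyTorch implementation the same effect is typically achieved with `torch.topk` followed by a scatter into a sparse tensor.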