Distribution-Driven Dense Retrieval: Modeling Many-to-One Query-Document Relationship

Authors: Junfeng Kang, Rui Li, Qi Liu, Zhenya Huang, Zheng Zhang, Yanjiang Chen, Linbo Zhu, Yu Su

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we conduct extensive experiments on real-world datasets, which demonstrate that our method significantly outperforms traditional dense retrieval methods."
Researcher Affiliation | Academia | 1 State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; 2 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; 3 School of Computer Science and Artificial Intelligence, Hefei Normal University. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology through textual explanations, mathematical equations, and a framework diagram (Figure 2), but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/tojunfeng/DDR
Open Datasets | Yes | "We conducted experiments on MS MARCO (Nguyen et al. 2016), TREC Track 2019 and 2020 (Craswell et al. 2020), followed by additional experiments on zero-shot datasets (Thakur et al. 2021)."
Dataset Splits | No | The paper uses standard datasets such as MS MARCO and the TREC DL tracks, but it does not explicitly describe how these datasets were split into training, validation, and test sets, nor does it give exact percentages, sample counts, or a partitioning methodology.
Hardware Specification | No | The paper mentions "Time for Retrieval per Query on GPU" in Figure 3, but it does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments.
Software Dependencies | No | The paper mentions using PyTorch for implementation and training, the AdamW optimizer, the DistilBERT and ELECTRA architectures, and tools such as FAISS, but it does not specify version numbers for any of these software dependencies.
Experiment Setup | Yes | "We implemented and trained our model using PyTorch, optimizing the network parameters with the AdamW (Loshchilov and Hutter 2017) optimizer. We applied a linear learning rate schedule with a warmup phase of 1,000 steps, setting the learning rate to 2×10⁻⁵. The parameter β was selected from {0.1, 0.2, 0.5, 1, 2, 5, 10}. The mean and variance vectors of the document distribution are set to 768 dimensions. For fair comparison with existing single-vector models, we add dense projection layers on the mean vector and variance vector to make their dimensions 768/2 − 1 = 383. Following previous work (Zamani and Bendersky 2023), we used the pre-trained checkpoints provided by TAS-B (Hofstätter et al. 2021) for initialization and used DistilBERT (Sanh et al. 2019) as our initial model."
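As a minimal sketch, the training schedule and projection dimensions quoted above can be expressed as plain Python. The peak learning rate (2×10⁻⁵), the 1,000-step warmup, and the 768 → 383 projection come from the paper's setup; the total step count is an assumption, since the excerpt does not report it, and the function name is illustrative rather than from the authors' code.

```python
# Sketch of the linear warmup-then-decay learning-rate schedule described
# in the experiment setup. base_lr and warmup match the paper (2e-5, 1,000
# steps); total_steps is an assumed placeholder, not reported in the excerpt.

HIDDEN = 768
PROJ_DIM = HIDDEN // 2 - 1  # 383-dim mean/variance projections, per the setup


def linear_warmup_lr(step, base_lr=2e-5, warmup=1_000, total_steps=100_000):
    """Learning rate at a given optimizer step."""
    if step < warmup:
        # Linear warmup from 0 to base_lr over the first `warmup` steps.
        return base_lr * step / warmup
    # Linear decay from base_lr down to 0 at total_steps.
    frac = (total_steps - step) / (total_steps - warmup)
    return base_lr * max(0.0, frac)


print(PROJ_DIM)                  # 383
print(linear_warmup_lr(500))     # ~1e-5, halfway through warmup
print(linear_warmup_lr(1_000))   # peak rate of 2e-5
```

In a PyTorch training loop this function would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR` as a multiplier on the optimizer's base rate, which matches the "linear learning rate schedule with a warmup phase" the authors describe.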