Multi-Field Adaptive Retrieval

Authors: Millicent Li, Tongfei Chen, Ben Van Durme, Patrick Xia

ICLR 2025

Reproducibility checklist (Variable: Result, followed by the supporting excerpt from the paper):
Research Type: Experimental. We find that our approach allows for the optimized use of dense versus lexical representations across field types, significantly improves document ranking over a number of existing retrievers, and achieves state-of-the-art performance for multi-field semi-structured data. Our experiments are motivated by the following hypotheses: (1) taking advantage of the multi-field document structure will lead to better accuracy than treating the document in its entirety, as a single field; (2) hybrid (a combination of lexical and dense) approaches to modeling will perform better than using only one or the other. We use STaRK (Wu et al., 2024b), a collection of three retrieval datasets in the domains of product reviews (Amazon), academic articles (MAG), and biomedical knowledge (Prime).
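The second hypothesis — mixing lexical and dense scores per field — can be sketched as below. Everything here is an illustrative assumption (function names, the simple weighted sum, the dot-product dense score); it is not the authors' implementation, whose weights come from a learned, query-conditioned gating function G(q, f, m).

```python
def hybrid_score(query_vec, field_vecs, lexical_scores, weights):
    """Illustrative hybrid scorer: combine per-field dense and lexical
    scores with per-field weights.

    weights[f] = (w_dense, w_lex) -- assumed here to be given; in the
    paper they would come from a learned gating function G(q, f, m).
    """
    total = 0.0
    for f, vec in field_vecs.items():
        # dense similarity: plain dot product between query and field embedding
        dense = sum(q * d for q, d in zip(query_vec, vec))
        w_dense, w_lex = weights[f]
        total += w_dense * dense + w_lex * lexical_scores.get(f, 0.0)
    return total
```

A document is then ranked by summing these field-level hybrid scores, so a field where lexical matching is more informative (e.g. a title) can receive a different dense/lexical mix than a free-text body field.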
Researcher Affiliation: Collaboration. Millicent Li1, Tongfei Chen2, Benjamin Van Durme3, Patrick Xia3 — 1Northeastern University, 2Augment Code, 3Microsoft. EMAIL, EMAIL
Pseudocode: No. The paper describes mathematical formulations for its method in Section 2.2, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code.
Open Source Code: Yes. "Work done while at Microsoft." https://github.com/microsoft/multifield-adaptive-retrieval
Open Datasets: Yes. We use STaRK (Wu et al., 2024b), a collection of three retrieval datasets in the domains of product reviews (Amazon), academic articles (MAG), and biomedical knowledge (Prime), each derived from knowledge graphs. Amazon contains queries and documents from Amazon Product Reviews (He & McAuley, 2016) and Amazon Question and Answer Data (McAuley et al., 2015). MAG contains queries and documents about academic papers, sourced from the Microsoft Academic Graph (Wang et al., 2020), ogbn-MAG, and ogbn-papers100M (Hu et al., 2020). Prime contains queries and documents regarding biomedicine from PrimeKG (Chandak et al., 2022).
Dataset Splits: Yes. Table 5 reports the corpus size, number of fields, and queries (by split) for each of the STaRK datasets:

Dataset  Num. Documents  Num. Fields  Train  Dev.  Test
Amazon   950K            ...          6K     1.5K  1.5K
MAG      700K            ...          8K     2.6K  2.6K
Prime    130K            ...          6.1K   2.2K  2.8K
Hardware Specification: Yes. During training, we sample k = 1 negative example per query. Along with in-batch negatives, this results in 2b − 1 negative samples for a batch size of b. This negative document is sampled using Pyserini (Lucene): 100 nearest documents are retrieved, of which the positive documents are removed. The top negative document is then sampled from that remaining set. We apply early stopping on validation loss with a patience of 5. We set τ = 0.05 and train with DDP on 8x NVIDIA A100s.
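The negative-sampling setup described above can be sketched as follows. `bm25_retrieve` is a hypothetical stand-in for the Pyserini call, and taking the top remaining candidate is one reading of "the top negative document is then sampled"; the arithmetic helper checks the 2b − 1 count (b − 1 in-batch positives from other queries plus b sampled negatives when k = 1).

```python
def sample_hard_negative(query, positives, bm25_retrieve, k=100):
    """Retrieve the top-k candidates lexically, drop the positive
    documents, and take the highest-ranked remaining document as the
    hard negative. `bm25_retrieve` is a hypothetical stand-in that
    returns a ranked list of document ids."""
    candidates = bm25_retrieve(query, k)
    negatives = [d for d in candidates if d not in positives]
    return negatives[0] if negatives else None

def negatives_per_query(b, k=1):
    """Each query sees the other b - 1 in-batch positives as negatives,
    plus all b * k sampled hard negatives -> 2b - 1 when k = 1."""
    return (b - 1) + b * k
```

For example, with a batch size of 8 this yields 15 negatives per query, matching the 2b − 1 figure in the excerpt.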
Software Dependencies: Yes. Our implementation uses PyTorch Lightning and sentence-transformers 2.2.2 (Reimers & Gurevych, 2019). We use a fast, Python-based implementation of BM25 as our lexical scorer (Lù, 2024).
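To make the lexical scorer's role concrete, here is a minimal pure-Python Okapi BM25 sketch. It is an illustration of the scoring formula only, not the API of the fast library the paper actually uses; documents are assumed to be pre-tokenized lists of terms.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`
    with Okapi BM25 (illustrative; not the library used in the paper)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    df = Counter()                         # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        dl = len(d)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # term-frequency saturation with length normalization
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores
```

In the paper's setting such a scorer would be applied per field, alongside the dense scorer, rather than to the whole document at once.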
Experiment Setup: Yes. We set τ = 0.05 and train with DDP on 8x NVIDIA A100s. Contriever is a 110M-parameter model, and the additional parameters added through G are negligible (768|F|), scaling linearly in the number of fields. We use separate learning rates (LRs) for finetuning the encoder and for the other parameters. Specifically, we searched over learning rates [5e-6, 1e-5, 5e-5, 1e-4] for the encoder and [1e-3, 5e-3, 1e-2, 5e-2, 1e-1] for the parameters in G(q, f, m), which consist of a_f^m, γ_f^m, and β_f^m from batch normalization. The main grid search was conducted over the bolded values, although we found 5e-3 to be effective for G(q, f, m) for Amazon. We otherwise follow the default settings for both the optimizer (AdamW, dropout, etc.) and batch normalization (PyTorch 2.4.0).
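The learning-rate search described above works out to 4 × 5 = 20 (encoder LR, gating LR) configurations. A minimal sketch of enumerating them, with the grid values taken directly from the excerpt (the function name is illustrative):

```python
from itertools import product

# LR grids from the paper's setup: one for the Contriever encoder,
# one for the gating parameters G(q, f, m).
ENCODER_LRS = [5e-6, 1e-5, 5e-5, 1e-4]
GATING_LRS = [1e-3, 5e-3, 1e-2, 5e-2, 1e-1]

def lr_grid(encoder_lrs, gating_lrs):
    """Enumerate every (encoder LR, gating LR) pair; each pair would be
    trained and compared on the dev split."""
    return list(product(encoder_lrs, gating_lrs))
```

In practice the two groups would be passed to the optimizer as separate parameter groups so each gets its own learning rate.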