InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers
Authors: Leonid Boytsov, Preksha Patel, Vivek Sourabh, Riddhi Nisar, Sayani Kundu, Ramya Ramanathan, Eric Nyberg
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carried out a reproducibility study of InPars, which is a method for unsupervised training of neural rankers (Bonifacio et al., 2022). As a by-product, we developed InPars-light, which is a simple-yet-effective modification of InPars. ... On all five English retrieval collections (used in the original InPars study) we obtained substantial (7%-30%) and statistically significant improvements over BM25 (in nDCG and MRR) using only a 30M parameter six-layer MiniLM-30M ranker and a single three-shot prompt. ... Our detailed experimental results are presented in Table 3. |
| Researcher Affiliation | Collaboration | Leonid Boytsov, Amazon AWS AI Labs, Pittsburgh, USA; Preksha Patel, Vivek Sourabh, Riddhi Nisar, Sayani Kundu, Ramya Ramanathan, Eric Nyberg, Carnegie Mellon University, Pittsburgh, USA |
| Pseudocode | No | The paper describes methods like the Information Retrieval Pipeline and the InPars-light Training Recipe in descriptive text form and provides a prompt template in Table 2, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and data are publicly available. https://github.com/searchivarius/inpars_light/ |
| Open Datasets | Yes | Because we aimed to reproduce the main results of InPars (Bonifacio et al., 2022), we used exactly the same set of queries and datasets, which are described below. Except MS MARCO (which was processed directly using FlexNeuART Boytsov & Nyberg (2020) scripts), datasets were ingested with the help of the ir_datasets package (MacAvaney et al., 2021). ... MS MARCO (Bajaj et al., 2016; Craswell et al., 2020)... Robust04 (Voorhees, 2004)... Natural Questions (NQ) BEIR (Kwiatkowski et al., 2019)... TREC COVID BEIR (Roberts et al., 2020)... |
| Dataset Splits | Yes | For each collection Bonifacio et al. (2022) generated 100K synthetic queries and retained only 10K with the highest average log-probabilities. ... MS MARCO development set with approximately 6.9K sparsely-judged queries and the TREC DL 2020 (Craswell et al., 2020) collection of 54 densely judged queries. ... Robust04 (Voorhees, 2004) is a small... with a small but densely judged set of 250 queries... Natural Questions (NQ) BEIR... with 3.4K sparsely-judged queries... TREC COVID BEIR... with 50 densely-judged queries... |
| Hardware Specification | Yes | it takes only about 15 hours to generate 100K queries using an RTX 3090 GPU. Extrapolating this estimate to A100, which is about 2x faster than RTX 3090, and using the pricing of Lambda GPU cloud, we estimate the cost of generation in our InPars-light study to be under $10 per collection. ... on a reasonably modern GPU (such as RTX 3090)... Here, all training times are given with respect to a single RTX 3090 GPU. |
| Software Dependencies | No | Both generative and ranking models were implemented using PyTorch and Huggingface (Wolf et al., 2020). ... Except MS MARCO (which was processed directly using FlexNeuART Boytsov & Nyberg (2020) scripts), datasets were ingested with the help of the ir_datasets package (MacAvaney et al., 2021). ... We used the AdamW optimizer (Loshchilov & Hutter, 2017)... The paper mentions software tools like PyTorch, Huggingface, FlexNeuART, ir_datasets, and AdamW, but does not specify their version numbers, which is required for reproducibility. |
| Experiment Setup | Yes | Ranking models were trained using the InfoNCE loss (Le-Khac et al., 2020). In a single training epoch, we selected randomly one pair of positive and three negative examples per query (negatives were sampled from 1000 documents with highest BM25 scores). ... We used the AdamW optimizer (Loshchilov & Hutter, 2017) with a small weight decay (10⁻⁷), a warm-up schedule, and a batch size of 16. We used different base rates for the fully-connected prediction head (2·10⁻⁴) and for the main Transformer layers (2·10⁻⁵). The mini-batch size was equal to one and a larger batch size was simulated using a 16-step gradient accumulation. ... We trained each ranking model using three seeds and reported the average results... The maximum number of new tokens generated for each example was set to 32. |
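The training recipe quoted in the Experiment Setup row can be illustrated with a minimal sketch. Assuming nothing beyond the quoted hyperparameters, the snippet below computes the per-query InfoNCE loss over one positive and three BM25-sampled negatives, and records the two learning-rate groups and the gradient-accumulation setting; the group labels and the helper name are illustrative placeholders, not identifiers from the paper's codebase.

```python
import math

def infonce_loss(scores):
    """InfoNCE loss for a single query: scores[0] is the positive
    candidate, scores[1:] are negatives (three BM25-sampled negatives
    per query in the quoted recipe). This is cross-entropy with the
    positive treated as the target class."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]  # -log softmax(positive score)

# Hypothetical optimizer configuration mirroring the quoted settings:
# AdamW with weight decay 10^-7, separate base learning rates for the
# prediction head (2e-4) and the main Transformer layers (2e-5), and
# an effective batch of 16 via micro-batches of one with 16-step
# gradient accumulation.
OPTIMIZER_GROUPS = [
    {"params": "prediction_head", "lr": 2e-4, "weight_decay": 1e-7},
    {"params": "transformer_layers", "lr": 2e-5, "weight_decay": 1e-7},
]
GRAD_ACCUM_STEPS = 16  # micro-batch size 1, simulated batch size 16
```

With four equal scores the loss is log 4; as the positive's score grows relative to the negatives, the loss approaches zero, which is the behavior the contrastive objective relies on.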