Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Unsupervised Dense Information Retrieval with Contrastive Learning
Authors: Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, Edouard Grave
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we explore the limits of contrastive learning as a way to train unsupervised dense retrievers and show that it leads to strong performance in various retrieval settings. On the BEIR benchmark our unsupervised model outperforms BM25 on 11 out of 15 datasets for the Recall@100. ... We perform ablations to motivate our design choices, and show that cropping works better than the inverse Cloze task. ... In this section, we empirically evaluate our best retriever trained with contrastive learning, called Contriever (contrastive retriever), which uses MoCo with random cropping. ... Section 4: Experiments. ... Section 6: Ablation studies. |
| Researcher Affiliation | Collaboration | Meta AI Research, Ecole normale supérieure, PSL University, Inria, Université Grenoble Alpes, University College London |
| Pseudocode | No | The paper describes methods in prose, occasionally using mathematical formulas, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures with structured, code-like steps. |
| Open Source Code | Yes | Code and pre-trained models are available here: https://github.com/facebookresearch/contriever. |
| Open Datasets | Yes | First, we evaluate our model on two question answering datasets: Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017). ... Second, we use the BEIR benchmark, introduced by Thakur et al. (2021), which contains 18 retrieval datasets... we use a simple strategy for negative mining and do not use distillation. Our model would probably also benefit from improvements proposed by these retrievers, but this is beyond the scope of this paper. |
| Dataset Splits | Yes | For SciFact, we hold out randomly 10% of the training data and use them as development set, leading to a train set containing 729 samples. |
| Hardware Specification | No | In these ablations, all the models are pre-trained on English Wikipedia for 200k gradient steps, with a batch size of 2,048 (on 32 GPUs). |
| Software Dependencies | No | The paper mentions several components like 'AdamW optimizer (Loshchilov & Hutter, 2019)', 'ASAM optimizer (Kwon et al., 2021)', and 'BERT base uncased model', but does not provide specific version numbers for any software libraries or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We use the MoCo algorithm He et al. (2020) with a queue of size 131,072, a momentum value of 0.9995 and a temperature of 0.05. ... We optimize the model with the AdamW (Loshchilov & Hutter, 2019) optimizer, with learning rate of 5 × 10⁻⁵, batch size of 2,048 and 500,000 steps. ... For the fine-tuning on MS MARCO we do not use the MoCo algorithm and simply use in-batch negatives. We use the ASAM optimizer (Kwon et al., 2021), with a learning rate of 10⁻⁵ and a batch size of 1024 with a temperature of 0.05, also used during pre-training. We train an initial model with random negative examples for 20,000 steps... For the few-shot evaluation presented in Table 3, we train for 500 epochs on each dataset with a batch size of 256 with in-batch random negatives. |
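The "random cropping" that the ablations favor over the inverse Cloze task can be illustrated with a short sketch. This is not the paper's released code; the function name and the span-length bounds here are hypothetical, chosen only to show the idea: two contiguous spans sampled independently from the same document form a positive pair, while crops of other documents in the batch act as negatives.

```python
import random

def random_crop_pair(tokens, ratio_min=0.1, ratio_max=0.5, rng=None):
    """Sample two independent contiguous spans from one token sequence.

    The two crops form a positive pair for contrastive training; the
    ratio bounds are illustrative, not the paper's exact values.
    """
    rng = rng or random.Random()

    def one_crop():
        # Span length is a random fraction of the document length.
        length = max(1, int(len(tokens) * rng.uniform(ratio_min, ratio_max)))
        start = rng.randint(0, len(tokens) - length)
        return tokens[start:start + length]

    return one_crop(), one_crop()
```

Because the two crops are drawn independently, they may overlap or be disjoint; both cases yield a usable positive pair.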
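The fine-tuning setup above scores each query against every passage in the batch (in-batch negatives) with a temperature-scaled contrastive loss. A minimal dependency-free sketch of that loss follows; it is an illustration with hypothetical names, not the paper's implementation, and it omits the MoCo queue and momentum encoder used during pre-training.

```python
import math

def info_nce_in_batch(queries, keys, temperature=0.05):
    """InfoNCE loss with in-batch negatives.

    queries[i] and keys[i] are embeddings of a positive pair; every
    keys[j] with j != i serves as a negative for queries[i]. Dot-product
    similarities are divided by the temperature (0.05 in the paper's
    setup) before the softmax cross-entropy.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])  # -log softmax at the positive
    return sum(losses) / len(losses)
```

Lowering the temperature sharpens the softmax, so hard negatives (passages nearly as similar as the positive) dominate the gradient.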