Contextual Document Embeddings
Authors: John X. Morris, Alexander Rush
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark... We consider a range of retrieval experiments across different scales. ... In this scenario, we evaluate on a truncated version of the BEIR benchmark (Thakur et al., 2021). ... We evaluate our models using NDCG@10, a conventional retrieval metric that enables comparison across many disparate datasets. |
| Researcher Affiliation | Academia | John X. Morris Cornell University EMAIL Alexander M. Rush Cornell University EMAIL |
| Pseudocode | No | The paper describes methods in prose and with mathematical equations, but does not include a distinct section or figure explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We train on the meta-datasets collected in Nussbaum et al. (2024) for training text embedding models. ... The supervised training phase includes 1.8M human-written query-document pairs ... aggregated from popular retrieval datasets such as Hotpot QA and MS MARCO (Yang et al., 2018; Bajaj et al., 2018). ... We evaluate on a truncated version of the BEIR benchmark (Thakur et al., 2021). ... evaluating on the full MTEB benchmark (Muennighoff et al., 2022). |
| Dataset Splits | Yes | We evaluate on a truncated version of the BEIR benchmark (Thakur et al., 2021). ... evaluating on the full MTEB benchmark (Muennighoff et al., 2022). ... The supervised training phase includes 1.8M human-written query-document pairs intended for text retrieval, and is aggregated from popular retrieval datasets such as Hotpot QA and MS MARCO (Yang et al., 2018; Bajaj et al., 2018). |
| Hardware Specification | No | The paper states 'Thanks to Nomic and Hyperbolic for providing the compute necessary to conduct this research.' but does not specify any particular GPU models, CPU models, or detailed computer specifications used for the experiments. |
| Software Dependencies | No | The paper mentions software components such as FAISS, Nomic BERT, BERT-base, Flash Attention, and Adam optimizer, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | For all experiments, we train with the Adam optimizer with 1000 steps of warmup to a learning rate of 2 * 10^-5 and linearly decay to 0 throughout training. We train for three epochs unless otherwise specified. We set the maximum sequence length for all inputs to 512 and the number of contextual inputs to 512 (so the second-stage model has an input length of 1024). When computing contrastive loss, we use a fixed temperature of τ = 0.02. When sequence dropout is enabled in our contextual architecture, we set contextual input tokens to null vectors with a uniform probability p = 0.005. ... we are able to pre-train and fine-tune both biencoder and contextual models across a variety of batch sizes in {256, 512, 1024, 2048, 4096} and cluster sizes {64, 256, 1024, 4096, ..., 2097152, 4194304}. |
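The training setup reported above (linear warmup to 2e-5 then linear decay to 0, plus a contrastive loss with fixed temperature 0.02) can be sketched as plain functions. This is a minimal illustration of the described hyperparameters, not the authors' code; the function names and the choice of a softmax-over-similarities (InfoNCE-style) loss form are our assumptions.

```python
import math

def lr_at_step(step, total_steps, warmup_steps=1000, peak_lr=2e-5):
    """Learning-rate schedule as described: linear warmup over
    `warmup_steps` to `peak_lr`, then linear decay to 0 at `total_steps`.
    (Helper name is ours, not from the paper.)"""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

def contrastive_loss(sim_row, positive_index, temperature=0.02):
    """InfoNCE-style loss for one query: `sim_row` holds similarities to
    all candidate documents in the batch, `positive_index` marks the
    matching document. Uses the paper's fixed temperature of 0.02.
    Computed with a max-shift for numerical stability."""
    scaled = [s / temperature for s in sim_row]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -(scaled[positive_index] - log_z)
```

For example, with a 10,000-step run the rate rises to 2e-5 at step 1000 and falls back to 0 by step 10,000; and a query whose positive document has much higher similarity than the negatives yields a loss near zero.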