Contextual Document Embeddings
Authors: John X. Morris, Alexander Rush
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark... We consider a range of retrieval experiments across different scales. ... In this scenario, we evaluate on a truncated version of the BEIR benchmark (Thakur et al., 2021). ... We evaluate our models using NDCG@10, a conventional retrieval metric that enables comparison across many disparate datasets. |
| Researcher Affiliation | Academia | John X. Morris Cornell University EMAIL Alexander M. Rush Cornell University EMAIL |
| Pseudocode | No | The paper describes methods in prose and with mathematical equations, but does not include a distinct section or figure explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We train on the meta-datasets collected in Nussbaum et al. (2024) for training text embedding models. ... The supervised training phase includes 1.8M human-written query-document pairs ... aggregated from popular retrieval datasets such as Hotpot QA and MS MARCO (Yang et al., 2018; Bajaj et al., 2018). ... We evaluate on a truncated version of the BEIR benchmark (Thakur et al., 2021). ... evaluating on the full MTEB benchmark (Muennighoff et al., 2022). |
| Dataset Splits | Yes | We evaluate on a truncated version of the BEIR benchmark (Thakur et al., 2021). ... evaluating on the full MTEB benchmark (Muennighoff et al., 2022). ... The supervised training phase includes 1.8M human-written query-document pairs intended for text retrieval, and is aggregated from popular retrieval datasets such as Hotpot QA and MS MARCO (Yang et al., 2018; Bajaj et al., 2018). |
| Hardware Specification | No | The paper states 'Thanks to Nomic and Hyperbolic for providing the compute necessary to conduct this research.' but does not specify any particular GPU models, CPU models, or detailed computer specifications used for the experiments. |
| Software Dependencies | No | The paper mentions software components such as FAISS, Nomic BERT, BERT-base, Flash Attention, and Adam optimizer, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | For all experiments, we train with the Adam optimizer with 1000 steps of warmup to a learning rate of 2 * 10^-5 and linearly decay to 0 throughout training. We train for three epochs unless otherwise specified. We set the maximum sequence length for all inputs to 512 and the number of contextual inputs to 512 (so the second-stage model has an input length of 1024). When computing contrastive loss, we use a fixed temperature of τ = 0.02. When sequence dropout is enabled in our contextual architecture, we set contextual input tokens to null vectors with a uniform probability p = 0.005. ... we are able to pre-train and fine-tune both biencoder and contextual models across a variety of batch sizes in {256, 512, 1024, 2048, 4096} and cluster sizes {64, 256, 1024, 4096, ..., 2097152, 4194304}. |
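The training setup reported above (linear warmup to 2e-5 then linear decay to 0, plus a contrastive loss with fixed temperature 0.02) can be sketched as plain functions. This is a minimal illustration of the described hyperparameters, not the authors' code; the function names and the choice of a softmax-over-similarities (InfoNCE-style) loss form are our assumptions.

```python
import math

def lr_at_step(step, total_steps, warmup_steps=1000, peak_lr=2e-5):
    """Learning-rate schedule as described: linear warmup over
    `warmup_steps` to `peak_lr`, then linear decay to 0 at `total_steps`.
    (Helper name is ours, not from the paper.)"""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

def contrastive_loss(sim_row, positive_index, temperature=0.02):
    """InfoNCE-style loss for one query: `sim_row` holds similarities to
    all candidate documents in the batch, `positive_index` marks the
    matching document. Uses the paper's fixed temperature of 0.02.
    Computed with a max-shift for numerical stability."""
    scaled = [s / temperature for s in sim_row]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -(scaled[positive_index] - log_z)
```

For example, with a 10,000-step run the rate rises to 2e-5 at step 1000 and falls back to 0 by step 10,000; and a query whose positive document has much higher similarity than the negatives yields a loss near zero.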