Nomic Embed: Training a Reproducible Long Context Text Embedder
Authors: Zach Nussbaum, John Xavier Morris, Andriy Mulyar, Brandon Duderstadt
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192-context-length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long-context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In this paper, we present an end-to-end training pipeline for a state-of-the-art long context text embedding model at only 137 million parameters. nomic-embed-text-v1 outperforms OpenAI text-embedding-ada and text-embedding-3-small performance on short-context (MTEB) and long-context (LoCo) benchmarks (Table 1). To evaluate nomic-embed-text-v1's effectiveness as a text encoder, we evaluate it on MTEB (Muennighoff et al., 2023), Jina's Long Context Benchmark (Günther et al., 2024), and LoCo (Saad-Falcon et al., 2024). |
| Researcher Affiliation | Collaboration | Zach Nussbaum (EMAIL), Nomic AI; John X. Morris (EMAIL, EMAIL), Nomic AI, Cornell University; Brandon Duderstadt (EMAIL), Nomic AI; Andriy Mulyar (EMAIL), Nomic AI |
| Pseudocode | No | The paper describes methods and training modifications using prose and mathematical equations (Equation 1, Equation 2) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the training code and model weights under an Apache 2.0 license. [...] You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors. |
| Open Datasets | Yes | In contrast with other open-source models, we release the full curated training data and code that allows for full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors. You can access the training data of nomic-embed-text-v1 by visiting the code repository. You can explore a 5M sample of our contrastive training pairs at https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample. Weakly-supervised contrastive pretraining datasets are detailed in Table 7. |
| Dataset Splits | No | The paper names the datasets used for training and finetuning (e.g., BooksCorpus, a Wikipedia dump, 29 datasets for weakly-supervised contrastive pretraining, and MSMarco, NQ, NLI, etc. for supervised finetuning) and evaluates on benchmarks such as MTEB, Jina Long Context, and LoCo, but it does not explicitly provide training/validation/test splits (percentages, counts, or a splitting methodology) for the data used in training; the datasets are used as a whole or for specific tasks. |
| Hardware Specification | Yes | Full training of nomic-embed-text-v1 can be conducted in a single week on one 8x H100 node. |
| Software Dependencies | No | The paper mentions several software tools and libraries, such as the AdamW optimizer, DeepSpeed (Rajbhandari et al., 2020) stage 2, GradCache (Luyu Gao & Callan, 2021), the bert-base-uncased tokenizer, and the FlashAttention (Dao et al., 2022) repository, but it does not specify version numbers for these components or for other key libraries (e.g., PyTorch, Hugging Face Transformers). |
| Experiment Setup | Yes | To train a long sequence length and efficient BERT, we adapt the BERT architecture. [...] Setting Dropout to 0 [...] Vocab size as a multiple of 64 [...] resulting in a 137 million parameter encoder. We train all stages with a max sequence length of 2048 [...]. We use a 30% masking rate instead of 15% [...]. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a max learning rate of 5e-4 with β1 = 0.9, β2 = 0.98. We employ a linear warmup of 6% of the total training steps and a linear decay to 0. We use a global batch size of 4096 with gradient accumulation over 8 batches. We utilize DeepSpeed (Rajbhandari et al., 2020) stage 2 to fit larger batches into memory. Additionally, we use bfloat16 mixed precision and fp32 for the gradient accumulation dtype. We disable gradient clipping (Liu et al., 2019) and set weight decay to 1e-5. [...] We use a global batch size of 16,384. We use AdamW with a learning rate of 2e-4, β1 = 0.9, β2 = 0.999, and weight decay of 0.01. Gradient clipping is set to 1.0. We use a linear warmup schedule of 700 steps and an inverse square root decay schedule. [...] We train for one epoch using seven hard negatives per pair and a batch size of 256. We employ a learning rate of 2e-5, β1 = 0.9, β2 = 0.999, and weight decay of 0.01. Gradient clipping is set to 1.0. We use a linear warmup schedule of 400 steps and a linear cooldown to 0 and train with prefixes as described above. |
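The contrastive-pretraining optimizer recipe quoted in the Experiment Setup row (AdamW with peak learning rate 2e-4, β1 = 0.9, β2 = 0.999, weight decay 0.01, 700 linear warmup steps, then inverse-square-root decay) can be sketched in PyTorch. This is an illustrative sketch, not the authors' code from the contrastors repository; the tiny `Linear` module stands in for the 137M-parameter encoder, and the step counts are assumptions for the example.

```python
import torch

# Stand-in for the 137M-parameter encoder; the real model is far larger.
model = torch.nn.Linear(16, 16)

warmup_steps = 700
peak_lr = 2e-4

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.999), weight_decay=0.01
)

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak lr, then inverse-square-root decay.
    if step < warmup_steps:
        return step / warmup_steps
    return (warmup_steps / step) ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

In training, `scheduler.step()` would be called once per optimizer step, so the learning rate reaches 2e-4 at step 700 and falls as `1/sqrt(step)` afterward (e.g., halved by step 2800).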
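The 30% masked-language-modeling rate the paper uses in place of BERT's usual 15% can be illustrated with a minimal masking function. This sketch is simplified and hypothetical: it always substitutes the `[MASK]` token (omitting BERT's 80/10/10 replacement split), and the function name, token IDs, and seed are assumptions, not the authors' implementation. The ID 103 is `[MASK]` in the bert-base-uncased vocabulary the paper's tokenizer uses.

```python
import random

MASK_RATE = 0.30        # the paper's rate, vs. BERT's default 0.15
MASK_TOKEN_ID = 103     # [MASK] in the bert-base-uncased vocabulary

def mask_tokens(token_ids, rate=MASK_RATE, seed=0):
    """Mask ~`rate` of positions; return (masked_ids, mlm_labels)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < rate:
            masked.append(MASK_TOKEN_ID)
            labels.append(tok)    # predict the original token here
        else:
            masked.append(tok)
            labels.append(-100)   # ignored by the MLM loss
    return masked, labels
```

The higher masking rate gives the 2048-token sequences proportionally more prediction targets per example; the returned `-100` labels follow the common convention of positions excluded from the cross-entropy loss.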