Repetition Improves Language Model Embeddings
Authors: Jacob Springer, Suhas Kotha, Daniel Fried, Graham Neubig, Aditi Raghunathan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We challenge this premise by introducing echo embeddings, which convert autoregressive LMs into high-quality text embedding models without changing the architecture or requiring fine-tuning. By repeating the input and extracting embeddings from the repeated tokens, which have access to all original tokens, echo embeddings improve over classical LM embeddings by over 5% in zero-shot settings. Our zero-shot embeddings nearly match those obtained by bidirectionally-converted LMs that undergo additional masked-language modeling training. Echo embeddings are also compatible with supervised fine-tuning, matching or outperforming bidirectionally-converted LMs in an apples-to-apples comparison, even with an identical compute budget during training and inference. Overall, repetition is a simple and effective strategy to circumvent the need for bidirectional attention in embedding models, paving the way towards a unified architecture for all NLP tasks. ... In this section, we describe how we implement and evaluate echo embeddings on real data. |
| Researcher Affiliation | Academia | Jacob Mitchell Springer Suhas Kotha Daniel Fried Graham Neubig Aditi Raghunathan Carnegie Mellon University |
| Pseudocode | No | The paper describes the 'Echo embeddings method' verbally in Section 3.2: 'Prompt the language model to act as an autoencoder, e.g., by asking the model to {repeat, rephrase, fix, etc.} the input; feed the sentence x to the language model twice; pool the contextualized embeddings of the second occurrence of x.' This is a textual description of the process, not structured pseudocode or an algorithm block. |
| Open Source Code | No | The paper makes no explicit statement about releasing its source code for the 'echo embeddings' methodology. It mentions an MTEB leaderboard and provides example prompts and datasets in appendices, but does not provide a direct link or affirmative statement for its own code release. |
| Open Datasets | Yes | Our main evaluation dataset is the English-language subset of the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022). MTEB is a collection of 56 datasets... We train on a collection of publicly available datasets that are standard training datasets in the embedding literature. We list and describe each of the datasets in Appendix F. ... ELI5 (sample ratio 0.1) (Fan et al., 2019), HotpotQA (Yang et al., 2018), FEVER (Thorne et al., 2018), MIRACL (Zhang et al., 2023b), MS-MARCO passage ranking (sample ratio 0.5) and document ranking (sample ratio 0.2) (Bajaj et al., 2018), NQ (Karpukhin et al., 2020), NLI (Gao et al., 2021b), SQuAD (Karpukhin et al., 2020), TriviaQA (Karpukhin et al., 2020), Quora Duplicate Questions (sample ratio 0.1) (DataCanary et al., 2017), Mr. TyDi (Zhang et al., 2021), DuReader (Qiu et al., 2022), and T2Ranking (sample ratio 0.5) (Xie et al., 2023). |
| Dataset Splits | Yes | Our main evaluation dataset is the English-language subset of the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022). ... Thus, we adopt a smaller 28-dataset subset of MTEB that spans all categories except summarization, which we call MTEB-MINI. ... We train on a collection of publicly available datasets that are standard training datasets in the embedding literature. ... Each batch is constructed by sampling a dataset from our set of training dataset, and then collecting examples from only this dataset. |
| Hardware Specification | Yes | MTEB is massive, and requires multiple days to evaluate a 7B-parameter model on 8 A100 GPUs. ... Training a model takes approximately two days on four A100 GPUs. Evaluating a model on the MTEB benchmark can be completed in parallel in approximately two days with eight A100s. |
| Software Dependencies | No | The paper mentions several language models used (e.g., Mistral-7B-Instruct-v0.1, LLaMA-2-7B-Instruct, S-LLaMA-1.3B) and techniques like GradCache and LoRA, but it does not specify version numbers for any programming languages, libraries, or frameworks (e.g., Python, PyTorch, Hugging Face Transformers). |
| Experiment Setup | Yes | To fine-tune the model, we optimize the SimCSE loss with in-batch and mined hard negatives. ... We train with a large batch size (2048) with limited GPU memory (Gao et al., 2021a). We train with LoRA instead of full finetuning, with r = 16 and α = 16. We choose τ = 1/50 and a learning rate of 8 × 10⁻⁴. ... We train for half as many steps as classical embeddings. |
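The echo-embeddings recipe quoted above (feed the sentence twice, pool the contextualized embeddings of the second occurrence) can be sketched as a small pooling function. This is a minimal NumPy illustration of the pooling step only, not the paper's released code; the function name `echo_pool` and the toy token positions are hypothetical.

```python
import numpy as np

def echo_pool(hidden_states, second_occurrence_slice):
    # hidden_states: (seq_len, dim) last-layer token embeddings produced
    # by an autoregressive LM run on "prompt + x + x".
    # second_occurrence_slice: positions of the *second* copy of x, whose
    # tokens can attend to the entire first copy despite causal masking.
    return hidden_states[second_occurrence_slice].mean(axis=0)

# Toy illustration: 10 "tokens" with 4-dim states, where positions 6..9
# hold the repeated copy of the input.
rng = np.random.default_rng(0)
states = rng.normal(size=(10, 4))
embedding = echo_pool(states, slice(6, 10))
```

With a real model, `hidden_states` would come from the LM's final layer and the slice from the tokenizer's offsets for the repeated span; only the repeated tokens are pooled, which is what lets each pooled token condition on the full input.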
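The fine-tuning setup optimizes a SimCSE-style contrastive loss with in-batch negatives and temperature τ = 1/50. A minimal NumPy sketch of that in-batch InfoNCE objective (mined hard negatives omitted; the function name `simcse_loss` is hypothetical):

```python
import numpy as np

def simcse_loss(q, p, tau=1 / 50):
    # q, p: (B, dim) query and positive-passage embeddings; the positive
    # for row i is p[i], and all other rows of p serve as negatives.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = q @ p.T / tau                         # cosine sims / temperature
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # cross-entropy on diagonal

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
loss = simcse_loss(q, q.copy())  # identical pairs -> near-zero loss
```

The small τ sharpens the softmax, so each query must score its own passage well above every other passage in the batch; this is why the paper pairs the loss with a large batch size (2048) via GradCache-style accumulation.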