Generative Representational Instruction Tuning

Authors: Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, Douwe Kiela

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we first outline our experimental setup in 3.1. In 3.2, we discuss and benchmark the embedding and generative performance of our models. In Appendix B, we ablate the settings that led to our final models, including training data, precision, pooling, sequence length, and loss weights. We find that GRITLM 7B outperforms various prior open models on the Massive Text Embedding Benchmark (Muennighoff et al., 2023c) while still outperforming a range of generative models up to its size of 7 billion parameters.
Researcher Affiliation Collaboration Niklas Muennighoff (Contextual AI), Hongjin Su (The University of Hong Kong), Liang Wang (Microsoft Corporation), Nan Yang (Microsoft Corporation), Furu Wei (Microsoft Corporation), Tao Yu (The University of Hong Kong), Amanpreet Singh (Contextual AI), Douwe Kiela (Contextual AI)
Pseudocode Yes Listing 1: Splitting of the embedding pass to save memory, simplified.
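The split described in Listing 1 can be sketched as follows. This is a hypothetical simplification, not the paper's actual code: a large embedding batch is processed in smaller sub-batches so that peak activation memory is bounded by the sub-batch size rather than the full batch, and `encode` is a stand-in for the model's embedding forward pass.

```python
def encode(batch):
    # Placeholder encoder: maps each "token id" list to its mean.
    # In the real model this would be a transformer forward pass plus pooling.
    return [sum(x) / len(x) for x in batch]

def encode_in_chunks(batch, chunk_size):
    """Run the embedding pass chunk by chunk and concatenate the results.

    Peak memory now scales with `chunk_size` instead of `len(batch)`.
    """
    embeddings = []
    for start in range(0, len(batch), chunk_size):
        embeddings.extend(encode(batch[start:start + chunk_size]))
    return embeddings

batch = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
# Chunked and unchunked passes produce identical embeddings.
assert encode_in_chunks(batch, chunk_size=2) == encode(batch)
```

In the real training loop the per-chunk forward passes would additionally be run without gradients where possible, which is what makes the split worthwhile for the large contrastive batches used here.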
Open Source Code Yes Models, code, etc. are freely available at https://github.com/ContextualAI/gritlm.
Open Datasets Yes We finetune our final models from Mistral 7B (Jiang et al., 2023) and Mixtral 8x7B (Jiang et al., 2024) using adaptations of E5 (Wang et al., 2024) and the Tülu 2 data (Ivison et al., 2023). For E5, we adapt it by adding S2ORC (Lo et al., 2020) to increase its scientific data (E5S), while for Tülu 2 we filter out their custom prompts that contain answers related to the origin of their model. For embedding performance we evaluate using the 56 main datasets from MTEB (Muennighoff et al., 2023c).
Dataset Splits Yes For GRITLM 7B, we use a batch size of 2048 for embedding data and 256 for generative data and we train the model for a total of 1253 steps corresponding to one epoch on the generative data and 1.36 epochs on the embedding data. For embedding performance we evaluate using the 56 main datasets from MTEB (Muennighoff et al., 2023c). For generative performance, we largely follow the evaluation setup of Ivison et al. (2023) except that we use the HumanEvalSynthesize (Muennighoff et al., 2023a) variant of HumanEval, as it is more adequate for instruction-following models. We benchmark the caching variants on Natural Questions (Kwiatkowski et al., 2019) using 2,681,468 documents from BEIR NQ (Thakur et al., 2021) as our index.
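The reported step count and batch sizes can be cross-checked against the stated epoch counts. The implied dataset sizes below are inferred from those figures, not stated in the quoted text:

```python
# Reported training schedule for GRITLM 7B.
steps = 1253
gen_batch = 256    # generative batch size
emb_batch = 2048   # embedding batch size

# Samples seen over training.
gen_samples_seen = steps * gen_batch   # one epoch => generative set has ~320,768 samples
emb_samples_seen = steps * emb_batch   # 1.36 epochs over the embedding data

# Implied embedding dataset size (an inference, not a paper figure).
implied_emb_dataset = emb_samples_seen / 1.36

print(gen_samples_seen)              # 320768
print(emb_samples_seen)              # 2566144
print(round(implied_emb_dataset))    # ~1.89M embedding pairs
```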
Hardware Specification Yes CPU and GPU latencies are measured on an Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz and one NVIDIA H100 80GB HBM3, respectively. For the training of GRITLM 7B, we used 8 nodes with 8 NVIDIA A100 80GB GPUs each for 48 hours corresponding to 3,072 GPU hours. Meanwhile for GRITLM 8x7B, we used 32 nodes with 8 NVIDIA H100 80GB GPUs each for 80 hours corresponding to 20,480 GPU hours.
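The stated GPU-hour totals are internally consistent, as a quick check of nodes × GPUs per node × wall-clock hours confirms:

```python
def gpu_hours(nodes, gpus_per_node, hours):
    """Total GPU hours for a multi-node training run."""
    return nodes * gpus_per_node * hours

assert gpu_hours(8, 8, 48) == 3072     # GRITLM 7B on A100s
assert gpu_hours(32, 8, 80) == 20480   # GRITLM 8x7B on H100s
```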
Software Dependencies No We use PyTorch FSDP (Zhao et al., 2023), gradient checkpointing, BF16 mixed precision training, and strategies outlined in Appendix L. During training, we use a sequence length of 2048 for generative samples, 256 for embedding queries, and 2048 for embedding documents unless otherwise specified. We finetune using the Adam optimizer (Kingma & Ba, 2017) with beta1=0.9 and beta2=0.999 and no weight decay. We also use Flash-Attention 2 (Dao et al., 2022; Dao, 2023) via PyTorch SDPA. The paper mentions software components like PyTorch FSDP and Flash-Attention 2, but does not specify their version numbers.
Experiment Setup Yes For GRITLM 7B, we use a batch size of 2048 for embedding data and 256 for generative data and we train the model for a total of 1253 steps corresponding to one epoch on the generative data and 1.36 epochs on the embedding data. Our learning rate is 2e-5, we use 3% of steps for linear warm-up of the learning rate and decay it linearly to 0 over training. To save memory, we use PyTorch FSDP (Zhao et al., 2023), gradient checkpointing, BF16 mixed precision training, and strategies outlined in Appendix L. During training, we use a sequence length of 2048 for generative samples, 256 for embedding queries, and 2048 for embedding documents unless otherwise specified. We finetune using the Adam optimizer (Kingma & Ba, 2017) with beta1=0.9 and beta2=0.999 and no weight decay. We also use Flash-Attention 2 (Dao et al., 2022; Dao, 2023) via PyTorch SDPA.
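The stated learning-rate schedule (peak 2e-5, linear warmup over the first 3% of 1253 steps, then linear decay to 0) can be sketched as a minimal per-step function. Only the schedule shape and constants come from the quoted setup; the function name and exact step boundaries are assumptions:

```python
def learning_rate(step, total_steps=1253, peak_lr=2e-5, warmup_frac=0.03):
    """Linear warmup over warmup_frac of training, then linear decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))  # 37 steps here
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Decay linearly from peak_lr down to 0 at the final step.
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step - 1) / remaining)
```

In practice the same shape is available off the shelf, e.g. Hugging Face's `get_linear_schedule_with_warmup`, so this sketch mainly serves to make the 3%/linear-decay description concrete.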