Generative Representational Instruction Tuning

Authors: Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, Douwe Kiela

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we first outline our experimental setup in 3.1. In 3.2, we discuss and benchmark the embedding and generative performance of our models. In Appendix B, we ablate the settings that led to our final models, including training data, precision, pooling, sequence length, and loss weights. We find that GRITLM 7B outperforms various prior open models on the Massive Text Embedding Benchmark (Muennighoff et al., 2023c) while still outperforming a range of generative models up to its size of 7 billion parameters.
Researcher Affiliation Collaboration Niklas Muennighoff (Contextual AI), Hongjin Su (The University of Hong Kong), Liang Wang (Microsoft Corporation), Nan Yang (Microsoft Corporation), Furu Wei (Microsoft Corporation), Tao Yu (The University of Hong Kong), Amanpreet Singh (Contextual AI), Douwe Kiela (Contextual AI)
Pseudocode Yes Listing 1: Splitting of the embedding pass to save memory, simplified.
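The split described in Listing 1 can be sketched as follows. This is a hypothetical simplification, not the paper's actual code: a large embedding batch is processed in smaller sub-batches so that peak activation memory is bounded by the sub-batch size rather than the full batch, and `encode` is a stand-in for the model's embedding forward pass.

```python
def encode(batch):
    # Placeholder encoder: maps each "token id" list to its mean.
    # In the real model this would be a transformer forward pass plus pooling.
    return [sum(x) / len(x) for x in batch]

def encode_in_chunks(batch, chunk_size):
    """Run the embedding pass chunk by chunk and concatenate the results.

    Peak memory now scales with `chunk_size` instead of `len(batch)`.
    """
    embeddings = []
    for start in range(0, len(batch), chunk_size):
        embeddings.extend(encode(batch[start:start + chunk_size]))
    return embeddings

batch = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
# Chunked and unchunked passes produce identical embeddings.
assert encode_in_chunks(batch, chunk_size=2) == encode(batch)
```

In the real training loop the per-chunk forward passes would additionally be run without gradients where possible, which is what makes the split worthwhile for the large contrastive batches used here.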
Open Source Code Yes Models, code, etc. are freely available at https://github.com/ContextualAI/gritlm.
Open Datasets Yes We finetune our final models from Mistral 7B (Jiang et al., 2023) and Mixtral 8x7B (Jiang et al., 2024) using adaptations of E5 (Wang et al., 2024) and the Tülu 2 data (Ivison et al., 2023). For E5, we adapt it by adding S2ORC (Lo et al., 2020) to increase its scientific data (E5S), while for Tülu 2 we filter out their custom prompts that contain answers related to the origin of their model. For embedding performance we evaluate using the 56 main datasets from MTEB (Muennighoff et al., 2023c).
Dataset Splits Yes For GRITLM 7B, we use a batch size of 2048 for embedding data and 256 for generative data and we train the model for a total of 1253 steps corresponding to one epoch on the generative data and 1.36 epochs on the embedding data. For embedding performance we evaluate using the 56 main datasets from MTEB (Muennighoff et al., 2023c). For generative performance, we largely follow the evaluation setup of Ivison et al. (2023) except that we use the HumanEvalSynthesize (Muennighoff et al., 2023a) variant of HumanEval, as it is more adequate for instruction-following models. We benchmark the caching variants on Natural Questions (Kwiatkowski et al., 2019) using 2,681,468 documents from BEIR NQ (Thakur et al., 2021) as our index.
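The reported step count and batch sizes can be cross-checked against the stated epoch counts. The implied dataset sizes below are inferred from those figures, not stated in the quoted text:

```python
# Reported training schedule for GRITLM 7B.
steps = 1253
gen_batch = 256    # generative batch size
emb_batch = 2048   # embedding batch size

# Samples seen over training.
gen_samples_seen = steps * gen_batch   # one epoch => generative set has ~320,768 samples
emb_samples_seen = steps * emb_batch   # 1.36 epochs over the embedding data

# Implied embedding dataset size (an inference, not a paper figure).
implied_emb_dataset = emb_samples_seen / 1.36

print(gen_samples_seen)              # 320768
print(emb_samples_seen)              # 2566144
print(round(implied_emb_dataset))    # ~1.89M embedding pairs
```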
Hardware Specification Yes CPU and GPU latencies are measured on an Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz and one NVIDIA H100 80GB HBM3, respectively. For the training of GRITLM 7B, we used 8 nodes with 8 NVIDIA A100 80GB GPUs each for 48 hours corresponding to 3,072 GPU hours. Meanwhile for GRITLM 8x7B, we used 32 nodes with 8 NVIDIA H100 80GB GPUs each for 80 hours corresponding to 20,480 GPU hours.
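The stated GPU-hour totals are internally consistent, as a quick check of nodes × GPUs per node × wall-clock hours confirms:

```python
def gpu_hours(nodes, gpus_per_node, hours):
    """Total GPU hours for a multi-node training run."""
    return nodes * gpus_per_node * hours

assert gpu_hours(8, 8, 48) == 3072     # GRITLM 7B on A100s
assert gpu_hours(32, 8, 80) == 20480   # GRITLM 8x7B on H100s
```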
Software Dependencies No We use PyTorch FSDP (Zhao et al., 2023), gradient checkpointing, BF16 mixed precision training, and strategies outlined in Appendix L. During training, we use a sequence length of 2048 for generative samples, 256 for embedding queries, and 2048 for embedding documents unless otherwise specified. We finetune using the Adam optimizer (Kingma & Ba, 2017) with beta1=0.9 and beta2=0.999 and no weight decay. We also use Flash-Attention 2 (Dao et al., 2022; Dao, 2023) via PyTorch SDPA. The paper mentions software components like PyTorch FSDP and Flash-Attention 2, but does not specify their version numbers.
Experiment Setup Yes For GRITLM 7B, we use a batch size of 2048 for embedding data and 256 for generative data and we train the model for a total of 1253 steps corresponding to one epoch on the generative data and 1.36 epochs on the embedding data. Our learning rate is 2e-5, we use 3% of steps for linear warm-up of the learning rate and decay it linearly to 0 over training. To save memory, we use PyTorch FSDP (Zhao et al., 2023), gradient checkpointing, BF16 mixed precision training, and strategies outlined in Appendix L. During training, we use a sequence length of 2048 for generative samples, 256 for embedding queries, and 2048 for embedding documents unless otherwise specified. We finetune using the Adam optimizer (Kingma & Ba, 2017) with beta1=0.9 and beta2=0.999 and no weight decay. We also use Flash-Attention 2 (Dao et al., 2022; Dao, 2023) via PyTorch SDPA.
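The stated learning-rate schedule (peak 2e-5, linear warmup over the first 3% of 1253 steps, then linear decay to 0) can be sketched as a minimal per-step function. Only the schedule shape and constants come from the quoted setup; the function name and exact step boundaries are assumptions:

```python
def learning_rate(step, total_steps=1253, peak_lr=2e-5, warmup_frac=0.03):
    """Linear warmup over warmup_frac of training, then linear decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))  # 37 steps here
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Decay linearly from peak_lr down to 0 at the final step.
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step - 1) / remaining)
```

In practice the same shape is available off the shelf, e.g. Hugging Face's `get_linear_schedule_with_warmup`, so this sketch mainly serves to make the 3%/linear-decay description concrete.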