OLMoE: Open Mixture-of-Experts Language Models
Authors: Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-INSTRUCT. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present novel findings on MoE training, define and analyze new routing properties showing high specialization in our model, and open-source all our work: model weights, training data, code, and logs. |
| Researcher Affiliation | Collaboration | Niklas Muennighoff (c,a), Luca Soldaini (a), Dirk Groeneveld (a), Kyle Lo (a), Jacob Morrison (a), Sewon Min (a), Weijia Shi (w), Pete Walsh (a), Oyvind Tafjord (a), Nathan Lambert (a), Yuling Gu (a), Shane Arora (a), Akshita Bhagia (a), Dustin Schwenk (a), David Wadden (a), Alexander Wettig (a,p), Binyuan Hui, Tim Dettmers (a), Douwe Kiela (c), Ali Farhadi (a,w), Noah A. Smith (a,w), Pang Wei Koh (a,w), Amanpreet Singh (c), Hannaneh Hajishirzi (a,w) — a = Allen Institute for AI, c = Contextual AI, w = University of Washington, p = Princeton University |
| Pseudocode | No | The paper describes the model architecture and training process using equations and descriptive text, but does not include any explicitly labeled pseudocode or algorithm blocks. For example, the MoE module is defined in Equation (1). |
| Open Source Code | Yes | We present novel findings on MoE training, define and analyze new routing properties showing high specialization in our model, and open-source all our work: model weights, training data, code, and logs. Code https://github.com/allenai/OLMoE |
| Open Datasets | Yes | We present novel findings on MoE training, define and analyze new routing properties showing high specialization in our model, and open-source all our work: model weights, training data, code, and logs. Data https://hf.co/datasets/allenai/OLMoE-mix-0924 |
| Dataset Splits | Yes | Our evaluation procedure consists of three parts: During pretraining (Appendix F), After pretraining, and After adaptation. We detail the setup for each in Appendix D. During pretraining We evaluate using a similar in-loop evaluation setup as Groeneveld et al. (2024), with the addition of more tasks such as CommonsenseQA, PIQA, and different implementations of MMLU. Following Groeneveld et al. (2024), for the majority of the tasks, we perform 0-shot evaluation using the Completion/Cloze formulation (CF), ranking each answer string using language model probabilities...For MMLU, the in-loop evaluation also includes a setup where we increase the total number of instances by including a range of 0-shot to 5-shot setups...We also evaluate perplexity on selected validation sets from Paloma (Magnusson et al., 2023; Reid et al., 2022; Gao et al., 2020; Soldaini et al., 2024; Liang et al., 2023; Merity et al., 2016). |
| Hardware Specification | Yes | We pretrain OLMoE-1B-7B on 256 H100 GPUs for approximately 10 days with NVLink interconnect across GPUs and InfiniBand interconnect across nodes. We also use H100 GPUs for all our experiments but some use a cluster with GCP TCPx interconnect across nodes instead. For adaptation, we use 32 H100 GPUs for 33 hours to instruction tune and for another 14 hours to preference tune via DPO. For KTO adaptation we use 8 H100 GPUs for 30 hours instead. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer, ZeRO, PyTorch FSDP, and mixed-precision training. However, it does not specify concrete version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | Table C3: Pretraining hyperparameters of OLMoE-1B-7B and comparable models trained from scratch. We highlight rows where OLMoE-1B-7B differs from OLMo-1B. Active params include vocab params. ? = undisclosed settings, FFN = feed-forward network, Attn = Attention, LR = learning rate, WSD = Warmup-Stable-Decay (Hu et al., 2024), LBL = load balancing loss, Inv Sq Root = Inverse Square Root decay (Shazeer & Stern, 2018), trunc = truncation, std = standard deviation, varies = stds that are layer or weight-dependent. Adaptation For finetuning we use Open Instruct (Wang et al., 2023; Ivison et al., 2023). We filter all SFT samples to a length of fewer than 4096 tokens to match the sequence length of the model. Following Muennighoff et al. (2024), we aggregate loss at the token level during SFT to improve performance on long generative tasks, such as AlpacaEval. We finetune in BF16 with a global batch size of 128 (4 H100 nodes with 8 GPUs each, a per device batch size of 2, and 2 gradient accumulation steps). We train for 2 epochs with a constant learning rate of 2.0E-5. For DPO (Rafailov et al., 2023), we reduce the global batch size to 32 (4 H100 nodes with 8 GPUs each and a per device batch size of 1). We train for 3 epochs with a learning rate of 5.0E-7 and a DPO beta of 0.1. |
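The table notes that the paper defines its MoE module in Equation (1) rather than in pseudocode. A minimal numpy sketch of a top-k sparse MoE feed-forward layer of this general shape is below; it is illustrative only, not the paper's implementation, and the expert functions, shapes, and per-token loop are simplifications (OLMoE reports 64 experts with 8 active per token).

```python
import numpy as np

def moe_forward(x, W_router, experts, k=8):
    """Sketch of a top-k sparse MoE layer: each token is routed to the k
    experts with the highest router scores, and their outputs are combined
    with softmax gate weights. Hypothetical shapes, not the paper's code."""
    logits = x @ W_router                        # (tokens, n_experts) router scores
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        gates = np.exp(sel - sel.max())          # softmax over the selected experts only
        gates /= gates.sum()
        for gate, e in zip(gates, topk[t]):
            out[t] += gate * experts[e](x[t])    # gated sum of active expert outputs
    return out
```

Because only k of the experts run per token, compute scales with k (active parameters) rather than with the total expert count, which is the 1B-active / 7B-total distinction the abstract draws.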
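The Experiment Setup row mentions aggregating loss at the token level during SFT (following Muennighoff et al., 2024). A hedged sketch of what that aggregation typically means: sum per-token negative log-likelihoods over the whole batch and divide by the total non-padding token count, instead of averaging each sequence first. The function name and array layout are assumptions for illustration.

```python
import numpy as np

def token_sum_loss(per_token_nll, mask):
    """Token-level loss aggregation: normalize by total tokens in the batch,
    so long sequences contribute proportionally more than under per-sequence
    averaging. `mask` is 1 for real tokens, 0 for padding. Illustrative only."""
    return (per_token_nll * mask).sum() / mask.sum()
```

This is the property the paper leans on for long generative tasks: a long AlpacaEval-style response is not down-weighted to the same influence as a short one.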