Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
Authors: Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves comparable performance to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. |
| Researcher Affiliation | Collaboration | Taishi Nakamura^{1,2,3}, Takuya Akiba^2, Kazuki Fujii^1, Yusuke Oda^3, Rio Yokota^{1,3}, Jun Suzuki^{4,5,3} (1: Institute of Science Tokyo, 2: Sakana AI, 3: NII LLMC, 4: Tohoku University, 5: RIKEN) EMAIL, EMAIL |
| Pseudocode | No | The paper describes the Drop-Upcycling method in Section 3 and its sub-sections using prose and mathematical equations. It includes an overview diagram (Figure 1) and an illustration of weight initialization (Figure 2), but no explicitly labeled pseudocode or algorithm blocks are present. |
| Open Source Code | Yes | All experimental resources, including source code, training data, model checkpoints and logs, are publicly available to promote reproducibility and future research on MoE. ... Code github.com/Taishi-N324/Drop-Upcycling |
| Open Datasets | Yes | Our training data was obtained from publicly available data. We describe the detailed statistics of the training datasets in Appendix B.1. ... We used the LLM-jp corpus v3, an open corpus curated by the LLM-jp working group, for training English and Japanese bilingual language models. ... Table 5: Statistics of the training dataset. ... English Dolma 1.6 (sampled) (Soldaini et al., 2024) ... Japanese Common Crawl (LLM-jp, 2024) ... Code The Stack (Kocetkov et al., 2023) |
| Dataset Splits | No | The paper specifies the total number of tokens for training models ("dense models were trained on 1T tokens, and MoE models were trained on 500B tokens") and lists the composition of the training dataset by language and source (Table 6). However, it does not explicitly provide information on how the main LLM-jp corpus was split into training, validation, or test sets for the models being trained, nor does it specify splits for the various evaluation datasets beyond noting they are 'validation sets' from other works. |
| Hardware Specification | Yes | For our experiments with Mo E models and the training of the 1.5B Dense model, we utilized the TSUBAME 4.0 supercomputer at the Global Scientific Information and Computing Center, Institute of Science Tokyo. This environment is equipped with NVIDIA H100 SXM5 94GB GPUs, with each node housing 4 H100 GPUs. ... For the training of the 152M and 3.7B Dense models, we leveraged the high-performance computing nodes (PHY) provided by Sakura Internet. This setup features NVIDIA H100 80GB GPUs, with each node containing 8 H100 GPUs. |
| Software Dependencies | Yes | For implementation, we used Megatron-LM for Dense model training, and moe-recipes for MoE model training. Additionally, Flash Attention 2 (Dao, 2024) was utilized to improve computational efficiency and reduce memory usage. All the training processes were conducted using bfloat16 precision. ... https://github.com/rioyokotalab/moe-recipes, Version 1.0.0 |
| Experiment Setup | Yes | As shared settings for training all models, we adopted the following hyperparameters: AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9, β2 = 0.95, and ε = 10⁻⁸, sequence length of 4096, weight decay of 0.1, and gradient clipping of 1.0. The global batch size was set to 1024 for the 1.5B, 3.7B and 13B models, and 512 for the 152M model. We used cosine decay for learning rate scheduling. For Dense models, the maximum learning rate was set to 3×10⁻⁴, and it decayed to 3×10⁻⁵ over 1,000B tokens... For MoE models, the maximum learning rate was set to 2×10⁻⁴, and it decayed to 2×10⁻⁵ over 500B tokens. Additionally, to prevent instability in training due to unbalanced routing on the MoE models, a load balancing loss was introduced, with the coefficient unified at 0.02 across all MoE models. |
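The learning-rate schedule quoted above can be reproduced from its stated endpoints. Below is a minimal sketch of a standard cosine decay, using the Dense-model values from the table (3×10⁻⁴ decaying to 3×10⁻⁵ over 1,000B tokens); the function name and any warmup handling are assumptions, not part of the paper's released code.

```python
import math

def cosine_lr(tokens_seen: float, total_tokens: float,
              max_lr: float = 3e-4, min_lr: float = 3e-5) -> float:
    """Cosine decay from max_lr down to min_lr over total_tokens.

    Sketch of the schedule described in the experiment setup; the paper
    does not give this formula explicitly, only the endpoints and the
    fact that cosine decay was used.
    """
    progress = min(tokens_seen / total_tokens, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Start and end of the Dense-model schedule (1,000B tokens).
lr_start = cosine_lr(0, 1_000e9)        # max_lr at the start
lr_end = cosine_lr(1_000e9, 1_000e9)    # min_lr at the end
```

For the MoE models, the same function with `max_lr=2e-4`, `min_lr=2e-5`, and `total_tokens=500e9` matches the reported settings.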