MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts
Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate that MoE++ achieves better performance while delivering 1.1–2.1× expert forward throughput compared to a vanilla MoE model of the same size, which lays a solid foundation for developing advanced and efficient MoE-related models. |
| Researcher Affiliation | Collaboration | Peng Jin (1,2), Bo Zhu (3), Li Yuan (1,2,4), Shuicheng Yan (3,5). 1: Pengcheng Laboratory; 2: School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University; 3: Kunlun 2050 Research & Skywork AI; 4: Rabbitpre Intelligence; 5: National University of Singapore. EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods in text and mathematical formulas (Eq. 1-10) without presenting any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/SkyworkAI/MoE-plus-plus |
| Open Datasets | Yes | MoE++ is trained exclusively on public datasets, making it accessible for academic research settings. Specifically, we sample from the RedPajama (Computer, 2023a), Dolma (Soldaini et al., 2024), and Pile (Gao et al., 2020) datasets according to different sampling probabilities. |
| Dataset Splits | No | The paper specifies sampling ratios for assembling the training corpus from multiple datasets (RedPajama, Dolma, Pile) in Appendix B.1 (Table D) and mentions training budgets (e.g., 100B tokens, 1T tokens). It evaluates on various downstream tasks using lm-evaluation-harness with different shot settings. However, it does not explicitly provide training/validation/test splits (e.g., as percentages or exact counts) for the pre-training corpus itself, nor does it detail how internal validation was performed during pre-training. |
| Hardware Specification | Yes | We conduct training on a cluster with 4 nodes and 32 A100 GPUs. |
| Software Dependencies | No | The paper mentions using 'Megatron (Shoeybi et al., 2019) as the training framework', 'the tokenizer of LLaMA2', 'AdamW optimizer (Loshchilov & Hutter, 2017)', and 'the lm-evaluation-harness package (Gao et al., 2024)'. However, specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | The weight β for the heterogeneous load balance loss is set to 0.01, and the expert capacity factor γ is set to 1.1. MoE++ is trained using the AdamW optimizer... During training, a weight decay of 0.1 and gradient clipping of 1.0 are applied. Maximum learning rate 5e-4, final learning rate 5e-5, LR warmup init 1e-7, LR warmup iters 2000, sequence length 2048, batch size 4M tokens. |
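The hyperparameters reported in the setup row can be collected into a single config for reproduction attempts. The sketch below also includes a learning-rate schedule; note that the decay shape (cosine) is an assumption, since the paper states only the maximum, final, and warmup-init learning rates and the warmup iteration count:

```python
import math

# Hyperparameters reported for MoE++ pre-training.
CONFIG = {
    "load_balance_loss_weight": 0.01,  # beta, heterogeneous load balance loss
    "expert_capacity_factor": 1.1,     # gamma
    "max_lr": 5e-4,
    "final_lr": 5e-5,
    "warmup_init_lr": 1e-7,
    "warmup_iters": 2000,
    "sequence_length": 2048,
    "batch_size_tokens": 4_000_000,
    "weight_decay": 0.1,
    "grad_clip": 1.0,
}

def learning_rate(step: int, total_steps: int, cfg: dict = CONFIG) -> float:
    """Linear warmup followed by decay to the final learning rate.

    Cosine decay is an assumption: the paper reports only the endpoint
    learning rates and warmup settings, not the decay curve.
    """
    if step < cfg["warmup_iters"]:
        # Linear warmup from warmup_init_lr to max_lr.
        frac = step / cfg["warmup_iters"]
        return cfg["warmup_init_lr"] + frac * (cfg["max_lr"] - cfg["warmup_init_lr"])
    # Assumed cosine decay from max_lr to final_lr over the remaining steps.
    frac = (step - cfg["warmup_iters"]) / max(1, total_steps - cfg["warmup_iters"])
    return cfg["final_lr"] + 0.5 * (cfg["max_lr"] - cfg["final_lr"]) * (1 + math.cos(math.pi * frac))
```

The schedule hits the reported anchor points exactly: 1e-7 at step 0, 5e-4 at the end of warmup, and 5e-5 at the final step.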