MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts

Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate that MoE++ achieves better performance while delivering 1.1x-2.1x expert forward throughput compared to a vanilla MoE model of the same size, which lays a solid foundation for developing advanced and efficient MoE-related models.
Researcher Affiliation | Collaboration | Peng Jin (1,2), Bo Zhu (3), Li Yuan (1,2,4), Shuicheng Yan (3,5). 1: Pengcheng Laboratory; 2: School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University; 3: Kunlun 2050 Research & Skywork AI; 4: Rabbitpre Intelligence; 5: National University of Singapore.
Pseudocode | No | The paper describes methods in text and mathematical formulas (Eq. 1-10) without presenting any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/SkyworkAI/MoE-plus-plus
Open Datasets | Yes | MoE++ is trained exclusively on public datasets, making it accessible for academic research settings. Specifically, the training corpus is sampled from the RedPajama (Computer, 2023a), Dolma (Soldaini et al., 2024), and Pile (Gao et al., 2020) datasets according to different sampling probabilities.
Dataset Splits | No | The paper specifies sampling ratios for assembling the training corpus from multiple datasets (RedPajama, Dolma, Pile) in Appendix B.1 (Table D) and mentions training budgets (e.g., 100B tokens, 1T tokens). It evaluates on various downstream tasks using lm-evaluation-harness with different few-shot settings. However, it does not explicitly provide training/validation/test splits (e.g., as percentages or exact counts) for the large pre-training corpus itself, nor does it detail how internal validation was performed during pre-training.
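The corpus-assembly step described in the rows above (drawing training documents from RedPajama, Dolma, and Pile according to per-dataset sampling probabilities) can be sketched as follows. The probabilities used here are hypothetical placeholders, since the paper's actual ratios appear only in its Appendix B.1 (Table D):

```python
import random

# Hypothetical sampling probabilities -- the paper's real ratios are
# reported in its Appendix B.1 (Table D), not reproduced here.
MIX = {"RedPajama": 0.5, "Dolma": 0.3, "Pile": 0.2}

def sample_source(rng: random.Random) -> str:
    """Pick which dataset the next training document is drawn from."""
    names = list(MIX)
    weights = [MIX[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Sanity check: empirical frequencies should approximate the target mix.
rng = random.Random(0)
counts = {n: 0 for n in MIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
```

This is only the mixing policy; in a real pre-training pipeline the same weights would drive an interleaved data loader rather than per-document draws.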
Hardware Specification | Yes | Training is conducted on a cluster with 4 nodes and 32 A100 GPUs.
Software Dependencies | No | The paper mentions using Megatron (Shoeybi et al., 2019) as the training framework, the tokenizer of LLaMA2, the AdamW optimizer (Loshchilov & Hutter, 2017), and the lm-evaluation-harness package (Gao et al., 2024). However, specific version numbers for these software components are not provided.
Experiment Setup | Yes | The weight β of the heterogeneous load balance loss is set to 0.01, and the expert capacity factor γ to 1.1. MoE++ is trained using the AdamW optimizer... During training, a weight decay of 0.1 and gradient clipping of 1.0 are applied. Maximum learning rate 5e-4, final learning rate 5e-5, LR warmup init 1e-7, LR warmup iters 2000, sequence length 2048, batch size 4M tokens.
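The learning-rate schedule reported in the row above (linear warmup from 1e-7 to 5e-4 over 2000 iterations, decaying to a final 5e-5) can be sketched as a small function. Note the cosine decay shape is an assumption, as the row names the endpoints but not the decay type:

```python
import math

def moepp_lr(step: int, total_steps: int,
             max_lr: float = 5e-4, final_lr: float = 5e-5,
             warmup_init: float = 1e-7, warmup_iters: int = 2000) -> float:
    """Learning rate at a given step: linear warmup, then decay.

    Endpoint values match the hyperparameters in the table row;
    the cosine decay shape itself is an assumption.
    """
    if step < warmup_iters:
        # Linear warmup from warmup_init up to max_lr.
        return warmup_init + (max_lr - warmup_init) * step / warmup_iters
    # Cosine decay from max_lr down to final_lr over the remaining steps.
    progress = (step - warmup_iters) / max(1, total_steps - warmup_iters)
    return final_lr + 0.5 * (max_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```

In a Megatron-style setup this function would typically be realized through the framework's built-in scheduler flags rather than hand-rolled code.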