MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts

Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate that MoE++ achieves better performance while delivering 1.1x-2.1x expert forward throughput compared to a vanilla MoE model of the same size, which lays a solid foundation for developing advanced and efficient MoE-related models.
Researcher Affiliation | Collaboration | Peng Jin (1,2), Bo Zhu (3), Li Yuan (1,2,4), Shuicheng Yan (3,5). 1: Pengcheng Laboratory; 2: School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University; 3: Kunlun 2050 Research & Skywork AI; 4: Rabbitpre Intelligence; 5: National University of Singapore.
Pseudocode | No | The paper describes methods in text and mathematical formulas (Eq. 1-10) without presenting any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/SkyworkAI/MoE-plus-plus
Open Datasets | Yes | MoE++ is trained exclusively on public datasets, making it accessible for academic research settings. Specifically, the training corpus is sampled from the RedPajama (Computer, 2023a), Dolma (Soldaini et al., 2024), and Pile (Gao et al., 2020) datasets according to different sampling probabilities.
Dataset Splits | No | The paper specifies sampling ratios for assembling the training corpus from multiple datasets (RedPajama, Dolma, Pile) in Appendix B.1 (Table D) and mentions training budgets (e.g., 100B tokens, 1T tokens). It evaluates on various downstream tasks using lm-evaluation-harness with different few-shot settings. However, it does not explicitly provide training/validation/test splits (e.g., as percentages or exact counts) for the large pre-training corpus itself, nor does it detail how internal validation was performed during pre-training.
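The corpus-assembly step described in the rows above (drawing training documents from RedPajama, Dolma, and Pile according to per-dataset sampling probabilities) can be sketched as follows. The probabilities used here are hypothetical placeholders, since the paper's actual ratios appear only in its Appendix B.1 (Table D):

```python
import random

# Hypothetical sampling probabilities -- the paper's real ratios are
# reported in its Appendix B.1 (Table D), not reproduced here.
MIX = {"RedPajama": 0.5, "Dolma": 0.3, "Pile": 0.2}

def sample_source(rng: random.Random) -> str:
    """Pick which dataset the next training document is drawn from."""
    names = list(MIX)
    weights = [MIX[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Sanity check: empirical frequencies should approximate the target mix.
rng = random.Random(0)
counts = {n: 0 for n in MIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
```

This is only the mixing policy; in a real pre-training pipeline the same weights would drive an interleaved data loader rather than per-document draws.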
Hardware Specification | Yes | Training is conducted on a cluster with 4 nodes and 32 A100 GPUs.
Software Dependencies | No | The paper mentions using Megatron (Shoeybi et al., 2019) as the training framework, the tokenizer of LLaMA2, the AdamW optimizer (Loshchilov & Hutter, 2017), and the lm-evaluation-harness package (Gao et al., 2024). However, specific version numbers for these software components are not provided.
Experiment Setup | Yes | The weight β of the heterogeneous load balance loss is set to 0.01, and the expert capacity factor γ to 1.1. MoE++ is trained using the AdamW optimizer... During training, a weight decay of 0.1 and gradient clipping of 1.0 are applied. Maximum learning rate 5e-4, final learning rate 5e-5, LR warmup init 1e-7, LR warmup iters 2000, sequence length 2048, batch size 4M tokens.
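The learning-rate schedule reported in the row above (linear warmup from 1e-7 to 5e-4 over 2000 iterations, decaying to a final 5e-5) can be sketched as a small function. Note the cosine decay shape is an assumption, as the row names the endpoints but not the decay type:

```python
import math

def moepp_lr(step: int, total_steps: int,
             max_lr: float = 5e-4, final_lr: float = 5e-5,
             warmup_init: float = 1e-7, warmup_iters: int = 2000) -> float:
    """Learning rate at a given step: linear warmup, then decay.

    Endpoint values match the hyperparameters in the table row;
    the cosine decay shape itself is an assumption.
    """
    if step < warmup_iters:
        # Linear warmup from warmup_init up to max_lr.
        return warmup_init + (max_lr - warmup_init) * step / warmup_iters
    # Cosine decay from max_lr down to final_lr over the remaining steps.
    progress = (step - warmup_iters) / max(1, total_steps - warmup_iters)
    return final_lr + 0.5 * (max_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```

In a Megatron-style setup this function would typically be realized through the framework's built-in scheduler flags rather than hand-rolled code.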