NetMoE: Accelerating MoE Training through Dynamic Sample Placement

Authors: Xinyi Liu, Yujie Wang, Fangcheng Fu, Xupeng Miao, Shenhan Zhu, Xiaonan Nie, Bin Cui

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments with 32 GPUs show that NetMoE achieves a maximum efficiency improvement of 1.67× compared with current MoE training frameworks.
Researcher Affiliation | Academia | 1School of CS & Key Lab of High Confidence Software Technologies (MOE), Peking University; 2Purdue University; 3Institute of Computational Social Science, Peking University (Qingdao).
Pseudocode | Yes | Algorithm 1: NetMoE Optimization
Open Source Code | No | The paper does not contain any explicit statements or links indicating the availability of source code for the described methodology.
Open Datasets | No | The paper mentions using GPT model architectures (Radford et al., 2019; Brown et al., 2020) as backbones and refers to training samples/tokens, but does not provide concrete access information (links, DOIs, repositories, or specific citations) for any datasets used in the experiments. The tables describe model configurations rather than datasets.
Dataset Splits | No | The paper defines 'I' as "the number of samples per iteration (a.k.a. global batch size)" but does not provide specific details on how any dataset was split into training, validation, or test sets.
Hardware Specification | Yes | All experiments are conducted on a cluster consisting of 4 nodes, each equipped with 8 NVIDIA A800-SXM4-40GB GPUs. As listed in Table 2, the GPUs within each node are connected via NVLink with a 400 GB/s bandwidth, while the nodes are interconnected via InfiniBand with a 100 GB/s bandwidth.
Software Dependencies | No | NetMoE is implemented on top of PyTorch (Paszke et al., 2019), with custom operations (e.g., the calculation of num, c, c′, and the KM algorithm) implemented in C++ and CUDA. However, specific version numbers for PyTorch, the C++ toolchain, or CUDA are not provided.
Experiment Setup | Yes | The configurations of the evaluated models are listed in Table 3. We select the GPT model architecture (Radford et al., 2019; Brown et al., 2020) as the backbone and replace all FFN layers in each model with MoE layers. In particular, since SmartMoE requires at least 2 experts on each device, we set the number of experts as E = 2J, where J is the number of GPUs in the corresponding experiment, and we fix the number of selected experts for each token as K = 2. By default, we utilize 8 GPUs per node to carry out the experiments, and we present the results for scenarios with fewer GPUs per node in Appendix B. All results are averaged over 50 iterations.
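The Software Dependencies row mentions that NetMoE implements the KM (Kuhn-Munkres) algorithm in C++/CUDA, which solves the sample-placement decision as a minimum-cost assignment problem. As a rough illustration of that underlying problem only (not the paper's implementation), the sketch below finds an optimal sample-to-node assignment by brute force over permutations; the cost matrix values are hypothetical.

```python
from itertools import permutations


def best_placement(cost):
    """Exhaustively find the sample-to-node assignment minimizing total cost.

    cost[i][j] is a hypothetical inter-node communication cost incurred if
    sample i is placed on node j. The KM algorithm solves this assignment
    problem in polynomial time; brute force is exact but only feasible for
    tiny inputs, so it serves purely as an illustration.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost


# Hypothetical 3-sample, 3-node cost matrix.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
placement, total = best_placement(cost)
print(placement, total)  # → (1, 0, 2) 5
```

A production implementation would replace the permutation search with the Hungarian/KM method (O(n³)), which is what makes solving the placement problem every iteration affordable.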