NetMoE: Accelerating MoE Training through Dynamic Sample Placement
Authors: Xinyi Liu, Yujie Wang, Fangcheng Fu, Xupeng Miao, Shenhan Zhu, Xiaonan Nie, Bin Cui
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with 32 GPUs show that NetMoE achieves a maximum efficiency improvement of 1.67× compared with state-of-the-art MoE training frameworks. |
| Researcher Affiliation | Academia | ¹School of CS & Key Lab of High Confidence Software Technologies (MOE), Peking University; ²Purdue University; ³Institute of Computational Social Science, Peking University (Qingdao) |
| Pseudocode | Yes | Algorithm 1: NetMoE Optimization |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating the availability of source code for the described methodology. |
| Open Datasets | No | The paper mentions using GPT model architectures (Radford et al., 2019; Brown et al., 2020) as backbones and refers to training samples/tokens, but does not provide concrete access information (links, DOIs, repositories, or specific citations) for any datasets used in the experiments. The tables describe model configurations rather than datasets. |
| Dataset Splits | No | The paper defines 'I' as 'The number of samples per iteration (a.k.a. global batch size)' but does not provide specific details on how any dataset was split into training, validation, or test sets. |
| Hardware Specification | Yes | All experiments are conducted on a cluster consisting of 4 nodes, each equipped with 8 NVIDIA A800-SXM4-40GB GPUs. As listed in Table 2, the GPUs within each node are connected via NVLink with a 400 GB/s bandwidth, while the nodes are interconnected via InfiniBand with a 100 GB/s bandwidth. |
| Software Dependencies | No | NetMoE is implemented on top of PyTorch (Paszke et al., 2019), with custom operations (e.g., the calculation of num, c, c, and the KM algorithm) implemented in C++ and CUDA. However, specific version numbers for PyTorch, C++, or CUDA are not provided. |
| Experiment Setup | Yes | The configurations of the evaluated models are listed in Table 3. We select the GPT model architecture (Radford et al., 2019; Brown et al., 2020) as the backbone and replace all FFN layers in each model with MoE layers. In particular, since SmartMoE requires at least 2 experts on each device, we set the number of experts as E = 2J, where J is the number of GPUs in the corresponding experiment, and we fix the number of selected experts for each token as K = 2. By default, we utilize 8 GPUs per node to carry out the experiments, and we present the results for scenarios with fewer GPUs per node in Appendix B. All results are averaged over 50 iterations. |
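The extracted rows above mention that NetMoE's custom ops include "the KM algorithm", i.e. the Kuhn-Munkres (Hungarian) method, which the paper uses to solve the sample-to-device placement as an assignment problem. As a minimal, hypothetical sketch of that formulation (the cost matrix values and slot semantics below are illustrative, not taken from the paper), the same optimization can be reproduced with SciPy's `linear_sum_assignment`:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative assumption: cost[i][j] approximates the inter-node
# all-to-all communication cost of putting sample-slot i on device
# slot j. NetMoE minimizes the total cost over a one-to-one placement.
cost = np.array([
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
])

# The KM (Kuhn-Munkres) algorithm solves this assignment problem;
# linear_sum_assignment implements the same polynomial-time solver.
rows, cols = linear_sum_assignment(cost)
total = int(cost[rows, cols].sum())
print(list(cols), total)  # optimal column per row, and minimal total cost
```

Here the solver picks the permutation of device slots with the smallest summed cost; the paper's actual cost model and its C++/CUDA implementation details are not available, so this only conveys the shape of the optimization.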