NetMoE: Accelerating MoE Training through Dynamic Sample Placement

Authors: Xinyi Liu, Yujie Wang, Fangcheng Fu, Xupeng Miao, Shenhan Zhu, Xiaonan Nie, Bin Cui

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments with 32 GPUs show that NetMoE achieves a maximum efficiency improvement of 1.67× compared with current MoE training frameworks.
Researcher Affiliation | Academia | 1School of CS & Key Lab of High Confidence Software Technologies (MOE), Peking University; 2Purdue University; 3Institute of Computational Social Science, Peking University (Qingdao).
Pseudocode | Yes | Algorithm 1: NetMoE Optimization
Open Source Code | No | The paper does not contain any explicit statements or links indicating the availability of source code for the described methodology.
Open Datasets | No | The paper mentions using GPT model architectures (Radford et al., 2019; Brown et al., 2020) as backbones and refers to training samples/tokens, but does not provide concrete access information (links, DOIs, repositories, or specific citations) for any datasets used in the experiments. The tables describe model configurations rather than datasets.
Dataset Splits | No | The paper defines 'I' as "the number of samples per iteration (a.k.a. global batch size)" but does not provide specific details on how any dataset was split into training, validation, or test sets.
Hardware Specification | Yes | All experiments are conducted on a cluster consisting of 4 nodes, each equipped with 8 NVIDIA A800-SXM4-40GB GPUs. As listed in Table 2, the GPUs within each node are connected via NVLink with a 400 GB/s bandwidth, while the nodes are interconnected via InfiniBand with a 100 GB/s bandwidth.
Software Dependencies | No | NetMoE is implemented on top of PyTorch (Paszke et al., 2019), with custom operations (e.g., the calculation of num, c, c′, and the KM algorithm) implemented in C++ and CUDA. However, specific version numbers for PyTorch, the C++ toolchain, or CUDA are not provided.
Experiment Setup | Yes | The configurations of the evaluated models are listed in Table 3. We select the GPT model architecture (Radford et al., 2019; Brown et al., 2020) as the backbone and replace all FFN layers in each model with MoE layers. In particular, since SmartMoE requires at least 2 experts on each device, we set the number of experts as E = 2J, where J is the number of GPUs in the corresponding experiment, and we fix the number of selected experts for each token as K = 2. By default, we utilize 8 GPUs per node to carry out the experiments, and we present the results for scenarios with fewer GPUs per node in Appendix B. All results are averaged over 50 iterations.
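The Software Dependencies row mentions that NetMoE implements the KM (Kuhn-Munkres) algorithm in C++/CUDA, which solves the sample-placement decision as a minimum-cost assignment problem. As a rough illustration of that underlying problem only (not the paper's implementation), the sketch below finds an optimal sample-to-node assignment by brute force over permutations; the cost matrix values are hypothetical.

```python
from itertools import permutations


def best_placement(cost):
    """Exhaustively find the sample-to-node assignment minimizing total cost.

    cost[i][j] is a hypothetical inter-node communication cost incurred if
    sample i is placed on node j. The KM algorithm solves this assignment
    problem in polynomial time; brute force is exact but only feasible for
    tiny inputs, so it serves purely as an illustration.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost


# Hypothetical 3-sample, 3-node cost matrix.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
placement, total = best_placement(cost)
print(placement, total)  # → (1, 0, 2) 5
```

A production implementation would replace the permutation search with the Hungarian/KM method (O(n³)), which is what makes solving the placement problem every iteration affordable.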