Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts

Authors: Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, Jiayi Huang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The experimental results reveal that, compared to the standard top-2 MoE, our proposed ScMoE architecture optimally accelerates training by 1.49× and 1.14× in 8×A30-PCIe and 8×A800-NVLink scenarios characterized by high and low communication overheads, respectively, and accelerates inference by 1.82× and 1.21×. Moreover, our experiments and analyses indicate that ScMoE not only achieves comparable but in some instances surpasses the model quality of existing approaches. We conduct empirical evaluation and theoretical analysis on our methods, confirming that they accelerate MoE models while achieving comparable or even better model quality than existing methods, and offer in-depth analysis and discussion of the effectiveness of the proposed shortcut connection.
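For context on the baseline named above: a top-2 MoE routes each token to its two highest-scoring experts, and under expert parallelism the dispatch of tokens to experts is the communication step whose overhead ScMoE's shortcut connection targets. The sketch below is a minimal single-device illustration of top-2 routing only; the function name, the ReLU expert FFN, and all shapes are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def top2_moe_forward(x, gate_w, experts):
    """Illustrative top-2 MoE forward pass (not the paper's code).

    x:       (tokens, d_model) input tokens
    gate_w:  (d_model, num_experts) router weights
    experts: list of (w1, w2) weight pairs, one FFN per expert
    """
    logits = x @ gate_w                          # router scores per expert
    top2 = np.argsort(logits, axis=1)[:, -2:]    # two best experts per token
    # softmax over only the two selected scores
    sel = np.take_along_axis(logits, top2, axis=1)
    probs = np.exp(sel - sel.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    out = np.zeros_like(x)
    for k in range(2):                           # two routing slots per token
        for e, (w1, w2) in enumerate(experts):
            mask = top2[:, k] == e               # tokens routed to expert e
            if mask.any():
                h = np.maximum(x[mask] @ w1, 0)  # expert FFN (ReLU)
                out[mask] += probs[mask, k:k+1] * (h @ w2)
    return out
```

In a distributed setting the per-expert loop becomes an all-to-all exchange across devices; it is this communication that the shortcut connection allows to overlap with computation, per the results quoted above.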
Researcher Affiliation Academia The Hong Kong University of Science and Technology (Guangzhou). Correspondence to: Jiayi Huang <EMAIL>.
Pseudocode No The paper describes the architecture and strategy using mathematical equations and figures, but no structured pseudocode or algorithm blocks are explicitly labeled or formatted as such.
Open Source Code No The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide any links to a code repository.
Open Datasets Yes Specifically, we pre-train the SwinV2-MoE models with various MoE architectures on the ImageNet-1K image classification dataset, and subsequently evaluate their accuracy on the corresponding test set. For models undergoing zero-shot evaluation on downstream tasks such as HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2021), BoolQ (Clark et al., 2019), ARC-Easy (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), RACE (Lai et al., 2017), and MathQA (Amini et al., 2019), we pre-train the models using various architectures on a 1B-token subset of the SlimPajama-627B dataset (Soboleva et al., 2023). For models evaluated on WikiText-103 (Merity et al., 2017), we conduct pre-training with different architectures on the OpenWebText dataset (Gokaslan & Cohen, 2019).
Dataset Splits No The paper mentions evaluating on a 'test set' and 'final validation loss' for various datasets (ImageNet-1K, SlimPajama-627B, OpenWebText, WikiText-103, etc.) and pre-training for specific epochs (e.g., '90 epochs'). However, it does not explicitly provide specific percentages, sample counts, or detailed methodologies for how these datasets were split into training, validation, and test sets for the experiments.
Hardware Specification Yes To assess the effectiveness of our proposed overlapping strategy for enhancing expert parallelism, we conduct experiments on three hardware configurations: 8×A30-PCIe, 8×A800-NVLink, and 16×A800-NVLink (across 2 nodes). These configurations cover scenarios with both high and low communication-to-computation ratios. Additionally, we evaluate our proposed expert offloading strategy on a configuration with a single A30-PCIe GPU.
Software Dependencies No For natural language generation (NLG) tasks, we utilize the standard implementations of GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020) and LLaMA-2 (Touvron et al., 2023) from Fairseq (Ott et al., 2019), augmented with Tutel MoE to construct GPT2-MoE, GPT3-MoE and LLaMA2-MoE models. The paper mentions software such as Fairseq and Tutel MoE, but does not provide specific version numbers for any of these components.
Experiment Setup Yes Table 8. Hyperparameters for GPT-MoE and LLaMA2-MoE models. Parameters: Num. layers, Embedding dim, Num. attention heads, Num. KV heads, Num. experts per layer, MoE frequency, Num. parameters, Context/sequence length, Capacity factor, MoE loss coefficient. Table 9. Hyperparameters for SwinV2-MoE models. Parameters: Image size, Window size, Embedding dim, Num. layers, Num. attention heads, Num. experts per layer, Batch size, Epochs, Warmup epochs, Base LR, Warmup LR, Min LR, Capacity factor, MoE loss coefficient.