BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference
Authors: Zewen Jin, Shengnan Wang, Jiaan Zhu, Hongrui Zhan, Youhui Bai, Lin Zhang, Zhenyu Ming, Cheng Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that BigMac achieves comparable or even better model quality than fine-grained MoEs with the same number of experts and a similar number of total parameters. Equally importantly, BigMac reduces the end-to-end latency by up to 3.09× for training and increases the throughput by up to 3.11× for inference on state-of-the-art AI computing frameworks including Megatron, Tutel, and DeepSpeed-Inference. [...] We intensively profile the time ratios of training and inference for GPT-Fine-Grained and GPT-BigMac, based on the state-of-the-art frameworks Megatron (Shoeybi et al. 2020), Tutel (Hwang et al. 2023), and DeepSpeed-Inference (Microsoft 2024). |
| Researcher Affiliation | Collaboration | Zewen Jin (1,2,*), Shengnan Wang (2,*), Jiaan Zhu (1,3), Hongrui Zhan (1), Youhui Bai (2), Lin Zhang (2), Zhenyu Ming (2), Cheng Li (1,3); (1) University of Science and Technology of China, (2) Huawei Technologies, (3) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center |
| Pseudocode | No | The paper does not contain any sections explicitly labeled "Pseudocode" or "Algorithm", nor does it present any structured, code-like formatted procedures. |
| Open Source Code | No | The paper mentions using existing state-of-the-art distributed training/inference frameworks like Megatron, Tutel, and DeepSpeed-Inference, but there is no explicit statement or link indicating that the authors have released their own implementation code for the BigMac methodology. |
| Open Datasets | Yes | We use the Wikipedia dataset (Wikimedia 2024) containing 3.6B tokens to train these models on Megatron (NVIDIA 2019), one of the state-of-the-art LLM training frameworks. [...] we utilized a larger dataset named the OpenWebText2 dataset (EleutherAI 2020) with 14.8B tokens. |
| Dataset Splits | No | The paper mentions training on the Wikipedia and OpenWebText2 datasets and evaluating perplexity, but it does not specify the exact training, validation, or test splits (e.g., percentages or sample counts) for these datasets. It refers to 'validation perplexity' but not the split methodology. |
| Hardware Specification | Yes | All the experiments are conducted on a cluster of 4 machines connected with 100 Gbps InfiniBand. Each machine has the same configuration and is equipped with eight GPUs. Each GPU is connected with PCIe 4.0 x16 and has 48 GB HBM, delivering up to 149.7 TFLOPS (FP16) with 96 cores. |
| Software Dependencies | No | The paper mentions using "Megatron", "Tutel", and "DeepSpeed-Inference" frameworks, but it does not provide specific version numbers for these or any other software components. |
| Experiment Setup | Yes | Table 5: Hyper-parameters of pre-training to compare the validation perplexity curves in Figure 1. This table includes specific values for #Layers, #Heads, Hidden Dimension, Sequence Length, Vocabulary Size, Global Batch Size, Dropout Rate, Load Balance Type, Balance Coefficient, Optimizer (Adam), ϵ, β, Weight Decay, Learning Rate, Minimum Learning Rate, Learning Decay Steps, Learning Rate Decay Style, Warmup Steps, Gradient Clipping, and Random Seed. |
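The "Warmup Steps", "Learning Decay Steps", "Learning Rate Decay Style", and "Minimum Learning Rate" entries from the paper's Table 5 describe a standard warmup-then-decay schedule. Below is a minimal sketch of such a schedule (linear warmup followed by cosine decay to a floor); the function name and all numeric values are illustrative assumptions, not the paper's actual settings.

```python
import math

def lr_at(step, max_lr, min_lr, warmup_steps, decay_steps):
    """Illustrative warmup + cosine-decay schedule; values are placeholders."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return max_lr * step / warmup_steps
    if step >= decay_steps:
        # After the decay window, hold at the minimum learning rate.
        return min_lr
    # Cosine decay from max_lr down to min_lr over the decay window.
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Because Table 5 only lists the hyper-parameter categories here, any concrete peak/minimum rates and step counts would need to be read from the paper itself.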