Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient

Authors: Jan Ludziejewski, Maciej Pióro, Jakub Krajewski, Maciej Stefaniak, Michał Krutul, Jan Małaśnicki, Marek Cygan, Piotr Sankowski, Kamil Adamczewski, Piotr Miłoś, Sebastian Jaszczur

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our findings provide a principled framework for selecting the optimal MoE configuration under fixed memory and compute budgets. Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom. Extensive empirical validation confirms the theoretical predictions of our scaling laws. These results offer actionable insights for designing and deploying MoE models in practical large-scale training scenarios. (...) Our conclusions are based on extensive, large-scale experiments comprising 270 models, scaled up to 5B parameters.
Researcher Affiliation | Collaboration | (1) University of Warsaw; (2) IDEAS NCBR; (3) Institute of Fundamental Technological Research, Polish Academy of Sciences; (4) Nomagic; (5) Research Institute IDEAS; (6) MIM Solutions; (7) Wroclaw University of Science and Technology; (8) Institute of Mathematics, Polish Academy of Sciences.
Pseudocode | No | The paper describes methods and formulas in text and mathematical equations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor any structured, code-like procedures.
Open Source Code | Yes | Checkpoints and inference code are available on Hugging Face. The codebase used to run the experiments can be found on GitHub.
Open Datasets | Yes | All models used in this study are decoder-only Transformers trained on the highly filtered FineWeb-Edu (Penedo et al., 2024).
Dataset Splits | No | The paper mentions using 'FineWeb-Edu' and performing experiments but does not detail how the dataset was split into training, validation, or test sets with percentages, sample counts, or references to predefined splits.
Hardware Specification | Yes | Table 2. Optimal E for different training budgets and three typical memory constraints, corresponding to an RTX 4090 GPU, an H100 GPU, and an 8x H100 GPU node.
Software Dependencies | No | The paper mentions software components such as the 'GPT-2 tokenizer', 'Switch (Fedus et al., 2022) layers', 'RoPE (Su et al., 2024)', 'SwiGLU activation (Shazeer, 2020)', and 'bfloat16' mixed-precision training, but it does not provide specific version numbers for these or for underlying software libraries (e.g., Python, PyTorch).
Experiment Setup | Yes | We use a Transformer model with Switch (Fedus et al., 2022) layers, using standard values of router z-loss 0.001 and load-balancing loss 0.01. The GPT-2 tokenizer (Radford et al., 2019) is employed. For better stability, weight initialization follows a truncated normal distribution with a reduced scale of 0.1, as suggested by Fedus et al. (2022). Mixed-precision training is used, with the attention mechanism, the RoPE position embeddings (Su et al., 2024), and the router always kept in high precision. The models use the SwiGLU activation (Shazeer, 2020) with hidden size equal to 3·d_model and activate one expert per token (unless the token is dropped due to limited capacity). (...) We increase the batch size from 64K to 128K after 0.5B training tokens and further to 256K after 1B training tokens. (...) We employ a constant learning-rate schedule with a linear warmup over the initial 130M tokens and a linear decay from the peak learning rate to 0 over the final 20% of tokens.
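The batch-size ramp and learning-rate schedule quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the total-token-budget argument, and the reading of "64K" as 64·1024 tokens are assumptions.

```python
def batch_size_at(tokens_seen):
    """Batch-size ramp described in the setup: 64K tokens per batch,
    doubled to 128K after 0.5B training tokens and to 256K after 1B.
    (64K interpreted here as 64 * 1024 tokens -- an assumption.)"""
    if tokens_seen < 0.5e9:
        return 64 * 1024
    if tokens_seen < 1e9:
        return 128 * 1024
    return 256 * 1024


def lr_at(tokens_seen, total_tokens, peak_lr,
          warmup_tokens=130e6, decay_frac=0.2):
    """Constant LR schedule with linear warmup over the first 130M
    tokens and linear decay from peak_lr to 0 over the final 20%
    of training tokens, as described in the setup."""
    decay_start = total_tokens * (1 - decay_frac)
    if tokens_seen < warmup_tokens:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * tokens_seen / warmup_tokens
    if tokens_seen > decay_start:
        # Linear decay from peak_lr to 0 over the final stretch.
        return peak_lr * (total_tokens - tokens_seen) / (total_tokens - decay_start)
    # Constant plateau in between.
    return peak_lr
```

For example, with a 10B-token budget and a peak learning rate of 1e-3, the schedule is 0 at token 0, 1e-3 from 130M through 8B tokens, and back to 0 at 10B tokens.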