Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
Authors: Jan Ludziejewski, Maciej Pióro, Jakub Krajewski, Maciej Stefaniak, Michał Krutul, Jan Małaśnicki, Marek Cygan, Piotr Sankowski, Kamil Adamczewski, Piotr Miłoś, Sebastian Jaszczur
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings provide a principled framework for selecting the optimal MoE configuration under fixed memory and compute budgets. Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom. Extensive empirical validation confirms the theoretical predictions of our scaling laws. These results offer actionable insights for designing and deploying MoE models in practical large-scale training scenarios. (...) Our conclusions are based on extensive, large-scale experiments comprising 270 models, scaled up to 5B parameters. |
| Researcher Affiliation | Collaboration | 1University of Warsaw 2IDEAS NCBR 3Institute of Fundamental Technological Research, Polish Academy of Sciences 4Nomagic 5Research Institute IDEAS 6MIM Solutions 7Wroclaw University of Science and Technology 8Institute of Mathematics, Polish Academy of Sciences. |
| Pseudocode | No | The paper describes methods and formulas in text and mathematical equations, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present any structured, code-like procedures. |
| Open Source Code | Yes | Checkpoints and inference code are available on Hugging Face. The codebase used to run the experiments can be found on GitHub. |
| Open Datasets | Yes | All models used in this study are decoder-only Transformers trained on the highly filtered FineWeb-Edu (Penedo et al., 2024). |
| Dataset Splits | No | The paper mentions using 'FineWeb-Edu' and performing experiments, but does not explicitly detail how this dataset was split into training, validation, or test sets with percentages, sample counts, or specific predefined split references. |
| Hardware Specification | Yes | Table 2. Optimal E for different training budgets and three typical memory constraints, corresponding to an RTX4090 GPU, an H100 GPU, and an 8x H100 GPU node. |
| Software Dependencies | No | The paper mentions software components like 'GPT-2 tokenizer', 'Switch (Fedus et al., 2022) layers', 'RoPE (Su et al., 2024)', 'SwiGLU activation (Shazeer, 2020)', and 'bfloat16' for mixed precision training. However, it does not provide specific version numbers for these or other underlying software libraries (e.g., Python, PyTorch). |
| Experiment Setup | Yes | We use a Transformer model with Switch (Fedus et al., 2022) layers, using standard values of router z-loss 0.001 and load balancing loss 0.01. The GPT-2 tokenizer (Radford et al., 2018) is employed. For better stability, weight initialization follows a truncated normal distribution with a reduced scale of 0.1, as suggested by (Fedus et al., 2022). Mixed precision training is used, with the attention mechanism, position embeddings RoPE (Su et al., 2024) and router always maintained at high precision. The models use the SwiGLU activation (Shazeer, 2020) with hidden size equal to 3d_model and activate one expert per token (unless the token is dropped due to limited capacity). (...) We increase the batch size from 64K to 128K after 0.5B training tokens and further to 256K after 1B training tokens. (...) We employ a constant learning rate schedule with a linear warmup over the initial 130M tokens and with a linear decay from the peak learning rate to 0 over the final 20% of tokens. |
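The batch-size ramp and learning-rate schedule quoted in the Experiment Setup row can be expressed compactly. The sketch below is illustrative only: function names, the peak learning rate, and the total token budget are assumptions for the example, not values taken from the paper's codebase.

```python
def batch_size(tokens_seen: int) -> int:
    """Token-based batch-size ramp described in the paper:
    64K tokens per batch, doubled to 128K after 0.5B training
    tokens and to 256K after 1B training tokens."""
    if tokens_seen < 500_000_000:
        return 64 * 1024
    if tokens_seen < 1_000_000_000:
        return 128 * 1024
    return 256 * 1024


def learning_rate(tokens_seen: int, peak_lr: float, total_tokens: int,
                  warmup_tokens: int = 130_000_000,
                  decay_fraction: float = 0.2) -> float:
    """Constant LR schedule with a linear warmup over the initial
    `warmup_tokens` (130M in the paper) and a linear decay from
    `peak_lr` to 0 over the final `decay_fraction` (20%) of tokens."""
    if tokens_seen < warmup_tokens:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * tokens_seen / warmup_tokens
    decay_start = total_tokens * (1.0 - decay_fraction)
    if tokens_seen > decay_start:
        # Linear decay to 0 over the final fraction of training.
        remaining = total_tokens - tokens_seen
        return peak_lr * remaining / (total_tokens - decay_start)
    return peak_lr
```

A usage example under an assumed 10B-token budget: `learning_rate(5_000_000_000, 1e-3, 10_000_000_000)` falls in the constant phase and returns the peak value.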