Continual Pre-training of MoEs: How robust is your router?

Authors: Benjamin Thérien, Charles-Étienne Joseph, Zain Sarwar, Ashwinee Panda, Anirban Das, Shi-Xiong Zhang, Stephen Rawls, Sambit Sahu, Eugene Belilovsky, Irina Rish

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In what follows, we conduct a large-scale study training a 500M parameter dense transformer and four 500M-active/2B-total parameter MoE transformers, following the Switch Transformer architecture and a granular DeepSeek-inspired architecture. Each model is trained for 600B tokens. Our results establish a surprising robustness to distribution shifts for MoEs using both Sinkhorn-Balanced and Z-and-Aux-loss-balanced routing algorithms, even in MoEs continually pre-trained without replay.
Researcher Affiliation | Collaboration | 1 Université de Montréal, Montréal, Canada; 2 Mila Quebec AI Institute, Montréal, Canada; 3 Concordia University, Montréal, Canada; 4 University of Chicago, Chicago, USA; 5 Capital One, New York, NY, USA
Pseudocode | No | The paper describes algorithms like the "Sinkhorn-Knopp algorithm" and "routing algorithms" but does not provide any structured pseudocode or algorithm blocks.
Open Source Code | No | All our experiments use code from the GPT-NeoX library (Andonian et al., 2023) and leverage the megablocks grouped GEMM kernel (Gale et al., 2023). We would like to preface this section with the following disclaimer: all the step times that we report in our study are specific to our code and the libraries that we use, but are not reflective of the best performance achievable.
Open Datasets | Yes | To initially pre-train and subsequently continually pre-train our models, we use three datasets: FineWeb (Penedo et al., 2024), the Stack (Kocetkov et al., 2023), and German Common Crawl (Abadji et al., 2022).
Dataset Splits | Yes | To initially pre-train and subsequently continually pre-train our models, we use three datasets: FineWeb (Penedo et al., 2024), the Stack (Kocetkov et al., 2023), and German Common Crawl (Abadji et al., 2022). We initially pre-train all models on FineWeb for 400B tokens (task 1)... Subsequently, we continually pre-train these base models on 200B tokens of code data and German web crawl data (task 2) using infinite learning rate schedules and replay (30% & 40%, respectively) to mitigate forgetting. Tables 8, 9, and 10 report the amount of training tokens and sampling proportions used for FineWeb, German Common Crawl, and Stack, respectively.
Hardware Specification | Yes | Each model was trained across 64 A100 GPUs using data parallelism and ZeRO-1 (Rajbhandari et al., 2020).
Software Dependencies | No | All our experiments use code from the GPT-NeoX library (Andonian et al., 2023) and leverage the megablocks grouped GEMM kernel (Gale et al., 2023). Although these libraries are mentioned, specific version numbers for them or other key software components are not provided.
Experiment Setup | Yes | All models in our study (except re-training baselines) were pre-trained for 192,720 gradient descent steps using a batch size of 1024, a sequence length of 2048, the AdamW optimizer, and the Cosine Inf schedule (Ibrahim et al., 2024). We continually pre-train the models for 95,370 gradient descent steps using the same batch size and sequence length as during pre-training. Table 11: Hyperparameters of LR schedules. Table 12: Hyperparameters of our MoEs and Dense Transformer.
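The Z-and-Aux-loss-balanced routing named in the findings combines the Switch Transformer auxiliary load-balancing loss with a router z-loss. The sketch below is an illustrative reconstruction of those two standard loss terms, not the authors' code; the function name `routing_losses` and the toy inputs are assumptions.

```python
import numpy as np

def routing_losses(logits, top1):
    """Load-balancing aux loss and router z-loss for a batch of router logits.

    logits: (tokens, experts) raw router scores.
    top1:   (tokens,) index of the expert each token was routed to.
    Aux loss: num_experts * dot(fraction of tokens per expert,
    mean router probability per expert); minimized (value 1.0) when
    both distributions are uniform. Z-loss: mean squared logsumexp
    of the logits, which discourages large router activations.
    """
    t, e = logits.shape
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    f = np.bincount(top1, minlength=e) / t   # fraction of tokens per expert
    p = probs.mean(axis=0)                   # mean router probability per expert
    aux = e * float(f @ p)
    z = float(np.mean(np.log(np.exp(logits).sum(axis=1)) ** 2))
    return aux, z

# A perfectly uniform router with evenly spread tokens attains the minimum.
aux, z = routing_losses(np.zeros((8, 4)), np.arange(8) % 4)
print(aux)  # 1.0 at uniform load
```

In training, both terms would be added to the language-modeling loss with small coefficients; the coefficients themselves are hyperparameters not reproduced here.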
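The Sinkhorn-Knopp algorithm behind the paper's Sinkhorn-Balanced routing can be illustrated with a minimal sketch: alternately normalizing the columns and rows of the router's score matrix drives it toward a doubly-stochastic assignment, so expert loads equalize. This is a generic sketch of the technique, not the paper's implementation; the function name `sinkhorn` and the iteration count are assumptions.

```python
import numpy as np

def sinkhorn(scores, n_iters=10):
    """Balance a (tokens x experts) score matrix via Sinkhorn-Knopp.

    Alternates column normalization (equalizing total mass per expert)
    with row normalization (making each token's routing weights sum to 1),
    converging toward a doubly-stochastic assignment.
    """
    p = np.exp(scores - scores.max(axis=1, keepdims=True))  # ensure positivity
    for _ in range(n_iters):
        p = p / p.sum(axis=0, keepdims=True)  # balance expert loads (columns)
        p = p / p.sum(axis=1, keepdims=True)  # renormalize per token (rows)
    return p

rng = np.random.default_rng(0)
probs = sinkhorn(rng.normal(size=(8, 4)))
print(probs.sum(axis=1))  # each token's routing weights sum to ~1
print(probs.sum(axis=0))  # expert loads approach tokens/experts = 2 each
```

A router would then take the top-1 expert per token from the balanced matrix; that selection step is omitted here for brevity.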
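The replay proportions cited above (30% for code, 40% for German) amount to sampling each continual pre-training batch partly from the task-1 distribution. A hedged sketch of that mixing, with made-up stream names and a fixed seed purely for illustration:

```python
import random

def replay_mixture(new_stream, old_stream, replay_frac, rng):
    """Yield training examples mostly from the new task, mixing in a
    fraction of replayed task-1 examples to mitigate forgetting."""
    while True:
        src = old_stream if rng.random() < replay_frac else new_stream
        yield next(src)

# Illustrative: 30% replay of (stand-in) task-1 data during task 2.
rng = random.Random(0)
stream = replay_mixture(iter(range(10**9)),                      # new-task stand-in
                        iter(f"old{i}" for i in range(10**9)),   # replay stand-in
                        0.3, rng)
samples = [next(stream) for _ in range(10000)]
frac = sum(isinstance(s, str) for s in samples) / len(samples)
print(frac)  # close to the requested 0.3 replay fraction
```

The paper's actual pipeline samples documents by the proportions in its Tables 8-10; this sketch only shows the mixing mechanism.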
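The Cosine Inf schedule cited in the setup belongs to the family of infinite learning-rate schedules: warm up, decay toward a constant floor, then hold that rate so continual pre-training on a new task can resume without a disruptive full re-warmup. The sketch below is illustrative only; the phase lengths and rates are assumptions, not the values from the paper's Table 11.

```python
import math

def cosine_inf_lr(step, warmup=100, decay_end=1000, lr_max=3e-4, lr_const=3e-5):
    """Illustrative infinite LR schedule: linear warmup, cosine decay to a
    constant floor, then hold the floor indefinitely."""
    if step < warmup:
        return lr_max * step / warmup            # linear warmup
    if step < decay_end:
        frac = (step - warmup) / (decay_end - warmup)
        return lr_const + 0.5 * (lr_max - lr_const) * (1 + math.cos(math.pi * frac))
    return lr_const                              # constant phase: never decays to zero

print(cosine_inf_lr(50))    # mid-warmup: half of lr_max
print(cosine_inf_lr(5000))  # constant phase: lr_const
```

Ibrahim et al. (2024) also describe a final annealing phase before checkpointing; that refinement is omitted from this sketch.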