Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts

Authors: Xu Liu, Juncheng Liu, Gerald Woo, Taha Aksu, Yuxuan Liang, Roger Zimmermann, Chenghao Liu, Junnan Li, Silvio Savarese, Caiming Xiong, Doyen Sahoo

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive evaluations on 39 datasets demonstrate the superiority of MOIRAI-MOE over state-of-the-art foundation models. This study also conducts comprehensive model analyses to explore the inner workings of time series MoE foundation models.
Researcher Affiliation Collaboration 1Salesforce AI Research 2National University of Singapore 3The Hong Kong University of Science and Technology (Guangzhou). Correspondence to: Chenghao Liu <EMAIL>.
Pseudocode No The paper describes the methodology using textual explanations and mathematical equations (e.g., equations 1-7) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No The paper does not provide any concrete statement about the release of source code or a link to a code repository.
Open Datasets Yes Extensive evaluations on 39 datasets demonstrate the superiority of MOIRAI-MOE over state-of-the-art foundation models. This study also conducts comprehensive model analyses to explore the inner workings of time series MoE foundation models. In-distribution evaluation uses 29 datasets from the Monash benchmark (Godahewa et al., 2021). Zero-shot evaluations are conducted on the datasets listed in Table 7, which cover five domains and span frequencies ranging from minute-level to weekly.
Dataset Splits Yes We use a non-overlapping rolling window approach, where the stride equals the prediction length. The test set consists of the last h × r time steps, where h is the forecast horizon and r is the number of rolling evaluation windows. The validation set is defined as the last forecast horizon before the test set, while the training set includes all preceding data.
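The split described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's released code; the function and argument names (`split_series`, `horizon`, `num_windows`) are assumptions.

```python
def split_series(series, horizon, num_windows):
    """Non-overlapping rolling-window split with stride = prediction length.

    Test = last horizon * num_windows steps, sliced into num_windows
    evaluation windows; validation = the single horizon immediately
    before the test set; train = all preceding data.
    """
    test_len = horizon * num_windows
    test = series[-test_len:]
    val = series[-(test_len + horizon):-test_len]
    train = series[:-(test_len + horizon)]
    # Non-overlapping evaluation windows within the test segment
    windows = [test[i * horizon:(i + 1) * horizon] for i in range(num_windows)]
    return train, val, windows
```

For example, a 100-step series with h = 10 and r = 3 yields a 60-step training set, a 10-step validation horizon, and three 10-step test windows.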
Hardware Specification Yes All MOIRAI-MOE models are trained on 16 A100 (40G) GPUs using a batch size of 1,024 and bfloat16 precision.
Software Dependencies No The paper mentions using AdamW optimizer and a learning rate scheduler but does not specify versions for any programming languages or libraries (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup Yes To ensure a fair comparison with MOIRAI in terms of activated parameters, we set the number of activated experts to 2 for MOIRAI-MOE, resulting in 11M/86M activated parameters per token for MOIRAI-MOE_S/MOIRAI-MOE_B, closely matching the dense models MOIRAI_S/MOIRAI_B, which contain 14M/91M activated parameters. The total number of experts is set to 32... All MOIRAI-MOE models are trained on 16 A100 (40G) GPUs using a batch size of 1,024 and bfloat16 precision. The small and base models are trained for 50,000 and 250,000 steps on LOTSA (Woo et al., 2024), respectively. The patch size P is set to 16 and the masking ratio r for next-token prediction pretraining is 0.3. For optimization, we utilize the AdamW optimizer with lr = 1e-3, weight decay = 1e-1, β1 = 0.9, β2 = 0.98. We also apply a learning rate scheduler with linear warmup for the first 10,000 steps, followed by cosine annealing.
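The learning rate schedule quoted above (linear warmup for 10,000 steps to lr = 1e-3, then cosine annealing) can be sketched as a plain function of the step count. This is a hedged reconstruction: the paper does not specify the annealing floor or total schedule length per model, so the zero floor and the `total_steps` default below are assumptions.

```python
import math

def lr_at_step(step, base_lr=1e-3, warmup_steps=10_000, total_steps=50_000):
    """Linear warmup to base_lr, then cosine annealing to zero.

    Assumes a zero final learning rate; the actual scheduler used for
    MOIRAI-MOE may use a nonzero floor or a different total length.
    """
    if step < warmup_steps:
        # Linear warmup: lr rises from 0 to base_lr over warmup_steps
        return base_lr * step / warmup_steps
    # Cosine annealing from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

The schedule peaks exactly at the end of warmup (lr = 1e-3 at step 10,000) and decays smoothly to zero at the final step.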