Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts
Authors: Xu Liu, Juncheng Liu, Gerald Woo, Taha Aksu, Yuxuan Liang, Roger Zimmermann, Chenghao Liu, Junnan Li, Silvio Savarese, Caiming Xiong, Doyen Sahoo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on 39 datasets demonstrate the superiority of MOIRAI-MOE over state-of-the-art foundation models. This study also conducts comprehensive model analyses to explore the inner workings of time series MoE foundation models. |
| Researcher Affiliation | Collaboration | ¹Salesforce AI Research, ²National University of Singapore, ³The Hong Kong University of Science and Technology (Guangzhou). Correspondence to: Chenghao Liu <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations (e.g., equations 1-7) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete statement about the release of source code or a link to a code repository. |
| Open Datasets | Yes | Extensive evaluations on 39 datasets demonstrate the superiority of MOIRAI-MOE over state-of-the-art foundation models. In-distribution evaluation uses 29 datasets from the Monash benchmark (Godahewa et al., 2021). We conduct zero-shot evaluations on the datasets listed in Table 7, which cover five domains and span frequencies ranging from minute-level to weekly. |
| Dataset Splits | Yes | We use a non-overlapping rolling window approach, where the stride equals the prediction length. The test set consists of the last h × r time steps, where h is the forecast horizon and r is the number of rolling evaluation windows. The validation set is defined as the last forecast horizon before the test set, while the training set includes all preceding data. |
| Hardware Specification | Yes | All MOIRAI-MOE models are trained on 16 A100 (40G) GPUs using a batch size of 1,024 and bfloat16 precision. |
| Software Dependencies | No | The paper mentions using AdamW optimizer and a learning rate scheduler but does not specify versions for any programming languages or libraries (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | To ensure a fair comparison with MOIRAI in terms of activated parameters, we set the number of activated experts to 2 for MOIRAI-MOE, resulting in 11M/86M activated parameters per token for MOIRAI-MOE_S/MOIRAI-MOE_B, closely matching the dense models MOIRAI_S/MOIRAI_B that contain 14M/91M activated parameters. The total number of experts is set to 32... All MOIRAI-MOE models are trained on 16 A100 (40G) GPUs using a batch size of 1,024 and bfloat16 precision. The small and base models are trained for 50,000 and 250,000 steps on LOTSA (Woo et al., 2024), respectively. The patch size P is set to 16 and the masking ratio r for next-token prediction pretraining is 0.3. For optimization, we utilize the AdamW optimizer with lr = 1e-3, weight decay = 1e-1, β1 = 0.9, β2 = 0.98. We also apply a learning rate scheduler with linear warmup for the first 10,000 steps, followed by cosine annealing. |
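The dataset-split protocol quoted above (non-overlapping rolling test windows with stride equal to the forecast horizon, a one-horizon validation slice, and training on all preceding data) can be sketched as follows. The helper name and array layout are illustrative assumptions, not from the paper:

```python
import numpy as np

def rolling_split(series: np.ndarray, h: int, r: int):
    """Split a 1-D series per the protocol described in the paper:
    test = last h*r steps, evaluated as r non-overlapping windows of
    length h (stride = h); validation = the h steps just before the
    test region; training = everything earlier."""
    n = len(series)
    test_start = n - h * r
    val_start = test_start - h
    train = series[:val_start]
    val = series[val_start:test_start]
    # Each test window is (window start index, target slice of length h).
    test_windows = [
        (test_start + i * h, series[test_start + i * h : test_start + (i + 1) * h])
        for i in range(r)
    ]
    return train, val, test_windows

series = np.arange(100)
train, val, windows = rolling_split(series, h=10, r=3)
# With n=100, h=10, r=3: 60 training points, 10 validation points,
# and 3 disjoint test windows of 10 points each.
```

Because the stride equals the prediction length, the r test windows tile the test region exactly, so no target time step is scored twice.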
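The reported optimization schedule (lr = 1e-3, linear warmup for the first 10,000 steps, then cosine annealing) can be sketched as a step-to-learning-rate function. The total-step and minimum-LR defaults here are illustrative assumptions; the paper specifies only the base LR and warmup length:

```python
import math

def lr_at_step(step: int, base_lr: float = 1e-3,
               warmup_steps: int = 10_000, total_steps: int = 50_000,
               min_lr: float = 0.0) -> float:
    """Linear warmup from 0 to base_lr, then cosine annealing to min_lr."""
    if step < warmup_steps:
        # Linear warmup phase.
        return base_lr * step / warmup_steps
    # Cosine annealing phase: progress goes 0 -> 1 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In a training loop this would typically be applied per optimizer step, e.g. via a `LambdaLR`-style scheduler; the peak LR is reached exactly at step 10,000 and decays smoothly to `min_lr` thereafter.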