Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
Authors: Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, Ming Jin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Compared to dense models with the same number of activated parameters or equivalent computation budgets, our models consistently outperform them by a large margin. These advancements position TIME-MOE as a state-of-the-art solution for tackling real-world time series forecasting challenges with superior capability, efficiency, and flexibility. |
| Researcher Affiliation | Collaboration | Xiaoming Shi[1], Shiyu Wang, Yuqi Nie[2], Dianqi Li, Zhou Ye, Qingsong Wen[3], Ming Jin[4]; [1] Xiaohongshu Inc, [2] Princeton University, [3] Squirrel Ai Learning, [4] Griffith University |
| Pseudocode | Yes | Algorithm 1 Scheduling for the Multi-resolution Forecasting |
| Open Source Code | Yes | Resources: https://github.com/Time-MoE/Time-MoE Our TIME-MOE models and Time-300B data collection are open-sourced. |
| Open Datasets | Yes | We introduce Time-300B, the largest open-access time series data collection, comprising over 300 billion time points spanning more than nine domains, accompanied by a well-designed data-cleaning pipeline. Our TIME-MOE models and Time-300B data collection are open-sourced. Table 1: Key statistics of the pre-training dataset Time-300B from various domains. Table 10: Datasets and key properties from Time-300B. |
| Dataset Splits | Yes | Table 9: Detailed dataset descriptions. Dataset sizes are listed as (Train, Validation, Test). E.g., ETTm1: 7 variates, horizons {96, 192, 336, 720}, splits (34465, 11521, 11521), 15-min frequency, Temperature domain |
| Hardware Specification | Yes | Training is performed on 128 NVIDIA A100-80G GPUs with BF16 precision. |
| Software Dependencies | No | The paper mentions BF16 precision, AdamW optimizer, and flash-attention, but does not provide specific version numbers for software libraries or frameworks (e.g., PyTorch 1.x, Python 3.x). |
| Experiment Setup | Yes | Each model is trained for 100,000 steps with a batch size of 1,024, and a maximum sequence length capped at 4,096. This setup processes 4 million time points per iteration. We use forecast horizons of {1, 8, 32, 64} in the output projection and set the auxiliary loss factor α to 0.02. For optimization, we apply the AdamW optimizer with the following hyperparameters: lr = 1e-3, weight decay = 1e-1, β1 = 0.9, and β2 = 0.95. A learning rate scheduler with a linear warmup for the first 10,000 steps, followed by cosine annealing, is used. |
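The "Pseudocode" row above cites Algorithm 1, a scheduler that composes the {1, 8, 32, 64} output heads into arbitrary-length forecasts. A minimal sketch of one plausible greedy version follows; the head horizons come from the paper, but the greedy decomposition itself is an assumption for illustration and may differ from the paper's exact procedure.

```python
# Greedy multi-resolution scheduling sketch (assumed procedure, not the
# paper's verbatim Algorithm 1): to forecast H steps using output heads
# with horizons {1, 8, 32, 64}, repeatedly take the largest head horizon
# that does not overshoot the remaining steps.

HEAD_HORIZONS = (64, 32, 8, 1)  # available forecast heads, largest first


def schedule(target_horizon: int) -> list[int]:
    """Decompose a target horizon into a sequence of head horizons."""
    plan = []
    remaining = target_horizon
    while remaining > 0:
        # pick the largest head that still fits in the remaining horizon
        step = next(h for h in HEAD_HORIZONS if h <= remaining)
        plan.append(step)
        remaining -= step
    return plan
```

For example, a 96-step forecast would be served by one 64-step and one 32-step head call, so each autoregressive iteration emits a large chunk rather than a single point.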
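The learning-rate schedule in the setup row (linear warmup for 10,000 of 100,000 steps, then cosine annealing, peak lr = 1e-3) can be sketched as a pure function of the step index; annealing to zero and the use of this function as a per-step multiplier are assumptions, since the paper excerpt does not state the floor value.

```python
import math

# Constants taken from the reported setup; the zero final LR is an assumption.
PEAK_LR = 1e-3
TOTAL_STEPS = 100_000
WARMUP_STEPS = 10_000


def learning_rate(step: int) -> float:
    """Learning rate at a given training step: linear warmup, cosine decay."""
    if step < WARMUP_STEPS:
        # linear warmup from 0 to PEAK_LR
        return PEAK_LR * step / WARMUP_STEPS
    # cosine annealing from PEAK_LR down to 0 over the remaining steps
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In a framework such as PyTorch this would typically be wired in by dividing out `PEAK_LR` and passing the result to `LambdaLR` as a multiplicative factor.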