Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
Authors: Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, Ming Jin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Compared to dense models with the same number of activated parameters or equivalent computation budgets, our models consistently outperform them by a large margin. These advancements position TIME-MOE as a state-of-the-art solution for tackling real-world time series forecasting challenges with superior capability, efficiency, and flexibility. |
| Researcher Affiliation | Collaboration | Xiaoming Shi[1], Shiyu Wang, Yuqi Nie[2], Dianqi Li, Zhou Ye, Qingsong Wen[3], Ming Jin[4]; [1] Xiaohongshu Inc, [2] Princeton University, [3] Squirrel Ai Learning, [4] Griffith University |
| Pseudocode | Yes | Algorithm 1 Scheduling for the Multi-resolution Forecasting |
| Open Source Code | Yes | Resources: https://github.com/Time-MoE/Time-MoE Our TIME-MOE models and Time-300B data collection are open-sourced. |
| Open Datasets | Yes | We introduce Time-300B, the largest open-access time series data collection, comprising over 300 billion time points spanning more than nine domains, accompanied by a well-designed data-cleaning pipeline. Our TIME-MOE models and Time-300B data collection are open-sourced. Table 1: Key statistics of the pre-training dataset Time-300B from various domains. Table 10: Datasets and key properties from Time-300B. |
| Dataset Splits | Yes | Table 9: Detailed dataset descriptions. Dataset sizes are listed as (Train, Validation, Test). E.g., ETTm1: 7 variates, horizons {96, 192, 336, 720}, splits (34465, 11521, 11521), 15-min frequency, Temperature domain |
| Hardware Specification | Yes | Training is performed on 128 NVIDIA A100-80G GPUs with BF16 precision. |
| Software Dependencies | No | The paper mentions BF16 precision, AdamW optimizer, and flash-attention, but does not provide specific version numbers for software libraries or frameworks (e.g., PyTorch 1.x, Python 3.x). |
| Experiment Setup | Yes | Each model is trained for 100,000 steps with a batch size of 1,024, and a maximum sequence length capped at 4,096. This setup processes 4 million time points per iteration. We use forecast horizons of {1, 8, 32, 64} in the output projection and set the auxiliary loss factor α to 0.02. For optimization, we apply the AdamW optimizer with the following hyperparameters: lr = 1e-3, weight decay = 1e-1, β1 = 0.9, and β2 = 0.95. A learning rate scheduler with a linear warmup for the first 10,000 steps, followed by cosine annealing, is used. |
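The "Pseudocode" row above cites Algorithm 1, a scheduler that composes the {1, 8, 32, 64} output heads into arbitrary-length forecasts. A minimal sketch of one plausible greedy version follows; the head horizons come from the paper, but the greedy decomposition itself is an assumption for illustration and may differ from the paper's exact procedure.

```python
# Greedy multi-resolution scheduling sketch (assumed procedure, not the
# paper's verbatim Algorithm 1): to forecast H steps using output heads
# with horizons {1, 8, 32, 64}, repeatedly take the largest head horizon
# that does not overshoot the remaining steps.

HEAD_HORIZONS = (64, 32, 8, 1)  # available forecast heads, largest first


def schedule(target_horizon: int) -> list[int]:
    """Decompose a target horizon into a sequence of head horizons."""
    plan = []
    remaining = target_horizon
    while remaining > 0:
        # pick the largest head that still fits in the remaining horizon
        step = next(h for h in HEAD_HORIZONS if h <= remaining)
        plan.append(step)
        remaining -= step
    return plan
```

For example, a 96-step forecast would be served by one 64-step and one 32-step head call, so each autoregressive iteration emits a large chunk rather than a single point.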
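The learning-rate schedule in the setup row (linear warmup for 10,000 of 100,000 steps, then cosine annealing, peak lr = 1e-3) can be sketched as a pure function of the step index; annealing to zero and the use of this function as a per-step multiplier are assumptions, since the paper excerpt does not state the floor value.

```python
import math

# Constants taken from the reported setup; the zero final LR is an assumption.
PEAK_LR = 1e-3
TOTAL_STEPS = 100_000
WARMUP_STEPS = 10_000


def learning_rate(step: int) -> float:
    """Learning rate at a given training step: linear warmup, cosine decay."""
    if step < WARMUP_STEPS:
        # linear warmup from 0 to PEAK_LR
        return PEAK_LR * step / WARMUP_STEPS
    # cosine annealing from PEAK_LR down to 0 over the remaining steps
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In a framework such as PyTorch this would typically be wired in by dividing out `PEAK_LR` and passing the result to `LambdaLR` as a multiplicative factor.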