Towards Neural Scaling Laws for Time Series Foundation Models

Authors: Qingren Yao, Chao-Han Huck Yang, Renhe Jiang, Yuxuan Liang, Ming Jin, Shirui Pan

ICLR 2025

Reproducibility Assessment (each variable lists the result, followed by the supporting LLM response)
Research Type: Experimental — "Our experiments reveal that the negative log-likelihood of TSFMs exhibits similar scaling behavior in both OOD and ID settings. We further compare the scaling properties across different architectures, incorporating two state-of-the-art TSFMs as case studies, showing that model architecture plays a significant role in scaling."
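The scaling behavior mentioned above is typically summarized by fitting a power law L(N) = a * N^(-alpha) to the loss as a function of scale. A minimal sketch of such a fit, using made-up (model size, NLL) pairs rather than the paper's actual measurements:

```python
import numpy as np

# Hypothetical (parameter count, validation NLL) pairs for illustration only;
# these are NOT the paper's numbers.
params = np.array([1e6, 1e7, 1e8, 1e9])
nll = np.array([1.20, 0.95, 0.78, 0.66])

# Fit L(N) = a * N^(-alpha) by linear regression in log-log space:
# log L = log a - alpha * log N.
slope, log_a = np.polyfit(np.log(params), np.log(nll), 1)
alpha = -slope  # scaling exponent (positive when loss falls with scale)
```

Comparing the exponent `alpha` fitted on ID validation data with one fitted on OOD test data is one straightforward way to check whether the two settings scale similarly.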
Researcher Affiliation: Collaboration — "Qingren Yao [1,2], Chao-Han Huck Yang [3], Renhe Jiang [4], Yuxuan Liang [2], Ming Jin [1], Shirui Pan [1]; [1] Griffith University, [2] The Hong Kong University of Science and Technology (Guangzhou), [3] NVIDIA Research, [4] The University of Tokyo"
Pseudocode: No — "The paper describes methods and processes in narrative text and mathematical formulations but does not contain any clearly labeled pseudocode or algorithm blocks."
Open Source Code: Yes — "The source code and related resources of this work are available at https://github.com/Qingrenn/TSFM-ScalingLaws for reproducibility."
Open Datasets: Yes — "To this end, we constructed our time series corpus for TSFM pre-training from the large-scale open time series archive, LOTSA (Woo et al., 2024). The corpus comprises approximately 17B time points from 39 datasets spanning seven distinct domains. ... A detailed breakdown of the data sources is provided in Appendix A, with a summary in Table 1."
Dataset Splits: Yes — "For each subset, 95% of the data was allocated for model training, with the remaining 5% reserved as a validation set to evaluate in-distribution forecasting performance. Additionally, we used a subset from a widely recognized long-sequence prediction benchmark (Wu et al., 2023) to test the model's out-of-distribution forecasting capabilities. To further enhance the reliability, we also incorporated a subset of the Monash dataset (Godahewa et al., 2021) as additional OOD test data."
Hardware Specification: No — "The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or memory specifications)."
Software Dependencies: No — "The paper mentions using the AdamW optimizer but does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used in the implementation."
Experiment Setup: Yes — "Our training objective is to optimize the mixture distribution log-likelihood. We utilize the AdamW optimizer with a batch size of 128, and a maximum learning rate of 10^-3 with a linear warm-up of 10^4 training steps, followed by cosine decay for the remaining 9×10^4 steps. ... In our baseline models, the patch size P is set to 32. ... We sample 15%–50% lengths as forecast horizon and the remaining as context horizon, for a given time series."
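The learning-rate schedule quoted above (linear warm-up to 10^-3 over 10^4 steps, then cosine decay over the remaining 9×10^4 steps) can be sketched as a plain step-to-LR function; decaying to exactly zero at the final step is an assumption, since the paper does not state a floor value:

```python
import math

PEAK_LR = 1e-3        # maximum learning rate from the paper
WARMUP_STEPS = 10_000  # linear warm-up phase (10^4 steps)
DECAY_STEPS = 90_000   # cosine decay phase (9 * 10^4 steps)

def lr_at(step):
    """Learning rate at a given training step under warm-up + cosine decay."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak down to 0 (assumed floor) over DECAY_STEPS.
    progress = min((step - WARMUP_STEPS) / DECAY_STEPS, 1.0)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop this function would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR` as a multiplicative factor on the base rate.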