Towards Neural Scaling Laws for Time Series Foundation Models
Authors: Qingren Yao, Chao-Han Huck Yang, Renhe Jiang, Yuxuan Liang, Ming Jin, Shirui Pan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments reveal that the negative log-likelihood of TSFMs exhibits similar scaling behavior in both OOD and ID settings. We further compare the scaling properties across different architectures, incorporating two state-of-the-art TSFMs as case studies, showing that model architecture plays a significant role in scaling. |
| Researcher Affiliation | Collaboration | Qingren Yao1,2, Chao-Han Huck Yang3, Renhe Jiang4, Yuxuan Liang2, Ming Jin1, Shirui Pan1 1Griffith University 2The Hong Kong University of Science and Technology (Guangzhou) 3NVIDIA Research 4The University of Tokyo |
| Pseudocode | No | The paper describes methods and processes in narrative text and mathematical formulations but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and related source of this work are available at https://github.com/Qingrenn/TSFM-ScalingLaws for reproducibility. |
| Open Datasets | Yes | To this end, we constructed our time series corpus for TSFM pre-training from the large-scale open time series archive, LOTSA (Woo et al., 2024). The corpus comprises approximately 17B time points from 39 datasets spanning seven distinct domains. ... A detailed breakdown of the data sources is provided in Appendix A, with a summary in Table 1. |
| Dataset Splits | Yes | For each subset, 95% of the data was allocated for model training, with the remaining 5% reserved as a validation set to evaluate in-distribution forecasting performance. Additionally, we used a subset from a widely recognized long-sequence prediction benchmark (Wu et al., 2023) to test the model's out-of-distribution forecasting capabilities. To further enhance the reliability, we also incorporated a subset of the Monash dataset (Godahewa et al., 2021) as additional OOD test data. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | The paper mentions using the "AdamW optimizer" but does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used in the implementation. |
| Experiment Setup | Yes | Our training objective is to optimize the mixture distribution log-likelihood. We utilize the AdamW optimizer with a batch size of 128, and a maximum learning rate of 10^-3 with a linear warm-up of 10^4 training steps, followed by cosine decay for the remaining 9×10^4 steps. ... In our baseline models, the patch size P is set to 32. ... We sample 15%–50% lengths as forecast horizon and the remaining as context horizon, for a given time series. |
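The learning-rate schedule quoted above (linear warm-up over 10^4 steps to a peak of 10^-3, then cosine decay over the remaining 9×10^4 steps) can be sketched as a step-to-LR function. This is a minimal illustration, not the authors' code: the paper does not state a final learning-rate floor, so this sketch assumes decay to zero.

```python
import math

def lr_schedule(step: int,
                max_lr: float = 1e-3,      # peak LR quoted in the setup
                warmup_steps: int = 10_000,  # 10^4 linear warm-up steps
                decay_steps: int = 90_000):  # 9x10^4 cosine-decay steps
    """Linear warm-up to max_lr, then cosine decay (assumed floor of 0)."""
    if step < warmup_steps:
        # Linear ramp from 0 to max_lr over the warm-up phase.
        return max_lr * step / warmup_steps
    # Cosine decay over the remaining steps, clamped at the end of training.
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```

In practice this would be wrapped in an optimizer scheduler (e.g. a PyTorch `LambdaLR`); the standalone function just makes the two phases of the quoted schedule explicit.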