Sundial: A Family of Highly Capable Time Series Foundation Models

Authors: Yong Liu, Guo Qin, Zhiyuan Shi, Zhi Chen, Caiyin Yang, Xiangdong Huang, Jianmin Wang, Mingsheng Long

ICML 2025

Reproducibility checklist (Variable / Result / LLM Response):
Research Type: Experimental
"We evaluate Sundial on best-recognized zero-shot forecasting benchmarks (Section 5.1) and investigate the scaling behavior of Sundial (Section 5.2). We compare TimeFlow with other training objectives (Section 5.3). We delve into test-time calibration of generative forecasters (Section 5.4). We conduct model adaptation of Sundial, i.e., instruction tuning (Section 5.5), and provide in-depth ablation studies to evaluate our modular enhancement (Section 5.6)."
Researcher Affiliation: Academia
"School of Software, BNRist, Tsinghua University. Yong Liu <EMAIL>. Guo Qin <EMAIL>. Correspondence to: Mingsheng Long <EMAIL>."
Pseudocode: Yes
"Algorithm 1 TimeFlow Loss: Sampling
Require: condition h_i ∈ R^D, path steps K.
1: Sample initial noise ŷ_i ~ N(0, I).
2: Δt = 1/K
3: for k in {0, 1, ..., K−1} do
4:   ŷ_i ← ŷ_i + FM-Net(ŷ_i, k·Δt, h_i) · Δt
5: end for
6: Return: ŷ_i"
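The sampling loop above is a plain fixed-step Euler integration of the learned flow from Gaussian noise to a forecast sample. A minimal NumPy sketch follows; `toy_fm_net` is a hypothetical stand-in for the paper's trained FM-Net, used only so the loop runs end to end.

```python
import numpy as np

def sample_time_flow(condition, fm_net, K=50, dim=16, rng=None):
    """Euler-integrate the flow from noise to a sample, mirroring Algorithm 1.

    fm_net(y, t, condition) predicts a velocity; K steps of size 1/K are taken.
    """
    rng = np.random.default_rng(rng)
    y = rng.standard_normal(dim)          # step 1: initial noise ~ N(0, I)
    dt = 1.0 / K                          # step 2: uniform path step
    for k in range(K):                    # steps 3-5: Euler updates
        y = y + fm_net(y, k * dt, condition) * dt
    return y                              # step 6

# Hypothetical velocity field: pull the sample toward the condition vector.
def toy_fm_net(y, t, h):
    return h - y

h = np.ones(16)
sample = sample_time_flow(h, toy_fm_net, K=50, dim=16, rng=0)
```

With a real FM-Net, repeating this loop from different initial noise yields the multiple forecast trajectories that the paper's test-time calibration operates on.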
Open Source Code: Yes
"Code is available at: https://github.com/thuml/Sundial."
Open Datasets: Yes
"We collected and curated TimeBench, which comprises over a trillion time points from various sources, as shown in Figure 3. Several datasets originate from research teams (Woo et al., 2024; Ansari et al., 2024; Liu et al., 2024a;b). ... The statistical details of TimeBench are summarized in Table 4. In addition to open-source datasets from research teams on time series foundation models (Woo et al., 2024; Ansari et al., 2024; Liu et al., 2024b;a), we collected substantial real-world time series from various domains such as finance, IoT, meteorology, and healthcare (Goldberger et al., 2000)."
Dataset Splits: No
"Metrics (MSE/MAE) are calculated from all predicted windows in the test split of each dataset following Liu et al. (2024a). To prevent data leakage, we exclude all datasets evaluated in Section 5.1 to make sure that Sundial conducts zero-shot forecasting."
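As an illustration of that evaluation protocol, here is a sketch of accumulating MSE/MAE over every prediction window of a test split. The stride and the `forecast_fn` interface are assumptions for the sketch, not the paper's exact implementation.

```python
import numpy as np

def rolling_eval(series, context_len, pred_len, forecast_fn, stride=1):
    """Average MSE/MAE over all (context, target) windows of a test series."""
    se, ae, n = 0.0, 0.0, 0
    for start in range(0, len(series) - context_len - pred_len + 1, stride):
        ctx = series[start:start + context_len]
        target = series[start + context_len:start + context_len + pred_len]
        pred = forecast_fn(ctx, pred_len)
        se += float(np.sum((pred - target) ** 2))  # squared errors
        ae += float(np.sum(np.abs(pred - target)))  # absolute errors
        n += pred_len
    return se / n, ae / n

# Naive last-value forecaster, just to exercise the loop.
naive = lambda ctx, F: np.full(F, ctx[-1])
mse, mae = rolling_eval(np.arange(20.0), context_len=4, pred_len=2, forecast_fn=naive)
```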
Hardware Specification: Yes
"All experiments are implemented using PyTorch (Paszke et al., 2019) and executed with 32 NVIDIA A100 GPUs."
Software Dependencies: No
"All experiments are implemented using PyTorch (Paszke et al., 2019) and executed with 32 NVIDIA A100 GPUs."
Experiment Setup: Yes
"On the FEV leaderboard (Ansari et al., 2024), which consists of short-term forecasting datasets, we train Sundial models by TimeFlow Loss with the prediction length of F = 16. For point forecasting (Wu et al., 2022) and GIFT-Eval (Aksu et al., 2024), which consist of forecasting datasets with prediction lengths ranging from 6 to 900, we train Sundial models by TimeFlow Loss with the prediction length of F = 720. ... The sampling step is fixed as K = 50. Configurations of Sundial in different sizes are provided in Table 5."
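For quick reference, the hyperparameters quoted above can be collected into a small config. The key names here are illustrative; only the values come from the excerpt.

```python
# Evaluation setups quoted from the paper; key names are illustrative.
SUNDIAL_EVAL_SETUPS = {
    "fev_leaderboard": {"prediction_length": 16, "sampling_steps": 50},
    "point_forecasting_and_gift_eval": {"prediction_length": 720, "sampling_steps": 50},
}
```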