Chronos: Learning the Language of Time Series

Authors: Abdul Fatir Ansari, Lorenzo Stella, Ali Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, Jasper Zschiegner, Danielle C. Maddix, Hao Wang, Michael W. Mahoney, Kari Torkkola, Andrew Gordon Wilson, Michael Bohlke-Schneider, Bernie Wang

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our comprehensive evaluation across 42 datasets establishes Chronos as a benchmark for both in-domain and zero-shot forecasting, surpassing both traditional models and task-specific deep learning approaches. Notably, Chronos achieves impressive zero-shot forecasting performance out of the box, without necessitating task-specific adjustments.
Researcher Affiliation | Collaboration | Abdul Fatir Ansari1, Lorenzo Stella1, Caner Turkmen1, Xiyuan Zhang3, Pedro Mercado1, Huibin Shen1, Oleksandr Shchur1, Syama Sundar Rangapuram1, Sebastian Pineda Arango4, Shubham Kapoor1, Jasper Zschiegner, Danielle C. Maddix1, Hao Wang1,5, Michael W. Mahoney2,6, Kari Torkkola2, Andrew Gordon Wilson2,7, Michael Bohlke-Schneider1, Yuyang Wang1. 1AWS AI Labs, 2Amazon Supply Chain Optimization Technologies, 3UC San Diego, 4University of Freiburg, 5Rutgers University, 6UC Berkeley, 7New York University
Pseudocode | Yes | The complete pseudocode of TSMixup can be found in Algorithm 1 in Appendix A. Intuitively, TSMixup enhances the diversity of data by combining patterns from different time series. Figure 2 shows example augmentations generated by TSMixup and illustrates how different patterns are mixed. ... see Algorithm 1 and Algorithm 2 in Appendix A
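For intuition, the mixing step the paper describes (a convex combination of up to K mean-scaled windows, with Dirichlet-sampled weights) can be sketched as follows. This is a minimal illustration written against the description above, not the paper's Algorithm 1; the function names and the Dirichlet concentration value are illustrative assumptions.

```python
import random

def mean_scale(x):
    # Scale by mean absolute value so series of different magnitudes mix sensibly.
    s = sum(abs(v) for v in x) / len(x)
    return [v / s for v in x] if s > 0 else list(x)

def ts_mixup(series_pool, max_k=3, length=64, alpha=1.5, rng=random):
    """Return one augmented series as a convex combination of up to max_k
    mean-scaled random windows drawn from the pool (sketch, not Algorithm 1)."""
    k = rng.randint(1, max_k)  # with max_k = 3, a single unmixed series appears w.p. 1/3
    windows = []
    for _ in range(k):
        ts = rng.choice(series_pool)
        start = rng.randrange(0, len(ts) - length + 1)
        windows.append(mean_scale(ts[start:start + length]))
    # Dirichlet(alpha, ..., alpha) mixing weights via normalized Gamma draws.
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    lam = [v / sum(g) for v in g]
    return [sum(l * w[i] for l, w in zip(lam, windows)) for i in range(length)]
```

Because the weights sum to one and each window is mean-scaled first, the augmented series stays on a comparable scale while blending patterns from different sources.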
Open Source Code | Yes | Code and Pretrained Models: https://github.com/amazon-science/chronos-forecasting
Open Datasets | Yes | To train and evaluate Chronos models, we collected a wide variety of publicly available datasets spanning various application domains including energy, transport, healthcare, retail, web, weather, finance, and with sampling frequencies ranging from 5 minutes up to yearly. The complete list of datasets, together with their respective sources and additional details, is given in Appendix B. In total, our dataset collection comprises 55 datasets from multiple sources, including the Monash Time Series Forecasting Repository (Godahewa et al., 2021), the M-competitions (Makridakis et al., 1979; Makridakis & Hibon, 2000; Makridakis et al., 2020; 2022), and public domain datasets from Kaggle. The datasets used in our experiments are available at https://huggingface.co/datasets/autogluon/chronos_datasets.
Dataset Splits | No | For both in-domain (I) and zero-shot (II) benchmark datasets, we used the last H ∈ ℕ+ observations of each time series as a held-out test set: all models are judged by the accuracy of their forecast on such held-out set, which no model had access to for training purposes. The prediction length H is task-specific (see Table 3 in Appendix B), where we define a task as a dataset and prediction length pair.
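The holdout scheme above is a standard last-H split per series; a minimal sketch (the helper name is illustrative):

```python
def split_last_h(series, h):
    """Hold out the last h observations of a series as the test set;
    models see only the context and are scored on the held-out part."""
    if h <= 0 or h >= len(series):
        raise ValueError("prediction length must satisfy 0 < h < len(series)")
    return series[:-h], series[-h:]
```

With a task-specific H per dataset, each (dataset, H) pair defines one evaluation task, and no model ever trains on the held-out tail.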
Hardware Specification | Yes | We used an AWS EC2 instance with 8 A100 (40GB) GPUs to train all Chronos models, and we employed faster floating point formats (TF32) and model compilation to speed up training.
Software Dependencies | No | The other model and training hyperparameters were set to their defaults used in the transformers library (Wolf et al., 2020).
Experiment Setup | Yes | We trained T5 models of 4 sizes, namely, Mini (20M), Small (46M), Base (200M) and Large (710M), and the GPT-2 base model (90M), on 10M TSMixup augmentations (see Section 4.1) generated from the 28 training datasets, with K = 3 in Algorithm 1, and 1M synthetic time series generated using Gaussian processes (see Section 4.2). Note that with this setup, original time series are adequately represented since they are included in the TSMixup augmentations with probability 1/3. We sampled time series from the augmentations and synthetic data in the ratio 9:1 during training. Each model is trained with an effective batch size of 256 sequences, using distributed data parallelism and gradient accumulation, whenever necessary. These sequences were constructed by slicing random windows from the time series, and then scaling and quantizing them into equal-sized bins within the interval [c1 = −15, cB = 15], as described in Section 3.1. We set the vocabulary size, Vts, to 4096, including the special tokens (PAD and EOS). The context length of the sequences was set to 512, the default for T5 models, and the prediction length was set to 64, a value greater than the prediction lengths of all tasks we consider in our evaluation. The models were optimized for 200K steps using the AdamW optimizer with a weight decay of 0.01. The learning rate was annealed linearly from its initial value of 0.001 to 0 over the training steps. The other model and training hyperparameters were set to their defaults used in the transformers library (Wolf et al., 2020).
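The scaling-and-quantization step described above (mean scaling, then equal-sized bins on [−15, 15], with a 4096-token vocabulary that includes PAD and EOS) can be sketched as follows. This is an illustrative simplification of the tokenization in Section 3.1 of the paper, not its exact implementation; the function name and the clipping of out-of-range values are assumptions.

```python
def tokenize(series, n_special=2, vocab=4096, lo=-15.0, hi=15.0):
    """Mean-scale a series, then map each value to one of vocab - n_special
    equal-sized bins on [lo, hi]; token ids start after the special tokens."""
    # Mean (absolute) scaling; fall back to 1.0 for an all-zero series.
    scale = sum(abs(v) for v in series) / len(series) or 1.0
    n_bins = vocab - n_special          # 4094 value bins with the defaults
    width = (hi - lo) / n_bins
    tokens = []
    for v in series:
        x = min(max(v / scale, lo), hi)  # clip to the quantization range (assumption)
        b = min(int((x - lo) / width), n_bins - 1)
        tokens.append(n_special + b)     # reserve ids 0..n_special-1 for PAD/EOS
    return tokens, scale
```

The scale is kept alongside the tokens so that sampled token ids can be mapped back to bin centers and rescaled to the original magnitude at prediction time.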