TOTEM: TOkenized Time Series EMbeddings for General Time Series Analysis

Authors: Sabera J Talukder, Yisong Yue, Georgia Gkioxari

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate TOTEM extensively over nearly 500 experiments on three commonly-studied time series tasks with real-world data: imputation (17 baselines, 12 datasets), anomaly detection (19 baselines, 25 datasets), and forecasting (14 baselines, 12 datasets). We conclude that TOTEM matches or outperforms existing state-of-the-art models in both the canonical specialist setting (i.e., training one model on one domain) as well as the generalist setting (i.e., training a single model on many domains), which demonstrates the efficacy of tokenization for general time series analysis. The open-source implementation is available here: https://github.com/SaberaTalukder/TOTEM; a video summary is available here: https://www.youtube.com/watch?v=OqrCpdb6MJk.
Researcher Affiliation Academia Sabera Talukder (EMAIL), Yisong Yue (EMAIL), Georgia Gkioxari (EMAIL), California Institute of Technology
Pseudocode No The paper describes the VQVAE training objective mathematically and the forecaster processes conceptually, and provides architectural diagrams (e.g., Figure 1, Figure 4). However, it does not include a distinct section or figure explicitly labeled as 'Pseudocode' or 'Algorithm' with structured, code-like steps for any procedure.
Open Source Code Yes The open-source implementation is available here: https://github.com/SaberaTalukder/TOTEM
Open Datasets Yes We evaluate TOTEM extensively over nearly 500 experiments on three commonly-studied time series tasks with real-world data: imputation (17 baselines, 12 datasets), anomaly detection (19 baselines, 25 datasets), and forecasting (14 baselines, 12 datasets). [...] For the in-domain testing regime, we test on 6 datasets, and for the zero-shot testing regime, we evaluate on an additional 5 datasets. We also perform additional evaluations in the appendix on the PhysioNet Challenge 2012 dataset. In total, we evaluate on 12 distinct datasets. See Table 2B for a summary. [...] Table 2: Imputation baselines and datasets. [...] B. Imputation Datasets [...] Weather (W), Zhou et al. (2023); Electricity (E), Zhou et al. (2023); ETTm1 (m1), Zhou et al. (2023); ETTm2 (m2), Zhou et al. (2023); ETTh1 (h1), Zhou et al. (2023); ETTh2 (h2), Zhou et al. (2023); Neuro2 (N2), Peterson et al. (2022); Neuro5 (N5), Peterson et al. (2022); Saugeen River Flow (R), Godahewa et al. (2021); U.S. Births (B), Godahewa et al. (2021); Sunspot (S), Godahewa et al. (2021); Appendix: PhysioNet, Silva et al. (2012)
Dataset Splits Yes We experiment with four canonical masking percentages at 12.5%, 25%, 37.5%, 50%, and report the resulting MSE and MAE. [...] The long-term forecasting task uses standardized input and output lengths across all datasets (in particular an input length of 96 timesteps and output lengths of 96, 192, 336, and 720 timesteps), as enforced by a large body of existing work Liu et al. (2023); Wu et al. (2022); Liu et al. (2022b); Zhou et al. (2022) among others. [...] Table 19: Long term vs. short term forecasting lookback and lookahead lengths. We see that long term forecasting is far more stereotyped, and therefore easier to build generalist models for, than short term forecasting. [...] Long Term Forecasting; In-Domain Testing All Datasets (enforced by us, Liu et al. (2023); Wu et al. (2022); Liu et al. (2022b); Zhou et al. (2022)) 96 96, 192, 336, 720 Long Term Forecasting; Zero Shot Testing All Datasets 96 96, 192, 336, 720
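The canonical imputation masking percentages quoted above (12.5%, 25%, 37.5%, 50%) can be reproduced with a short sketch. This is a hypothetical helper for illustration, assuming uniform random masking of timesteps; TOTEM's actual masking code may differ.

```python
import numpy as np

def random_mask(T: int, mask_pct: float, rng=None) -> np.ndarray:
    """Return a boolean mask of length T hiding mask_pct of timesteps.

    Hypothetical illustration of the canonical imputation masking
    percentages; not the authors' exact procedure.
    """
    rng = rng or np.random.default_rng(0)
    n_masked = int(round(T * mask_pct))
    idx = rng.choice(T, size=n_masked, replace=False)
    mask = np.zeros(T, dtype=bool)
    mask[idx] = True
    return mask

# With the standard 96-timestep input length, the four percentages
# hide 12, 24, 36, and 48 timesteps respectively.
for pct in (0.125, 0.25, 0.375, 0.5):
    print(pct, random_mask(96, pct).sum())
```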
Hardware Specification Yes We train using Adam with a base learning rate of 0.0001 and a one cycle learning rate scheduler in accordance with Nie et al. (2022) on A100s. [...] Table 31: Comparison of Parameters and Training Time between TOTEM (Ours) and GPT2 generalist models. [...] TOTEM (Ours) Training Time on 1 A100 [...] GPT2 Training Time on 1 A100
Software Dependencies No The paper mentions using the Adam optimizer and a one cycle learning rate scheduler, but it does not specify exact version numbers for any software libraries (e.g., PyTorch, TensorFlow, CUDA) or the programming language used.
Experiment Setup Yes In all experiments, we use a compression factor of F = 4, (see Table 32). [...] The forecaster is trained in a supervised fashion by minimizing three smooth L1 losses between predictions {y_i, μ_i, σ_i}_{i=1}^S and their ground truth values respectively. [...] We experiment with four canonical masking percentages at 12.5%, 25%, 37.5%, 50%, and report the resulting MSE and MAE. [...] VQVAE. For imputation, anomaly detection, and forecasting the VQVAE's number of residual layers = 2, residual hidden size = 64, and block hidden size = 128 for all datasets. Each residual block has 2 non-causal, non-dilated 1D convolutional layers. The residual blocks are paired with additional non-causal, non-dilated 1D convolutional layers, where the number of additional layers is determined by the desired compression factor. See Table 32 for more hyperparameter details. Downstream Forecaster. The downstream forecaster has two components: the transformer encoder that intakes codes and outputs a normalized time forecast, and the feedforward neural network that takes in time and outputs predictions for the forecast's mean and standard deviation. The downstream forecaster is a transformer encoder with a model dimension = 64, hidden dimension = 256, number of heads = 4, number of layers = 4. The transformer encoder applies a sin/cos positional embedding along the time dimension and applies its attention mechanism to each sensor independently. There is a single linear layer applied after the transformer encoder output. The feedforward neural network takes in the input time steps, and predicts the future's mean and standard deviation. [...] In imputation, anomaly detection, and forecasting the VQVAE is trained with a learning rate of 0.001 using the Adam optimizer, embedding dimension of 64, commitment cost of 0.25, and compression factor of 4; see Table 32 for more hyperparameters.
The codewords are uniformly randomly initialized over [-1/K, 1/K], where K is the number of codewords and D is the latent dimension. [...] In forecasting the downstream model is a transformer encoder with 4 layers and 4 attention heads and a feed-forward hidden dimension of 256. We train using Adam with a base learning rate of 0.0001 and a one cycle learning rate scheduler in accordance with Nie et al. (2022) on A100s. [...] Table 32: VQVAE Hyperparameters (A) Imputation generalist (All) and specialists. (B) Anomaly detection generalist (All) and specialists. The anomaly %s for all of the zero shot datasets are 2%. (C) Forecasting generalist (All) and specialists.
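The downstream forecaster described above can be sketched with the quoted hyperparameters (model dimension 64, 4 heads, 4 layers, feed-forward dimension 256, sin/cos positional embedding, single linear layer after the encoder). This is a minimal illustrative sketch, not the authors' implementation: the module names, the embedding lookup, the mean-pooling before the output head, and the codebook size are all assumptions.

```python
import math
import torch
import torch.nn as nn

class SinCosPositionalEncoding(nn.Module):
    # Standard sin/cos positional embedding applied along the time axis.
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):  # x: (batch, time, d_model)
        return x + self.pe[: x.size(1)]

class TokenForecaster(nn.Module):
    """Hypothetical sketch of a token-based downstream forecaster using
    the quoted hyperparameters; architecture details beyond those quoted
    (pooling, head, codebook size) are assumptions."""
    def __init__(self, codebook_size: int, horizon: int, d_model: int = 64,
                 nhead: int = 4, num_layers: int = 4, dim_ff: int = 256):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, d_model)
        self.pos = SinCosPositionalEncoding(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_ff,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Single linear layer after the transformer encoder output.
        self.head = nn.Linear(d_model, horizon)

    def forward(self, codes):  # codes: (batch * sensors, num_tokens) token ids
        h = self.encoder(self.pos(self.embed(codes)))
        return self.head(h.mean(dim=1))  # (batch * sensors, horizon)

model = TokenForecaster(codebook_size=256, horizon=96)
# 96 input timesteps at compression factor F = 4 yields 24 tokens.
tokens = torch.randint(0, 256, (8, 24))
out = model(tokens)
print(out.shape)  # torch.Size([8, 96])
```

Attention is applied to each sensor independently, so sensors can be folded into the batch dimension as shown.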