TimeSeAD: Benchmarking Deep Multivariate Time-Series Anomaly Detection

Authors: Dennis Wagner, Tobias Michels, Florian C.F. Schulz, Arjun Nair, Maja Rudolph, Marius Kloft

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We provide the largest benchmark of deep multivariate time-series anomaly detection methods to date. We focus on deep-learning based methods and multivariate data, a common setting in modern anomaly detection." The paper presents the largest comprehensive benchmark so far for multivariate time-series AD, comparing 28 deep-learning methods on 21 datasets.
Researcher Affiliation: Collaboration. Dennis Wagner (EMAIL), RPTU Kaiserslautern-Landau; Tobias Michels (EMAIL), RPTU Kaiserslautern-Landau; Florian C.F. Schulz (EMAIL), TU Berlin; Arjun Nair (EMAIL), RPTU Kaiserslautern-Landau; Maja Rudolph (EMAIL), Bosch AI; Marius Kloft (EMAIL), RPTU Kaiserslautern-Landau.
Pseudocode: No. The paper describes its methods and their mathematical formulations but includes no clearly labeled pseudocode or algorithm blocks. For example, Section 4.2 introduces the calculation of TRec and Theorem 1, but these are mathematical definitions, not pseudocode.
Open Source Code: Yes. "We provide all implementations and analysis tools in a new comprehensive library for Time Series Anomaly Detection, called TimeSeAD." https://github.com/wagner-d/TimeSeAD
Open Datasets: Yes. "Our analysis first examines some of the most commonly used datasets. These datasets are the backbone of time-series AD evaluation and have been used in virtually all major comparisons in the field (Schmidl et al., 2022; Garg et al., 2021; Choi et al., 2021; Jacob et al., 2020). Our analysis reveals several significant flaws in these datasets. Second, we investigate the shortcomings of frequently used evaluation metrics, particularly the point-wise F1-score and its adaptations. Lastly, we examine the inconsistencies and other problems within established evaluation protocols."
Dataset Splits: Yes. "We split the unlabelled data into two distinct sets Dtrain and Dval1 such that Dtrain contains 75% of the available time points and Dval1 contains 25%. ... we also need to split the labeled data into another validation set Dval2 and a test set Dtest. ... we attempt to mitigate it by performing a modified 5-fold cross validation. For that, we split the time series into five equally sized folds and use each fold as the validation set once. The remaining folds, excluding the ones directly next to the validation fold to reduce possible statistical interdependencies, form the test set."
Hardware Specification: No. The paper describes the computational framework and experiment management but gives no details about the hardware (e.g., GPU/CPU models) used to run the experiments.
Software Dependencies: Yes. "We implemented all methods and datasets as part of our TimeSeAD library based on PyTorch (Paszke et al., 2019). To keep track of our training and evaluation experiments, we also developed a plugin for our library based on sacred (Greff et al., 2017)."
Experiment Setup: Yes. "To perform grid search without introducing significant bias in the evaluation, we remove part of the test set to tune the parameters on, before evaluating with the best-performing parameters on the rest. Because of distributional changes in the test set, a fixed, arbitrary split can introduce further bias. To mitigate its effects, we instead perform cross-validation on the test set, splitting it into multiple folds and using each fold once as a validation set. Finally, to mitigate the impact of temporal dependencies between folds, we remove the neighboring folds of each validation set. To ensure a fair evaluation, we choose a maximum training time and adjust the size of the parameter grid such that each method can be fully evaluated within this time frame. We use this evaluation protocol for all methods."
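The shortcomings of the point-wise F1-score mentioned in the Open Datasets row can be made concrete. The sketch below is our own illustration, not code from the TimeSeAD library; it shows how a common segment-based adaptation (often called "point-adjust"), which counts a whole ground-truth anomaly segment as detected if any single point in it is flagged, can inflate the score.

```python
# Illustration of why point-wise F1 and its segment-based adaptations can
# disagree. All names here are our own; this is NOT TimeSeAD library code.

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def pointwise_counts(y_true, y_pred):
    """Count TP/FP/FN by comparing labels point by point."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, fp, fn

def point_adjust(y_true, y_pred):
    """Segment-based adaptation: if any point inside a ground-truth anomaly
    segment is flagged, count the whole segment as detected."""
    adjusted = list(y_pred)
    i, n = 0, len(y_true)
    while i < n:
        if y_true[i]:
            j = i
            while j < n and y_true[j]:
                j += 1                      # find the end of this segment
            if any(y_pred[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted

# One 10-point anomaly; the detector flags only its first point.
y_true = [0] * 5 + [1] * 10 + [0] * 5
y_pred = [0] * 5 + [1] + [0] * 14

plain = f1_score(*pointwise_counts(y_true, y_pred))        # ~0.18
adjusted = f1_score(*pointwise_counts(y_true, point_adjust(y_true, y_pred)))  # 1.0
```

Detecting a single point of a 10-point anomaly raises the score from roughly 0.18 to a perfect 1.0, which is the kind of metric inflation the paper criticizes.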
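The split protocol quoted in the Dataset Splits row can be sketched over index arrays as follows. This is a minimal sketch under our own naming (`split_unlabelled`, `modified_kfold` are not the library's API): 75% of the unlabelled points form Dtrain, the rest Dval1, and the labeled series is cut into five equal folds where each fold serves once as validation while its directly adjacent folds are dropped from the test set.

```python
import numpy as np

def split_unlabelled(n_points):
    """75/25 split of the unlabelled time points into Dtrain and Dval1,
    keeping temporal order (no shuffling)."""
    cut = int(0.75 * n_points)
    return np.arange(cut), np.arange(cut, n_points)

def modified_kfold(n_points, n_folds=5):
    """Modified k-fold CV over the labeled series: each fold is used once as
    the validation set; folds directly adjacent to it are excluded from the
    test set to reduce temporal interdependence. Assumes n_folds >= 4 so the
    test set is never empty."""
    folds = np.array_split(np.arange(n_points), n_folds)
    for i in range(n_folds):
        val = folds[i]
        # keep only folds at distance > 1 from the validation fold
        test = np.concatenate([folds[j] for j in range(n_folds) if abs(j - i) > 1])
        yield val, test
```

For a 100-point series with 5 folds, the first split yields a 20-point validation fold and a 60-point test set (folds 2 to 4), while the middle fold keeps only the two outermost folds (40 points) for testing.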
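Similarly, the grid-search procedure described in the Experiment Setup row (tune on the validation part of each test-set fold, then score the best configuration on the remaining part) can be sketched like this. The `score_fn(params, indices)` interface is hypothetical, not the TimeSeAD API, and the fold pairs are assumed to come from a neighbor-excluding split as described in the paper.

```python
import itertools
import numpy as np

def grid_search_cv(score_fn, param_grid, folds):
    """For each (val, test) fold pair, pick the configuration with the best
    validation score, then score it on the test indices; return the mean
    test score. `score_fn(params, indices) -> float`, higher is better
    (hypothetical interface)."""
    # expand the grid, e.g. {"lr": [0.1, 0.01]} -> [{"lr": 0.1}, {"lr": 0.01}]
    configs = [dict(zip(param_grid, values))
               for values in itertools.product(*param_grid.values())]
    test_scores = []
    for val_idx, test_idx in folds:
        best = max(configs, key=lambda params: score_fn(params, val_idx))
        test_scores.append(score_fn(best, test_idx))
    return float(np.mean(test_scores))
```

The paper's time-budget constraint would enter here as a cap on `len(configs)` per method, chosen so the whole loop finishes within the fixed maximum training time.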