TimeSeAD: Benchmarking Deep Multivariate Time-Series Anomaly Detection
Authors: Dennis Wagner, Tobias Michels, Florian C.F. Schulz, Arjun Nair, Maja Rudolph, Marius Kloft
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide the largest benchmark of deep multivariate time-series anomaly detection methods to date. We focus on deep-learning based methods and multivariate data, a common setting in modern anomaly detection. We present the largest comprehensive benchmark so far for multivariate time-series AD, comparing 28 deep-learning methods on 21 datasets. |
| Researcher Affiliation | Collaboration | Dennis Wagner (RPTU Kaiserslautern-Landau); Tobias Michels (RPTU Kaiserslautern-Landau); Florian C.F. Schulz (TU Berlin); Arjun Nair (RPTU Kaiserslautern-Landau); Maja Rudolph (Bosch AI); Marius Kloft (RPTU Kaiserslautern-Landau) |
| Pseudocode | No | The paper describes methods and their mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks. For example, in Section 4.2 it introduces the calculation of TRec and Theorem 1, but this is a mathematical definition, not pseudocode. |
| Open Source Code | Yes | We provide all implementations and analysis tools in a new comprehensive library for Time Series Anomaly Detection, called TimeSeAD (https://github.com/wagner-d/TimeSeAD). |
| Open Datasets | Yes | Our analysis first examines some of the most commonly used datasets. These datasets are the backbone of time-series AD evaluation and have been used in virtually all major comparisons in the field (Schmidl et al., 2022; Garg et al., 2021; Choi et al., 2021; Jacob et al., 2020). Our analysis reveals several significant flaws in these datasets. Second, we investigate the shortcomings of frequently used evaluation metrics, particularly the point-wise F1-score and its adaptations. Lastly, we examine the inconsistencies and other problems within established evaluation protocols. |
| Dataset Splits | Yes | We split the unlabelled data into two distinct sets Dtrain and Dval1 such that Dtrain contains 75% of the available time points and Dval1 contains 25%. ... we also need to split the labeled data into another validation set Dval2 and a test set Dtest. ... we attempt to mitigate it by performing a modified 5-fold cross validation. For that, we split the time series into five equally sized folds and use each fold as the validation set once. The remaining folds, excluding the ones directly next to the validation fold to reduce possible statistical interdependencies, form the test set. |
| Hardware Specification | No | The paper describes the computational framework and experiment management but does not provide specific details about the hardware (e.g., GPU/CPU models) used for running the experiments. |
| Software Dependencies | Yes | We implemented all methods and datasets as part of our TimeSeAD library based on PyTorch (Paszke et al., 2019). To keep track of our training and evaluation experiments, we also developed a plugin for our library based on sacred (Greff et al., 2017). |
| Experiment Setup | Yes | To perform grid search without introducing significant bias in the evaluation, we remove part of the test set to tune the parameters on, before evaluating with the best performing parameters on the rest. Because of distributional changes in the test set, a fixed, arbitrary split can introduce further bias. To mitigate its effects, instead, we perform cross-validation on the test set, splitting it into multiple folds and using each fold once as a validation set. Finally, to mitigate the impact of temporal dependencies between folds, we remove the neighboring folds of each validation set. To ensure a fair evaluation, we choose a maximum training time and adjust the size of the parameter grid, such that each method can be fully evaluated within this time frame. We use this evaluation protocol for all methods. |
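The modified 5-fold cross-validation described in the Dataset Splits and Experiment Setup rows can be sketched as follows. This is an illustrative reconstruction of the protocol (contiguous equal folds, each fold used once as validation, test set excluding the folds directly adjacent to the validation fold), not code from the TimeSeAD library; the function name and return format are assumptions.

```python
import numpy as np

def modified_kfold_splits(n_points: int, n_folds: int = 5):
    """Yield (validation indices, test indices) pairs for a time series of
    length n_points, following the paper's modified k-fold protocol:
    contiguous equally sized folds, where the test set excludes the folds
    directly next to the validation fold to reduce temporal dependence.
    Illustrative sketch; names and signature are not from TimeSeAD."""
    fold_edges = np.linspace(0, n_points, n_folds + 1, dtype=int)
    folds = [np.arange(fold_edges[i], fold_edges[i + 1])
             for i in range(n_folds)]
    for v in range(n_folds):
        # exclude the validation fold and its immediate neighbors from test
        excluded = {v - 1, v, v + 1}
        test_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j not in excluded])
        yield folds[v], test_idx

# Example: 100 time points, 5 folds of 20 points each.
for val_idx, test_idx in modified_kfold_splits(100):
    # validation and test never overlap, by construction
    assert len(set(val_idx.tolist()) & set(test_idx.tolist())) == 0
```

Note the asymmetry this induces: edge folds (first and last) have only one neighbor, so their test sets contain three folds, while interior validation folds leave only two folds for testing.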