TimeSeAD: Benchmarking Deep Multivariate Time-Series Anomaly Detection
Authors: Dennis Wagner, Tobias Michels, Florian C.F. Schulz, Arjun Nair, Maja Rudolph, Marius Kloft
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide the largest benchmark of deep multivariate time-series anomaly detection methods to date. We focus on deep-learning based methods and multivariate data, a common setting in modern anomaly detection. We present the largest comprehensive benchmark so far for multivariate time-series AD, comparing 28 deep-learning methods on 21 datasets. |
| Researcher Affiliation | Collaboration | Dennis Wagner (RPTU Kaiserslautern-Landau); Tobias Michels (RPTU Kaiserslautern-Landau); Florian C.F. Schulz (TU Berlin); Arjun Nair (RPTU Kaiserslautern-Landau); Maja Rudolph (Bosch AI); Marius Kloft (RPTU Kaiserslautern-Landau) |
| Pseudocode | No | The paper describes methods and their mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks. For example, in Section 4.2 it introduces the calculation of TRec and Theorem 1, but this is a mathematical definition, not pseudocode. |
| Open Source Code | Yes | We provide all implementations and analysis tools in a new comprehensive library for Time Series Anomaly Detection, called TimeSeAD (https://github.com/wagner-d/TimeSeAD). |
| Open Datasets | Yes | Our analysis first examines some of the most commonly used datasets. These datasets are the backbone of time-series AD evaluation and have been used in virtually all major comparisons in the field (Schmidl et al., 2022; Garg et al., 2021; Choi et al., 2021; Jacob et al., 2020). Our analysis reveals several significant flaws in these datasets. Second, we investigate the shortcomings of frequently used evaluation metrics, particularly the point-wise F1-score and its adaptations. Lastly, we examine the inconsistencies and other problems within established evaluation protocols. |
| Dataset Splits | Yes | We split the unlabelled data into two distinct sets Dtrain and Dval1 such that Dtrain contains 75% of the available time points and Dval1 contains 25%. ... we also need to split the labeled data into another validation set Dval2 and a test set Dtest. ... we attempt to mitigate it by performing a modified 5-fold cross validation. For that, we split the time series into five equally sized folds and use each fold as the validation set once. The remaining folds, excluding the ones directly next to the validation fold to reduce possible statistical interdependencies, form the test set. |
| Hardware Specification | No | The paper describes the computational framework and experiment management but does not provide specific details about the hardware (e.g., GPU/CPU models) used for running the experiments. |
| Software Dependencies | Yes | We implemented all methods and datasets as part of our TimeSeAD library based on PyTorch (Paszke et al., 2019). To keep track of our training and evaluation experiments, we also developed a plugin for our library based on sacred (Greff et al., 2017). |
| Experiment Setup | Yes | To perform grid search without introducing significant bias in the evaluation, we remove part of the test set to tune the parameters on, before evaluating with the best performing parameters on the rest. Because of distributional changes in the test set, a fixed, arbitrary split can introduce further bias. To mitigate its effects, instead, we perform cross-validation on the test set, splitting it into multiple folds and using each fold once as a validation set. Finally, to mitigate the impact of temporal dependencies between folds, we remove the neighboring folds of each validation set. To ensure a fair evaluation, we choose a maximum training time and adjust the size of the parameter grid, such that each method can be fully evaluated within this time frame. We use this evaluation protocol for all methods. |
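The modified 5-fold cross-validation described in the Dataset Splits and Experiment Setup rows can be sketched as follows. This is an illustrative reconstruction of the protocol (contiguous equal folds, each fold used once as validation, test set excluding the folds directly adjacent to the validation fold), not code from the TimeSeAD library; the function name and return format are assumptions.

```python
import numpy as np

def modified_kfold_splits(n_points: int, n_folds: int = 5):
    """Yield (validation indices, test indices) pairs for a time series of
    length n_points, following the paper's modified k-fold protocol:
    contiguous equally sized folds, where the test set excludes the folds
    directly next to the validation fold to reduce temporal dependence.
    Illustrative sketch; names and signature are not from TimeSeAD."""
    fold_edges = np.linspace(0, n_points, n_folds + 1, dtype=int)
    folds = [np.arange(fold_edges[i], fold_edges[i + 1])
             for i in range(n_folds)]
    for v in range(n_folds):
        # exclude the validation fold and its immediate neighbors from test
        excluded = {v - 1, v, v + 1}
        test_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j not in excluded])
        yield folds[v], test_idx

# Example: 100 time points, 5 folds of 20 points each.
for val_idx, test_idx in modified_kfold_splits(100):
    # validation and test never overlap, by construction
    assert len(set(val_idx.tolist()) & set(test_idx.tolist())) == 0
```

Note the asymmetry this induces: edge folds (first and last) have only one neighbor, so their test sets contain three folds, while interior validation folds leave only two folds for testing.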