SPADE: Semi-supervised Anomaly Detection under Distribution Mismatch

Authors: Jinsung Yoon, Kihyuk Sohn, Chun-Liang Li, Sercan O Arik, Tomas Pfister

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to highlight the benefits of the proposed method, SPADE, in various practical settings of semi-supervised learning with distribution mismatch. We consider multiple anomaly detection datasets for image and tabular data types. As image data, we use MVTec anomaly detection (Bergmann et al., 2019) and Magnetic Tile datasets (Huang et al., 2020). As tabular data, we use Covertype, Thyroid, and Drug datasets (see Appendix for detailed data description). In Sec. 5.4, we further utilize two real-world fraud detection datasets (Kaggle credit and Xente) to evaluate the performance of SPADE. We run 5 independent experiments and report average values (standard deviations can be found in Appendix C). We use AUC as the evaluation metric.
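The evaluation protocol quoted above (5 independent runs, mean AUC reported, standard deviations deferred to the appendix) can be sketched as follows. `train_and_score` is a hypothetical stand-in for training the detector with a given seed and returning test-set anomaly scores; the AUC here is the standard rank-based (Mann-Whitney) formulation:

```python
def auc(y_true, scores):
    """Rank-based AUC: probability that a positive sample
    outscores a negative one (ties count as half a win)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_std_auc(train_and_score, y_test, n_runs=5):
    """Run n_runs independent experiments with different seeds and
    report the mean test AUC (and the std dev, which the paper
    defers to Appendix C)."""
    aucs = [auc(y_test, train_and_score(seed)) for seed in range(n_runs)]
    mean = sum(aucs) / n_runs
    std = (sum((a - mean) ** 2 for a in aucs) / n_runs) ** 0.5
    return mean, std
```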
Researcher Affiliation | Industry | Google Cloud AI
Pseudocode | Yes | Algorithm 1: Semi-supervised Pseudo-labeler Anomaly Detection with Ensembling (SPADE).
Open Source Code | No | The paper provides GitHub links for the baselines it uses (e.g., VIME, FixMatch, DANN, pulearn, pytorch-cutpaste) in footnotes in Appendix B.6, but gives no link or explicit statement about open-sourcing SPADE itself. Concrete access to source code for the described methodology is therefore not provided.
Open Datasets | Yes | We use MVTec anomaly detection (Bergmann et al., 2019) and Magnetic Tile datasets (Huang et al., 2020). As tabular data, we use Covertype, Thyroid, and Drug datasets (see Appendix for detailed data description). In Sec. 5.4, we further utilize two real-world fraud detection datasets (Kaggle credit and Xente). Kaggle credit card fraud (https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud); Xente fraud detection (https://zindi.africa/competitions/xente-fraud-detection-challenge/data). Appendix B.1: Thyroid data (https://archive.ics.uci.edu/ml/datasets/thyroid+disease), Drug data (https://archive.ics.uci.edu/ml/datasets/Drug+consumption+%28quantified%29), Covertype data (https://archive.ics.uci.edu/ml/datasets/covertype), MVTec data (https://www.mvtec.com/company/research/datasets/mvtec-ad), Magnetic Tile data (https://github.com/abin24/Magnetic-tile-defect-datasets.)
Dataset Splits | Yes | In all experiments, unless the dataset comes with its own train and test split, we randomly divide the dataset into disjoint train and test data. Then, we further divide the training data into disjoint labeled and unlabeled data. Note that we only provide 5% of the data as labeled data for tabular datasets and 20% for image datasets, for the scenario of new types of anomalies. For Thyroid data, we use the pre-defined training and testing dataset division. For Drug data, we divide the entire dataset into training (50%) and testing (50%). For Covertype data, we divide the entire dataset into training (50%) and testing (50%). For MVTec data, we first mix the given training and testing data and divide them into training (80%) and testing (20%). For the Magnetic Tile dataset, we mix the given training and testing data and divide them into training (80%) and testing (20%). In our experiments [for fraud detection], we split the train and test data based on the measurement time. The latest samples are included in the testing data (50%) and early acquired data in the training data (50%). We further divide the training data into labeled and unlabeled data. Early acquired data are included in the labeled data (5%-20%), while later acquired data are included in the unlabeled data (80%-95%).
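The time-ordered split described for the fraud datasets (latest samples held out for test, earliest training samples labeled, the remainder unlabeled) might look like the following sketch. Function and argument names are hypothetical; the default fractions come from the ranges quoted above:

```python
import numpy as np

def time_based_split(X, t, labeled_frac=0.05, test_frac=0.5):
    """Split by measurement time: latest samples -> test,
    earliest training samples -> labeled, remainder -> unlabeled
    (per the fraud-detection setup; labeled_frac is 5%-20% of
    the training portion)."""
    order = np.argsort(t)                       # oldest samples first
    n_test = int(len(order) * test_frac)
    train, test = order[:len(order) - n_test], order[len(order) - n_test:]
    n_lab = int(len(train) * labeled_frac)      # earliest slice gets labels
    labeled, unlabeled = train[:n_lab], train[n_lab:]
    return X[labeled], X[unlabeled], X[test]
```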
Hardware Specification | Yes | All the experiments are done on a single V100 GPU.
Software Dependencies | No | The paper mentions various software components used (e.g., scikit-learn, VIME, FixMatch, DANN, Weighted Elkanoto, Bagging PU, CutPaste, Gaussian Distribution Estimator) but does not provide version numbers for these libraries or frameworks as used in the authors' implementation of SPADE. Although some baselines link to GitHub repositories, the exact versions used for the experiments are not specified.
Experiment Setup | Yes | We set both α and β as 1.0 for the experiments. Training loss is used as the convergence criterion: if no improvement is observed in the loss for 5 epochs, we treat the models as converged. For image data, we use ResNet-18 as the base network architecture. For representation learning, we incorporate CutPaste (Li et al., 2021) for the MVTec and Magnetic Tile datasets, following all training details in (Li et al., 2021), including all hyper-parameters. For tabular data, we use a two-layer perceptron as the base network architecture, where the hidden dimension is half of the original feature dimension. Pseudo-labelers consist of 5 Gaussian Distribution Estimator (GDE) based OCCs.
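One plausible reading of the "5 Gaussian Distribution Estimator (GDE) based OCCs" is five Gaussians, each fit to a bootstrap resample of the normal-data features and scored by squared Mahalanobis distance, with the ensemble averaging the five scores. This is an illustrative sketch under those assumptions, not the authors' implementation; all names and the bootstrap choice are hypothetical:

```python
import numpy as np

class GDE:
    """Gaussian Distribution Estimator one-class classifier:
    fit a multivariate Gaussian to normal features and score by
    squared Mahalanobis distance (higher = more anomalous)."""
    def fit(self, X):
        self.mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # ridge for stability
        self.prec = np.linalg.inv(cov)
        return self

    def score(self, X):
        d = X - self.mu
        # per-row d @ prec @ d.T, i.e. squared Mahalanobis distance
        return np.einsum("ij,jk,ik->i", d, self.prec, d)

def gde_ensemble_scores(X_normal, X_eval, n_models=5, seed=0):
    """Average the anomaly scores of n_models GDEs, each fit on a
    bootstrap resample of the (pseudo-)normal training features."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_normal), len(X_normal))
        scores.append(GDE().fit(X_normal[idx]).score(X_eval))
    return np.mean(scores, axis=0)
```

Samples far from the normal cluster receive higher averaged scores; a pseudo-labeler built this way can then threshold those scores to label the unlabeled data.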