Unsupervised Anomaly Detection for Tabular Data Using Deep Noise Evaluation

Authors: Wei Dai, Kai Hwang, Jicong Fan

AAAI 2025

Reproducibility assessment. Each entry lists the variable, the result, and the supporting excerpt from the paper.
Research Type: Experimental. Extensive experiments on more than 60 benchmark datasets demonstrate the effectiveness of the proposed method in comparison with 12 UAD baselines. Our method obtains a 92.27% AUC score and a 1.68 ranking score on average.
Researcher Affiliation: Academia. Wei Dai, Kai Hwang, Jicong Fan*, The Chinese University of Hong Kong, Shenzhen, China. EMAIL, EMAIL, EMAIL
Pseudocode: Yes. A detailed process for generating noised samples on a batch of data is given in Algorithm 1. According to Algorithm 1, the noise-generation time complexity is O(bd). In contrast, other methods involving perturbation (Cai and Fan 2022; Qiu et al. 2021) or adversarial samples (Goyal et al. 2020) have time complexity O(bdW), where W is the workload of a neural network module. Compared with them, our noise generation is more efficient, as no learnable parameter is required. A comparative study on time cost is shown in Appendix I. In Figure 2, the lower pathway illustrates the noise synthesis mechanism, where a noise vector ϵ (mean 0, standard deviation σ) is randomly generated and added to the input x, producing a noise-augmented variant x̂ = x + ϵ. Multiple noised samples can be generated from a single input. Both x and x̂ are processed by the noise evaluation network hθ, which is optimized to regress towards zero for x and to estimate the noise magnitude |ϵ| for x̂. Training details are in Algorithm 2 of Appendix B.
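The noised-sample generation described in this excerpt can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' Algorithm 1; the function names and the per-sample uniform draw of σ up to σ_max are assumptions:

```python
import numpy as np

def make_noised_batch(x, sigma_max=2.0, noise_ratio=1.0, rng=None):
    """Produce a noise-augmented copy x_hat = x + eps of a batch x of shape (b, d).

    eps is zero-mean Gaussian with one noise level sigma per sample,
    broadcast over the d features; cost is O(bd) with no learnable parameters.
    """
    if rng is None:
        rng = np.random.default_rng()
    b, d = x.shape
    sigma = noise_ratio * sigma_max * rng.uniform(size=(b, 1))
    eps = sigma * rng.standard_normal((b, d))
    return x + eps, eps

def regression_targets(x, eps):
    """The evaluation network h_theta regresses to 0 on clean x and to |eps| on x + eps."""
    return np.zeros_like(x), np.abs(eps)
```

Because the noise is drawn directly rather than produced by a learned module, the cost stays linear in the batch and feature sizes, matching the O(bd) claim above.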
Open Source Code: No. The paper does not provide an explicit statement or a direct link to the source code for its own methodology. It mentions using code provided by the authors of baseline methods (DPAD, PLAD, SCAD, NeuTraLAD), but not for the proposed method.
Open Datasets: Yes. We evaluate our method in two common settings: unsupervised anomaly detection and one-class classification. In the anomaly detection setting, where anomalous samples are few, we use 47 real-world tabular datasets from (Han et al. 2022) (https://github.com/Minqi824/ADBench/), covering domains such as healthcare, image processing, and finance. For the one-class classification setting, we collected 25 benchmark tabular datasets used in previous works (Pang et al. 2021; Shenkar and Wolf 2022). The raw data was sourced from the UCI Machine Learning Repository (Kelly, Longjohn, and Nottingham) and the datasets' official websites.
Dataset Splits: Yes. If only a training set is available, we randomly split off 50% of the normal samples for training and use the rest, together with the anomalous data, for testing. Data is standardized using the training set's mean and standard deviation.
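A minimal sketch of this split-and-standardize protocol, assuming the label convention 0 = normal, 1 = anomalous (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def split_and_standardize(normal, anomalous, seed=0):
    """Train on a random 50% of normal samples; test on the remaining
    normal samples plus all anomalies. Standardize with the training
    set's mean and standard deviation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(normal))
    half = len(normal) // 2
    train = normal[idx[:half]]
    test = np.vstack([normal[idx[half:]], anomalous])
    y_test = np.r_[np.zeros(len(normal) - half), np.ones(len(anomalous))]
    mu = train.mean(axis=0)
    sd = train.std(axis=0) + 1e-8  # guard against constant features
    return (train - mu) / sd, (test - mu) / sd, y_test
```

Note that the test statistics are never used for standardization, so no information leaks from the test set into training.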
Hardware Specification: Yes. All experiments are implemented in PyTorch (Paszke et al. 2017) on an NVIDIA Tesla V100 GPU and Intel Xeon Gold 6200 CPU platform.
Software Dependencies: No. The paper mentions 'PyTorch (Paszke et al. 2017)' and 'PyOD, a Python library developed by (Zhao, Nasrullah, and Li 2019)', but does not provide specific version numbers for these or any other software components used.
Experiment Setup: Yes. The network is optimized by AMSGrad (Reddi, Kale, and Kumar 2018) with a 10⁻⁴ learning rate and 5×10⁻⁴ weight decay. In the reported results, we adopt Gaussian noise in the noise generation, a maximum noise level σ_max = 2, and m = 3 different noise distributions. To enlarge the number of noised samples, we generate 3 noise-augmented instances from the same input instance with 3 different noise ratios (0.5, 0.8, and 1.0) at each epoch. Unless specified otherwise, we train the model for 500 epochs and manually decay the learning rate at the 100th epoch by a factor of 0.1.
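These hyperparameters map onto a short PyTorch setup. The model below is a placeholder, and `MultiStepLR` is one assumed way to realize the manual decay at epoch 100; the paper does not specify the exact mechanism:

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the noise evaluation network

# AMSGrad with learning rate 1e-4 and weight decay 5e-4, as reported
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             weight_decay=5e-4, amsgrad=True)

# decay the learning rate by a factor of 0.1 at the 100th of 500 epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[100], gamma=0.1)

EPOCHS = 500
NOISE_RATIOS = (0.5, 0.8, 1.0)     # 3 noised variants per input per epoch
SIGMA_MAX = 2.0                    # maximum noise level sigma_max
NUM_NOISE_DISTS = 3                # m different noise distributions
```

Calling `scheduler.step()` once per epoch reproduces the stated schedule: the learning rate stays at 1e-4 through epoch 99 and drops to 1e-5 from epoch 100 onward.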