Bagged Regularized k-Distances for Anomaly Detection
Authors: Yuchao Cai, Hanfang Yang, Yuheng Ma, Hanyuan Hang
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the practical side, we conduct numerical experiments to illustrate the insensitivity of the parameter selection of our algorithm compared with other state-of-the-art distance-based methods. Furthermore, our method achieves superior performance on real-world datasets with the introduced bagging technique compared to other approaches. [...] Section 5 presents numerical experiments. |
| Researcher Affiliation | Collaboration | Yuchao Cai EMAIL Department of Statistics and Data Science National University of Singapore 117546, Singapore [...] Hanyuan Hang EMAIL Hong Kong Research Institute Contemporary Amperex Technology (Hong Kong) Limited Hong Kong Science Park, New Territories, Hong Kong |
| Pseudocode | Yes | Algorithm 1: Surrogate Risk Minimization (SRM) [...] Algorithm 2: Bagged Regularized k-Distances for Anomaly Detection (BRDAD) |
| Open Source Code | No | The paper does not provide an explicit statement or link to the source code for the methodology described in this paper. |
| Open Datasets | Yes | To provide an extensive experimental evaluation, we use the latest anomaly detection benchmark repository named ADBench established by Han et al. (2022). |
| Dataset Splits | No | The paper mentions categorizing datasets into small, medium, and large based on sample size and sets the number of bagging rounds (B) accordingly. It also states, "In practice, when B is fixed, we randomly divide the data into B subsets, each containing either n/B or n/B + 1 samples." However, it does not provide specific percentages or absolute counts for training, validation, and test splits for the overall experimental evaluation on the ADBench datasets. |
| Hardware Specification | No | The paper discusses computational efficiency and parallel computation but does not specify any particular hardware components (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using "the implementation of the Python package PyOD with its default parameters" for comparison methods like k-NN, LOF, and OCSVM, and "the author's implementation" for DTM and PIDForest. However, it does not specify version numbers for Python or any of these packages, which is necessary for reproducibility. |
| Experiment Setup | Yes | (i) BRDAD is our proposed algorithm, with details provided in Algorithm 2. The choice of B depends on the sample size: for n ∈ (0, 10,000], (10,000, 50,000], and (50,000, +∞), we set B = 1, 5, and 10, respectively. [...] (ii) Distance-To-Measure (DTM) (Gu et al., 2019) [...] the number of neighbors k is fixed to be k = 0.03 × sample size. [...] (v) Partial Identification Forest (PIDForest) (Gopalan et al., 2019) [...] with the number of trees T = 50, the number of buckets B = 5, and the depth of trees p = 10 suggested by the authors. |
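The bagging setup quoted above (choose B from the sample size, then randomly divide the data into B subsets of n/B or n/B + 1 samples) can be sketched as follows. This is a hypothetical illustration, not the authors' code; the function names `choose_B` and `random_partition` are invented for this sketch.

```python
import numpy as np

def choose_B(n):
    """Hypothetical helper: pick the number of bagging rounds B from the
    sample-size rule quoted in the paper (B = 1, 5, or 10)."""
    if n <= 10_000:
        return 1
    if n <= 50_000:
        return 5
    return 10

def random_partition(X, B, seed=None):
    """Randomly split the rows of X into B subsets whose sizes differ by
    at most one (n/B or n/B + 1 samples each), as described in the paper."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    return [X[part] for part in np.array_split(idx, B)]

# Toy usage: 23 samples, forced into B = 5 subsets for illustration.
X = np.random.default_rng(0).normal(size=(23, 3))
subsets = random_partition(X, 5, seed=0)
print([len(s) for s in subsets])  # subset sizes differ by at most one
```

With n = 23 and B = 5, `np.array_split` yields subsets of size 5 or 4, matching the paper's n/B-versus-n/B + 1 description; the per-subset distance computations could then run in parallel, which is the computational advantage the paper highlights.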