Bagged Regularized k-Distances for Anomaly Detection
Authors: Yuchao Cai, Hanfang Yang, Yuheng Ma, Hanyuan Hang
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the practical side, we conduct numerical experiments to illustrate the insensitivity of the parameter selection of our algorithm compared with other state-of-the-art distance-based methods. Furthermore, our method achieves superior performance on real-world datasets with the introduced bagging technique compared to other approaches. [...] Section 5 presents numerical experiments. |
| Researcher Affiliation | Collaboration | Yuchao Cai EMAIL Department of Statistics and Data Science National University of Singapore 117546, Singapore [...] Hanyuan Hang EMAIL Hong Kong Research Institute Contemporary Amperex Technology (Hong Kong) Limited Hong Kong Science Park, New Territories, Hong Kong |
| Pseudocode | Yes | Algorithm 1: Surrogate Risk Minimization (SRM) [...] Algorithm 2: Bagged Regularized k-Distances for Anomaly Detection (BRDAD) |
| Open Source Code | No | The paper does not provide an explicit statement or link to the source code for the methodology described in this paper. |
| Open Datasets | Yes | To provide an extensive experimental evaluation, we use the latest anomaly detection benchmark repository named ADBench established by Han et al. (2022). |
| Dataset Splits | No | The paper mentions categorizing datasets into small, medium, and large based on sample size and sets the number of bagging rounds (B) accordingly. It also states, "In practice, when B is fixed, we randomly divide the data into B subsets, each containing either n/B or n/B + 1 samples." However, it does not provide specific percentages or absolute counts for training, validation, and test splits for the overall experimental evaluation on the ADBench datasets. |
| Hardware Specification | No | The paper discusses computational efficiency and parallel computation but does not specify any particular hardware components (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using "the implementation of the Python package PyOD with its default parameters" for comparison methods like k-NN, LOF, and OCSVM, and "the author's implementation" for DTM and PIDForest. However, it does not specify version numbers for Python or any of these packages, which is necessary for reproducibility. |
| Experiment Setup | Yes | (i) BRDAD is our proposed algorithm, with details provided in Algorithm 2. The choice of B depends on the sample size: for n ∈ (0, 10,000], (10,000, 50,000], and (50,000, +∞), we set B = 1, 5, and 10, respectively. [...] (ii) Distance-To-Measure (DTM) (Gu et al., 2019) [...] the number of neighbors k is fixed to be k = 0.03 × sample size. [...] (v) Partial Identification Forest (PIDForest) (Gopalan et al., 2019) [...] with the number of trees T = 50, the number of buckets B = 5, and the depth of trees p = 10 suggested by the authors. |
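The bagging setup quoted above (choose B from the sample size, then randomly divide the data into B subsets of n/B or n/B + 1 samples) can be sketched as follows. This is a hypothetical illustration, not the authors' code; the function names `choose_B` and `random_partition` are invented for this sketch.

```python
import numpy as np

def choose_B(n):
    """Hypothetical helper: pick the number of bagging rounds B from the
    sample-size rule quoted in the paper (B = 1, 5, or 10)."""
    if n <= 10_000:
        return 1
    if n <= 50_000:
        return 5
    return 10

def random_partition(X, B, seed=None):
    """Randomly split the rows of X into B subsets whose sizes differ by
    at most one (n/B or n/B + 1 samples each), as described in the paper."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    return [X[part] for part in np.array_split(idx, B)]

# Toy usage: 23 samples, forced into B = 5 subsets for illustration.
X = np.random.default_rng(0).normal(size=(23, 3))
subsets = random_partition(X, 5, seed=0)
print([len(s) for s in subsets])  # subset sizes differ by at most one
```

With n = 23 and B = 5, `np.array_split` yields subsets of size 5 or 4, matching the paper's n/B-versus-n/B + 1 description; the per-subset distance computations could then run in parallel, which is the computational advantage the paper highlights.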