Unifying Unsupervised Graph-Level Anomaly Detection and Out-of-Distribution Detection: A Benchmark

Authors: Yili Wang, Yixin Liu, Xu Shen, Chenyu Li, Rui Miao, Kaize Ding, Ying Wang, Shirui Pan, Xin Wang

ICLR 2025

Reproducibility assessment (each variable is listed with its result, followed by the supporting LLM response):
Research Type: Experimental
To bridge the gap, in this work, we present a Unified Benchmark for unsupervised Graph-level OOD and anomaLy Detection (UB-GOLD), a comprehensive evaluation framework that unifies GLAD and GLOD under the concept of generalized graph-level OOD detection. Our benchmark encompasses 35 datasets spanning four practical anomaly and OOD detection scenarios, facilitating the comparison of 18 representative GLAD/GLOD methods. We conduct multi-dimensional analyses to explore the effectiveness, OOD sensitivity spectrum, robustness, and efficiency of existing methods, shedding light on their strengths and limitations.
Researcher Affiliation: Academia
Jilin University, Griffith University, Northwestern University
Pseudocode: No
The paper describes various algorithms (graph kernel with detector, self-supervised learning with detector, GNN-based GLAD methods, GNN-based GLOD methods) in Section 3.2 and Appendix C, but it does not provide any structured pseudocode or algorithm blocks. The methods are described in paragraph form or as categorized lists.
Open Source Code: Yes
Furthermore, we provide an open-source codebase (https://github.com/UB-GOLD/UB-GOLD) of UB-GOLD to foster reproducible research and outline potential directions for future investigations based on our insights.
Open Datasets: Yes
Our benchmark encompasses 35 datasets spanning four practical anomaly and OOD detection scenarios, facilitating the comparison of 18 representative GLAD/GLOD methods. ... Our datasets are publicly available and include TUDataset, OGB, TOX21, DrugOOD, and GOOD. Among them, TUDataset (Morris et al., 2020), OGB (Hu et al., 2020), and TOX21 (Abdelaziz et al., 2016) are licensed under the MIT License. DrugOOD (Ji et al., 2023) is licensed under the GNU General Public License 3.0. GOOD (Gui et al., 2022) is licensed under GPL-3.0.
Dataset Splits: Yes
Data split. In our target scenarios (i.e., unsupervised GLAD/GLOD), all the samples in the training set are normal/ID, while the anomaly/OOD samples only occur in the testing set. In such an unsupervised case, the validation set with anomaly/OOD samples is usually unavailable during the training phase. Thus, following the implementation of OpenOOD (Zhang et al., 2023), we divide the datasets into training and testing sets, without using a validation set. Specifically, we adopted the splits from (Liu et al., 2023a) and (Li et al., 2022), applying them to the benchmark datasets. Detailed splits are provided in Table 1.
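The split protocol quoted above (training set contains only normal/ID graphs; anomalies/OOD samples appear only at test time) can be sketched as follows. This is a minimal illustration, not the benchmark's actual code; the convention that label 1 marks an anomalous/OOD graph and the `train_frac` value are assumptions for the example.

```python
import random

def unsupervised_split(labels, train_frac=0.8, seed=0):
    """Split indices so that the training set is purely normal/ID (label 0).

    The test set mixes the held-out normal graphs with ALL anomalous/OOD
    graphs, matching the unsupervised GLAD/GLOD setting (no validation set).
    """
    rng = random.Random(seed)
    normal = [i for i, y in enumerate(labels) if y == 0]
    anomalous = [i for i, y in enumerate(labels) if y == 1]
    rng.shuffle(normal)
    n_train = int(train_frac * len(normal))
    train_idx = normal[:n_train]
    test_idx = normal[n_train:] + anomalous
    return train_idx, test_idx

# Toy example: 10 normal graphs, 3 anomalies.
labels = [0] * 10 + [1] * 3
train_idx, test_idx = unsupervised_split(labels)
```

Note that because no anomaly-containing validation set exists, model selection cannot rely on anomaly labels during training, which is exactly why the benchmark forgoes a validation split.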
Hardware Specification: Yes
All our experiments were carried out on a Linux server with an Intel(R) Xeon(R) Gold 5120 2.20GHz CPU, 160GB RAM, and an NVIDIA A40 GPU with 48GB of memory.
Software Dependencies: Yes
This toolkit is built on top of PyTorch 2.0.1 (Paszke et al., 2019), torch_geometric 2.4.0 (Fey & Lenssen, 2019), and DGL 2.1.0 (Wang et al., 2019). We implement graph kernel methods with the DGL library. All other models are unified using the torch_geometric library. GCL and IG are included via the PyGCL library (Zhu et al., 2021).
Experiment Setup: Yes
Hyperparameter search. To obtain the performance upper bounds of various methods on GLAD/GLOD tasks, we conduct a random search to find the optimal hyperparameters w.r.t. their performance on the testing set. The search space is detailed in Table 4. The random search is conducted 20 times or for a maximum of one day per method per dataset to ensure fairness.
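The search protocol described above (sample random configurations, stopping after 20 trials or a one-day wall-clock budget, and keep the best-scoring one) can be sketched as below. The `search_space` contents and the `evaluate` callable are illustrative stand-ins for the paper's Table 4 and a real training-plus-evaluation run, not the benchmark's actual API.

```python
import random
import time

# Illustrative search space; the real one is given in Table 4 of the paper.
search_space = {"lr": [1e-2, 1e-3, 1e-4], "hidden_dim": [16, 32, 64]}

def random_search(evaluate, space, n_trials=20, budget_sec=24 * 3600, seed=0):
    """Random hyperparameter search with a trial cap and a wall-clock cap."""
    rng = random.Random(seed)
    start = time.time()
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        if time.time() - start > budget_sec:
            break  # one-day cap per method per dataset
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(cfg)  # e.g. test AUROC of the trained detector
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy evaluator standing in for a full train-and-test run.
cfg, score = random_search(lambda c: c["hidden_dim"] / 64, search_space)
```

Note that, as the quote states, configurations are scored on the testing set directly, so the reported numbers are performance upper bounds rather than estimates of deployment performance.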