Conformal Anomaly Detection in Event Sequences
Authors: Shuai Zhang, Chuan Zhou, Yang Liu, Peng Zhang, Xixun Lin, Shirui Pan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct extensive experiments to evaluate our CADES method, including GOF test for SPP (Section 4.1), anomaly detection in synthetic data (Section 4.2) and real-world data (Section 4.3), and FPR control (Section 4.4). In addition, we perform ablation studies to verify the effectiveness of combining two proposed scores and using two-sided p-values in Section 4.5. Appendix D contains runtime comparisons and additional experimental results. |
| Researcher Affiliation | Academia | (1) Academy of Mathematics and Systems Science, Chinese Academy of Sciences; (2) School of Cyber Security, University of Chinese Academy of Sciences; (3) Cyberspace Institute of Advanced Technology, Guangzhou University; (4) Institute of Information Engineering, Chinese Academy of Sciences; (5) Griffith University. |
| Pseudocode | Yes | Algorithm 1 CADES: Conformal Anomaly Detection in Event Sequences |
| Open Source Code | Yes | The code is publicly available at https://github.com/Zh-Shuai/CADES. |
| Open Datasets | Yes | We consider nine choices for the alternative distribution: (1) Decreasing Rate, (2) Increasing Rate, (3) Inhomogeneous Poisson, (4) Stopping, (5) Renewal A, (6) Renewal B, (7) Hawkes, (8) Self Correcting, and (9) Uniform. ... We consider four synthetic datasets introduced in (Shchur et al., 2021). ... We further evaluate CADES on two benchmark real-world datasets, LOGS and STEAD, both introduced in (Shchur et al., 2021). |
| Dataset Splits | Yes | We randomly split the dataset D into two disjoint subsets of equal size: a training set Dtrain and a calibration set Dcal. LOGS: D, DID test, and DOOD test consist of 1668 ID sequences, 502 ID sequences, and 22 OOD sequences per each failure injection scenario, respectively. STEAD: D, DID test, and DOOD test consist of 4000 ID sequences, 1000 ID sequences, and 1000 OOD sequences per each remaining location, respectively. |
| Hardware Specification | Yes | All experiments in this paper are conducted on an NVIDIA RTX 3090 Ti GPU using PyTorch. |
| Software Dependencies | No | All experiments in this paper are conducted on an NVIDIA RTX 3090 Ti GPU using PyTorch. For kernel density estimation (KDE) in our test procedure, we use scipy.stats.gaussian_kde with the parameter bw_method set to h1 for s_arr(X) and h2 for s_int(X). |
| Experiment Setup | Yes | The inter-event time distribution is parameterized with a mixture of 8 Weibull distributions, the mark embedding size is set to 32, and the RNN hidden size is set to 64 for all experiments. The batch size is set to 64, the optimizer is Adam with a learning rate of 10^-3, and the L2 norm of the gradient is clipped to 5. We set the maximum number of epochs to 500 and perform early stopping if the training loss does not improve for 5 epochs. |
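The dataset-split row reports a random, disjoint, equal-size division of D into a training set and a calibration set. A minimal sketch of that split, assuming NumPy, an illustrative seed, and the LOGS sequence count of 1668:

```python
import numpy as np

# Hedged sketch: D is randomly split into two disjoint halves of equal
# size, D_train and D_cal. The seed and n (1668, the LOGS ID-sequence
# count quoted above) are illustrative assumptions.
rng = np.random.default_rng(0)
n = 1668
idx = rng.permutation(n)
train_idx, cal_idx = idx[: n // 2], idx[n // 2 :]

print(len(train_idx), len(cal_idx))  # equal halves, 834 each
```

Because the two index sets come from one permutation, disjointness and equal size are guaranteed by construction.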
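The software-dependencies row names `scipy.stats.gaussian_kde` with `bw_method` set to bandwidths h1 and h2. A minimal sketch of that call, where the calibration scores and the bandwidth value are made-up placeholders rather than the paper's values:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hedged sketch of the KDE step: smooth a set of calibration scores with
# scipy.stats.gaussian_kde, with bw_method standing in for the paper's
# bandwidth h1 (for s_arr(X)). Scores and bandwidth are illustrative.
rng = np.random.default_rng(0)
cal_scores = rng.normal(size=500)   # placeholder calibration scores
h1 = 0.2                            # hypothetical bandwidth for s_arr(X)
kde = gaussian_kde(cal_scores, bw_method=h1)
density = kde(np.linspace(-3.0, 3.0, 7))  # evaluate the fitted density
```

Passing a scalar to `bw_method` scales the kernel covariance directly, which is how a fixed bandwidth like h1 or h2 would be plugged in.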
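The research-type row mentions an ablation on two-sided p-values. A generic two-sided conformal p-value construction (twice the smaller one-sided rank, capped at 1) can be sketched as follows; this is a standard textbook form, not necessarily the paper's exact definition:

```python
import numpy as np

def two_sided_conformal_p(test_score, cal_scores):
    """Generic two-sided conformal p-value: twice the smaller of the
    two one-sided conformal p-values, capped at 1. Illustrative only;
    CADES's precise construction may differ."""
    n = len(cal_scores)
    p_right = (1 + np.sum(cal_scores >= test_score)) / (n + 1)
    p_left = (1 + np.sum(cal_scores <= test_score)) / (n + 1)
    return min(1.0, 2.0 * min(p_left, p_right))

scores = np.arange(1, 100, dtype=float)  # toy calibration scores
print(two_sided_conformal_p(1000.0, scores))  # extreme score -> small p
print(two_sided_conformal_p(50.0, scores))    # central score -> p near 1
```

The two-sided form flags a test sequence whose score is atypically large *or* atypically small relative to the calibration scores, which is the usual motivation for preferring it over a one-sided p-value in anomaly detection.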