The Case for Learned Provenance-based System Behavior Baseline
Authors: Yao Zhu, Zhenyuan Li, Yangyang Wei, Shouling Ji
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation demonstrates the method's accuracy and adaptability in anomaly path mining, significantly advancing the state-of-the-art in handling and analyzing provenance graphs for anomaly detection. Our comprehensive evaluation, conducted on large-scale and open-source datasets, confirms the effectiveness and efficiency of our provenance graph embedding method. The results highlight its accuracy and adaptability in real-time anomaly path mining tasks, demonstrating its potential to significantly enhance anomaly detection capabilities. Section 4. Experiments. |
| Researcher Affiliation | Academia | College of Computer Science and Technology, Zhejiang University, Hangzhou, China; College of Software Technology, Zhejiang University, Ningbo, China. |
| Pseudocode | No | The paper describes methods and processes in text, such as the 'tag-propagation framework consists of four main stages: tag initialization, propagation, removal, and alert triggering', but does not present these as structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Available at https://github.com/AddoZhu/behavior_baseline |
| Open Datasets | Yes | In our experiments, we utilized datasets from the DARPA Transparent Computing (TC) dataset (tra, 2015.2), which contains millions of benign and hundreds of malicious events collected from platforms with diverse background activities, providing provenance-rich data capturing system events and dependencies over time. This dataset includes a series of realistically simulated Advanced Persistent Threats (APT), such as malware execution, privilege escalation, remote exploitation, and data exfiltration. We primarily use the E3-CADETS dataset, constructing a training dataset with 1,042k system events and a testing dataset with 26k events. Additionally, we demonstrate the adaptability of our method on other datasets in Appendix D.3. |
| Dataset Splits | Yes | We primarily use the E3-CADETS dataset, constructing a training dataset with 1,042k system events and a testing dataset with 26k events. |
| Hardware Specification | Yes | We conducted all experiments on an Ubuntu 18.04.6 LTS server with an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 251 GiB of memory, and three NVIDIA GeForce RTX 3090 GPUs. |
| Software Dependencies | No | We utilize the TensorFlow library to construct neural network models, enabling flexible modifications of layer configurations to implement various architectures such as MLP, LSTM, and CNN. This approach facilitates the evaluation of different machine learning models in the provenance graph embedding and anomaly path mining task. |
| Experiment Setup | Yes | Through controlled variable experiments, we ultimately employed the Adam optimizer with a learning rate of 0.001, and 200 training epochs for each model configuration. We evaluated the learned model's performance based on the prediction accuracy of event regular scores. A prediction is considered true if the difference between the predicted regular score and the true score is less than a threshold of 0.2, which is selected because the frequencies of negative samples we constructed are generally below this value. For the anomaly path mining task, we use path-level precision, recall, and F1 score as evaluation metrics. In the comparative experiments, due to the varying detection granularity across methods, we adopt node-level metrics for consistency. ... we use an L1 kernel regularizer with a coefficient of 0.001, the Adam optimizer (learning rate = 0.001), and the Mean Squared Error (MSE) loss function. To study the impacts of batch sizes, we use different batch sizes from 32 to 2048 to train MLP models on the E3-CADETS dataset |
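The quoted setup defines a threshold-based accuracy criterion: a predicted "regular score" counts as correct when it lies within 0.2 of the ground-truth score. A minimal sketch of that metric, assuming scores are plain floats in [0, 1] (the function name `regular_score_accuracy` and the sample values are illustrative, not from the paper):

```python
import numpy as np

def regular_score_accuracy(pred, true, threshold=0.2):
    """Fraction of events whose predicted regular score falls within
    `threshold` of the ground-truth score (hypothetical helper; the
    0.2 default mirrors the threshold quoted in the setup)."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.mean(np.abs(pred - true) < threshold))

# Illustrative values: three of the four predictions fall inside the
# 0.2 band, so the accuracy is 0.75.
acc = regular_score_accuracy([0.10, 0.50, 0.90, 0.30],
                             [0.05, 0.45, 0.60, 0.25])
```

Under this criterion the training configuration quoted above (Adam at learning rate 0.001, MSE loss, L1 kernel regularizer of 0.001) would be tuned to maximize this fraction on held-out events.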