CyberThreat-Eval: Can Large Language Models Automate Real-World Threat Research?

Authors: Xiangsen Chen, Xuan Feng, Shuo Chen, Matthieu Maitre, Sudipto Rakshit, Diana Duvieilh, Ashley Picone, Nan Tang

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our evaluation using this benchmark reveals important insights into the limitations of current LLMs. For example, LLMs often lack the nuanced expertise required to handle complex details and struggle to distinguish between correct and incorrect information. ... We assess four LLMs: two base models (GPT-4o, o3-mini) and two models fine-tuned on a CTI-specific corpus. ... This section presents the detailed experimental results in our comprehensive threat research benchmark for each stage of the workflow."
Researcher Affiliation | Collaboration | "Xiangsen Chen 1,3, Xuan Feng 1, Shuo Chen 1, Matthieu Maitre 2, Sudipto Rakshit 2, Diana Duvieilh 2, Ashley Picone 2, Nan Tang 3,4. 1 Microsoft Research; 2 Microsoft; 3 Hong Kong University of Science and Technology (Guangzhou); 4 Hong Kong University of Science and Technology"
Pseudocode | No | The paper describes the methodology in narrative form without presenting any explicit pseudocode blocks or algorithm listings.
Open Source Code | Yes | "The code of the CyberThreat-Eval benchmark is available at https://github.com/secintelligence/CyberThreat-Eval."
Open Datasets | Yes | "To address these issues, we introduce CyberThreat-Eval, which is collected from the daily CTI workflow of a world-leading company. This expert-annotated benchmark assesses LLMs on practical tasks across all three stages as mentioned above. ... We will release both CyberThreat-Eval and TRA to support the community in advancing analyst-oriented CTI automation."
Dataset Splits | Yes | "For rigorous monitoring of the fine-tuning process, the aggregated dataset is partitioned into training (79%), validation (1%), and testing (20%) sets."
Hardware Specification | No | The paper mentions the LLMs used (GPT-4o, o3-mini) but does not provide any details about the hardware (e.g., GPU or CPU models, memory) used to run the experiments or train the fine-tuned models.
Software Dependencies | No | The paper describes the use of LLMs (GPT-4o, o3-mini) and fine-tuning but does not specify software dependencies such as programming-language, library, or framework versions.
Experiment Setup | Yes | "All models operate at the temperature of 0.01 and seed as 42 for consistency."
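The paper reports a 79%/1%/20% train/validation/test partition of the aggregated fine-tuning dataset. The authors' splitting code is not published; the sketch below is a minimal, hypothetical illustration of such a partition using only the Python standard library (the function name, record type, and seed are illustrative assumptions, not taken from the paper).

```python
import random

def partition(records, train_frac=0.79, val_frac=0.01, seed=42):
    """Shuffle records deterministically and split them into
    train/validation/test sets.

    Fractions mirror the 79%/1%/20% split reported in the paper;
    the seed and helper name are illustrative, not the authors' code.
    """
    rng = random.Random(seed)          # fixed seed -> reproducible shuffle
    shuffled = records[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

train, val, test = partition(list(range(1000)))
print(len(train), len(val), len(test))  # 790 10 200
```

Assigning the remainder to the test set guarantees that every record lands in exactly one split even when the fractions do not divide the dataset size evenly.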
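The experiment setup fixes temperature at 0.01 and the seed at 42 for all models. A minimal sketch of wiring those decoding settings into every request, assuming an OpenAI-style chat-completions client (the model name, prompt, and helper function are placeholders, not from the paper):

```python
# Decoding settings reported in the paper: near-greedy sampling plus a
# fixed seed so repeated runs produce consistent outputs.
DECODING_PARAMS = {"temperature": 0.01, "seed": 42}

def build_request(model, prompt):
    """Assemble a chat-completion request that always carries the
    fixed decoding settings (hypothetical helper, for illustration)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **DECODING_PARAMS,
    }

req = build_request("gpt-4o", "Summarize this threat report: ...")

# The request dict could then be passed to an OpenAI-style client, e.g.:
# from openai import OpenAI
# response = OpenAI().chat.completions.create(**req)
```

Centralizing the parameters in one dict keeps every evaluation call consistent, which is the point of fixing temperature and seed in the first place.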