CyberThreat-Eval: Can Large Language Models Automate Real-World Threat Research?

Authors: Xiangsen Chen, Xuan Feng, Shuo Chen, Matthieu Maitre, Sudipto Rakshit, Diana Duvieilh, Ashley Picone, Nan Tang

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our evaluation using this benchmark reveals important insights into the limitations of current LLMs. For example, LLMs often lack the nuanced expertise required to handle complex details and struggle to distinguish between correct and incorrect information. ... We assess four LLMs: two base models (GPT-4o, o3-mini) and two models fine-tuned on a CTI-specific corpus. ... This section presents the detailed experimental results in our comprehensive threat research benchmark for each stage of the workflow."
Researcher Affiliation | Collaboration | "Xiangsen Chen 1,3, Xuan Feng 1, Shuo Chen 1, Matthieu Maitre 2, Sudipto Rakshit 2, Diana Duvieilh 2, Ashley Picone 2, Nan Tang 3,4. 1 Microsoft Research; 2 Microsoft; 3 Hong Kong University of Science and Technology (Guangzhou); 4 Hong Kong University of Science and Technology"
Pseudocode | No | The paper describes the methodology in narrative form without presenting any explicit pseudocode blocks or algorithm listings.
Open Source Code | Yes | "The code of the CyberThreat-Eval benchmark is available at https://github.com/secintelligence/CyberThreat-Eval."
Open Datasets | Yes | "To address these issues, we introduce CyberThreat-Eval, which is collected from the daily CTI workflow of a world-leading company. This expert-annotated benchmark assesses LLMs on practical tasks across all three stages as mentioned above. ... We will release both CyberThreat-Eval and TRA to support the community in advancing analyst-oriented CTI automation."
Dataset Splits | Yes | "For rigorous monitoring of the fine-tuning process, the aggregated dataset is partitioned into training (79%), validation (1%), and testing (20%) sets."
Hardware Specification | No | The paper mentions the LLMs used (GPT-4o, o3-mini) but does not provide any details about the hardware (e.g., GPU or CPU models, memory) used to run the experiments or train the fine-tuned models.
Software Dependencies | No | The paper describes the use of LLMs (GPT-4o, o3-mini) and fine-tuning but does not specify software dependencies such as programming-language, library, or framework versions.
Experiment Setup | Yes | "All models operate at the temperature of 0.01 and seed as 42 for consistency."
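The paper reports a 79%/1%/20% train/validation/test partition of the aggregated fine-tuning dataset. The authors' splitting code is not published; the sketch below is a minimal, hypothetical illustration of such a partition using only the Python standard library (the function name, record type, and seed are illustrative assumptions, not taken from the paper).

```python
import random

def partition(records, train_frac=0.79, val_frac=0.01, seed=42):
    """Shuffle records deterministically and split them into
    train/validation/test sets.

    Fractions mirror the 79%/1%/20% split reported in the paper;
    the seed and helper name are illustrative, not the authors' code.
    """
    rng = random.Random(seed)          # fixed seed -> reproducible shuffle
    shuffled = records[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

train, val, test = partition(list(range(1000)))
print(len(train), len(val), len(test))  # 790 10 200
```

Assigning the remainder to the test set guarantees that every record lands in exactly one split even when the fractions do not divide the dataset size evenly.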
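The experiment setup fixes temperature at 0.01 and the seed at 42 for all models. A minimal sketch of wiring those decoding settings into every request, assuming an OpenAI-style chat-completions client (the model name, prompt, and helper function are placeholders, not from the paper):

```python
# Decoding settings reported in the paper: near-greedy sampling plus a
# fixed seed so repeated runs produce consistent outputs.
DECODING_PARAMS = {"temperature": 0.01, "seed": 42}

def build_request(model, prompt):
    """Assemble a chat-completion request that always carries the
    fixed decoding settings (hypothetical helper, for illustration)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **DECODING_PARAMS,
    }

req = build_request("gpt-4o", "Summarize this threat report: ...")

# The request dict could then be passed to an OpenAI-style client, e.g.:
# from openai import OpenAI
# response = OpenAI().chat.completions.create(**req)
```

Centralizing the parameters in one dict keeps every evaluation call consistent, which is the point of fixing temperature and seed in the first place.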