CyberPal.AI: Empowering LLMs with Expert-Driven Cybersecurity Instructions
Authors: Matan Levi, Yair Allouche, Daniel Ohayon, Anton Puzanov
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations demonstrate a significant average improvement of up to 24% over the baseline models, underscoring the benefits of our expert-driven instruction dataset generation process. Overall, our fine-tuned models achieved a substantial average improvement of 18-24% across all CTI evaluation datasets. |
| Researcher Affiliation | Collaboration | Matan Levi (1,2), Yair Allouche (1), Daniel Ohayon (1), Anton Puzanov (1) — (1) IBM Research, (2) Ben-Gurion University |
| Pseudocode | No | The paper describes processes in narrative steps (e.g., 'Our processing consists of the following four steps:') and provides flowcharts (Figure 1), but does not present any explicitly labeled 'Pseudocode' or structured algorithm blocks. |
| Open Source Code | No | The paper mentions using open-source models and public benchmarks, and refers to URLs for external resources (e.g., 'https://github.com/SigmaHQ/sigma', 'https://github.com/Ebazhanov/linkedin-skill-assessments-quizzes', 'https://github.com/XuanwuAI/SecEval'), but does not provide a specific link or explicit statement about releasing the source code for the methodology described in this paper. |
| Open Datasets | Yes | We used the following public multi-choice tasks: CISSP Assessment Questions, MMLU Computer Security (SecMMLU) (Hendrycks et al. 2020), Cybersecurity Skill Assessment, CyberMetric (Tihanyi et al. 2024), Cyber Threat Intelligence Multiple Choice Questions (CTI-MCQ) (Alam et al. 2024), and SecEval (Li et al. 2023). |
| Dataset Splits | No | To ensure no data contamination between the fine-tuning and testing phases, we partitioned the raw documents into train and test sets, such that the model did not encounter any test-related documents during fine-tuning. However, no specific percentages, absolute counts, or methodology for this partitioning are provided for reproducibility. |
| Hardware Specification | No | The paper discusses training details such as learning rates, context length, and effective batch size, but does not provide any specific hardware details like GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper mentions the use of various large language models (Mixtral, Mistral-Large-Instruct, Llama-3 instruct 8B, Mistral instruct 7B v0.3, and Phi-3-medium-4k-instruct) but does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We employ a learning rate of 4e-5 for Llama and Phi, and 3e-5 for Mistral. Additionally, we employ linear warm-up for 125 steps. The context length is set to 4096, and an effective batch size of 2048 is achieved using gradient accumulation. Based on our empirical findings, we observed that beyond 2 epochs, additional epochs have negligible impact on the final loss before the model starts to overfit. |
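The reported hyperparameters can be sketched as a configuration. This is a minimal illustration, not the authors' code; the per-device batch size and device count below are assumptions (the paper reports no hardware details), and only the effective batch size of 2048 is given.

```python
def grad_accum_steps(effective_batch: int, per_device_batch: int, num_devices: int) -> int:
    """Gradient-accumulation steps needed to reach the target effective batch size."""
    per_step = per_device_batch * num_devices
    assert effective_batch % per_step == 0, "effective batch must divide evenly"
    return effective_batch // per_step

# Hyperparameters as reported in the paper's experiment-setup section.
config = {
    "learning_rate": {"llama": 4e-5, "phi": 4e-5, "mistral": 3e-5},
    "warmup_steps": 125,            # linear warm-up
    "context_length": 4096,
    "effective_batch_size": 2048,   # reached via gradient accumulation
    "epochs": 2,                    # more epochs had negligible effect on loss
}

# Hypothetical example: 8 devices at a per-device batch of 4
# would require 2048 / (4 * 8) = 64 accumulation steps.
steps = grad_accum_steps(config["effective_batch_size"], per_device_batch=4, num_devices=8)
print(steps)  # -> 64
```

The helper only illustrates the arithmetic behind "effective batch size of 2048 is achieved using gradient accumulation"; actual per-device batch sizes would depend on the unreported hardware.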