CyberPal.AI: Empowering LLMs with Expert-Driven Cybersecurity Instructions
Authors: Matan Levi, Yair Allouche, Daniel Ohayon, Anton Puzanov
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations demonstrate a significant average improvement of up to 24% over the baseline models, underscoring the benefits of our expert-driven instruction dataset generation process. Overall, our fine-tuned models achieved a substantial average improvement of 18-24% across all CTI evaluation datasets. |
| Researcher Affiliation | Collaboration | Matan Levi (1,2), Yair Allouche (1), Daniel Ohayon (1), Anton Puzanov (1) — (1) IBM Research, (2) Ben-Gurion University |
| Pseudocode | No | The paper describes processes in narrative steps (e.g., 'Our processing consists of the following four steps:') and provides flowcharts (Figure 1), but does not present any explicitly labeled 'Pseudocode' or structured algorithm blocks. |
| Open Source Code | No | The paper mentions using open-source models and public benchmarks, and refers to URLs for external resources (e.g., 'https://github.com/SigmaHQ/sigma', 'https://github.com/Ebazhanov/linkedin-skill-assessments-quizzes', 'https://github.com/XuanwuAI/SecEval'), but does not provide a specific link or explicit statement about releasing the source code for the methodology described in this paper. |
| Open Datasets | Yes | We used the following public multi-choice tasks: CISSP Assessment Questions, MMLU Computer Security (SecMMLU) (Hendrycks et al. 2020), Cybersecurity Skill Assessment, CyberMetric (Tihanyi et al. 2024), Cyber Threat Intelligence Multiple Choice Questions (CTI-MCQ) (Alam et al. 2024), and SecEval (Li et al. 2023). |
| Dataset Splits | No | To ensure no data contamination between the fine-tuning and testing phases, we partitioned the raw documents into train and test sets, such that the model did not encounter any test-related documents during fine-tuning. However, no specific percentages, absolute counts, or methodology for this partitioning are provided for reproducibility. |
| Hardware Specification | No | The paper discusses training details such as learning rates, context length, and effective batch size, but does not provide any specific hardware details like GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper mentions the use of various large language models (Mixtral, Mistral-Large-Instruct, Llama-3 instruct 8B, Mistral instruct 7B v0.3, and Phi-3-medium-4k-instruct) but does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We employ a learning rate of 4e-5 for Llama and Phi, and 3e-5 for Mistral. Additionally, we employ linear warm-up for 125 steps. The context length is set to 4096, and an effective batch size of 2048 is achieved using gradient accumulation. Based on our empirical findings, we observed that beyond 2 epochs, additional epochs have negligible impact on the final loss before the model starts to overfit. |
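The reported hyperparameters can be sketched as a configuration. This is a minimal illustration, not the authors' code; the per-device batch size and device count below are assumptions (the paper reports no hardware details), and only the effective batch size of 2048 is given.

```python
def grad_accum_steps(effective_batch: int, per_device_batch: int, num_devices: int) -> int:
    """Gradient-accumulation steps needed to reach the target effective batch size."""
    per_step = per_device_batch * num_devices
    assert effective_batch % per_step == 0, "effective batch must divide evenly"
    return effective_batch // per_step

# Hyperparameters as reported in the paper's experiment-setup section.
config = {
    "learning_rate": {"llama": 4e-5, "phi": 4e-5, "mistral": 3e-5},
    "warmup_steps": 125,            # linear warm-up
    "context_length": 4096,
    "effective_batch_size": 2048,   # reached via gradient accumulation
    "epochs": 2,                    # more epochs had negligible effect on loss
}

# Hypothetical example: 8 devices at a per-device batch of 4
# would require 2048 / (4 * 8) = 64 accumulation steps.
steps = grad_accum_steps(config["effective_batch_size"], per_device_batch=4, num_devices=8)
print(steps)  # -> 64
```

The helper only illustrates the arithmetic behind "effective batch size of 2048 is achieved using gradient accumulation"; actual per-device batch sizes would depend on the unreported hardware.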