IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities

Authors: Ziyang Li, Saikat Dutta, Mayur Naik

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For evaluation, we curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. A state-of-the-art static analysis tool, CodeQL, detects only 27 of these vulnerabilities, whereas IRIS with GPT-4 detects 55 (+28) and improves upon CodeQL's average false discovery rate by 5 percentage points. Furthermore, IRIS identifies 4 previously unknown vulnerabilities which cannot be found by existing tools.
Researcher Affiliation | Academia | Ziyang Li, University of Pennsylvania; Saikat Dutta, Cornell University; Mayur Naik, University of Pennsylvania
Pseudocode | No | The paper does not contain any sections explicitly labeled as "Pseudocode" or "Algorithm". While it includes CodeQL scripts and LLM prompts that resemble code, they are not presented as general pseudocode for the IRIS framework's logical steps.
Open Source Code | Yes | IRIS is available publicly at https://github.com/iris-sast/iris.
Open Datasets | Yes | We curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. ... The dataset and the corresponding scripts to fetch, build, and analyze the Java projects are available publicly at https://github.com/iris-sast/cwe-bench-java.
Dataset Splits | No | The paper describes the CWE-Bench-Java dataset used for evaluation, but it does not specify training, validation, or test splits. The entire dataset serves as a benchmark for detecting vulnerabilities, and the LLMs are used for inference rather than being trained on splits of this dataset.
Hardware Specification | Yes | To run the open-source LLMs, we use two groups of machines: a 2.50GHz Intel Xeon machine with 40 CPUs, four GeForce RTX 2080 Ti GPUs, and 750GB RAM, and another 3.00GHz Intel Xeon machine with 48 CPUs, eight A100 GPUs, and 1.5TB RAM.
Software Dependencies | Yes | We select two closed-source LLMs from OpenAI: GPT-4 (gpt-4-0125-preview) and GPT-3.5 (gpt-3.5-turbo-0125) for our evaluation. We also select instruction-tuned versions of six state-of-the-art open-source LLMs via the Hugging Face API: Llama 3 8B and 70B, Qwen-2.5-Coder (Q2.5C) 32B, Gemma-2 (G2) 27B, and DeepSeek Coder (DSC) 7B. ... We use CodeQL version 2.15.3 as the backbone of our static analysis.
Experiment Setup | Yes | In terms of few-shot examples passed to labeling external APIs, we use 4 examples for CWE-22, 3 examples for CWE-78, 3 examples for CWE-79, and 3 examples for CWE-94. We use a temperature of 0, a maximum of 2048 tokens, and a top-p of 1 for inference with all the LLMs. For GPT-3.5 and GPT-4, we also fix a seed to mitigate randomness as much as possible. ... We use batch sizes of 20 and 30 for internal and external APIs, respectively. ... We chose 5 lines as a balanced approach to provide sufficient context while managing performance and cost. ... In our experiments, we observe that setting S to 10 provides a good balance between the cost and accuracy.
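The concrete values quoted in the Software Dependencies and Experiment Setup rows can be collected into a single configuration. The sketch below is illustrative only: the parameter names mirror the OpenAI chat-completions API, and this exact structure (and the fact that the seed value is unspecified in the paper) is an assumption, not taken from the IRIS codebase.

```python
# Hedged sketch of the reported setup, expressed as plain Python constants.
# All numeric values come from the quoted excerpts above; everything else
# (names, structure) is an illustrative assumption.

CODEQL_VERSION = "2.15.3"  # static-analysis backbone

OPENAI_MODELS = {
    "GPT-4": "gpt-4-0125-preview",
    "GPT-3.5": "gpt-3.5-turbo-0125",
}

# Decoding settings reported for inference with all LLMs.
GENERATION_CONFIG = {
    "temperature": 0,    # deterministic decoding to reduce run-to-run variance
    "max_tokens": 2048,  # cap on response length
    "top_p": 1,          # nucleus truncation disabled
}
# (A fixed seed is also used for GPT-3.5/GPT-4; its value is not reported.)

# Few-shot example counts per CWE for labeling external APIs.
FEW_SHOT_EXAMPLES = {"CWE-22": 4, "CWE-78": 3, "CWE-79": 3, "CWE-94": 3}

# Batch sizes for internal vs. external candidate APIs.
BATCH_SIZE = {"internal": 20, "external": 30}
```

Pinning these values in one place makes it straightforward to check a reproduction attempt against the paper's stated configuration.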