IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities
Authors: Ziyang Li, Saikat Dutta, Mayur Naik
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For evaluation, we curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. A state-of-the-art static analysis tool, CodeQL, detects only 27 of these vulnerabilities whereas IRIS with GPT-4 detects 55 (+28) and improves upon CodeQL's average false discovery rate by 5 percentage points. Furthermore, IRIS identifies 4 previously unknown vulnerabilities which cannot be found by existing tools. |
| Researcher Affiliation | Academia | Ziyang Li (University of Pennsylvania); Saikat Dutta (Cornell University); Mayur Naik (University of Pennsylvania) |
| Pseudocode | No | The paper does not contain any sections explicitly labeled as "Pseudocode" or "Algorithm". While it includes CodeQL scripts and LLM prompts that resemble code, they are not presented as general pseudocode for the IRIS framework's logical steps. |
| Open Source Code | Yes | IRIS is available publicly at https://github.com/iris-sast/iris. |
| Open Datasets | Yes | We curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. ... The dataset and the corresponding scripts to fetch, build, and analyze the Java projects are available publicly at https://github.com/iris-sast/cwe-bench-java. |
| Dataset Splits | No | The paper describes a dataset (CWE-Bench-Java) used for evaluation, but it does not specify explicit training, validation, or test splits. The evaluation is performed on the entire dataset as a benchmark for detecting vulnerabilities, and LLMs are used for inference rather than being trained on specific splits of this dataset. |
| Hardware Specification | Yes | To run the open-source LLMs we use two groups of machines: a 2.50GHz Intel Xeon machine, with 40 CPUs, four GeForce RTX 2080 Ti GPUs, and 750GB RAM, and another 3.00GHz Intel Xeon machine with 48 CPUs, 8 A100 GPUs, and 1.5TB RAM. |
| Software Dependencies | Yes | We select two closed-source LLMs from OpenAI: GPT-4 (gpt-4-0125-preview) and GPT-3.5 (gpt-3.5-turbo-0125) for our evaluation. We also select instruction-tuned versions of six state-of-the-art open-source LLMs via the HuggingFace API: Llama 3 8B and 70B, Qwen-2.5-Coder (Q2.5C) 32B, Gemma-2 (G2) 27B, and DeepSeek Coder (DSC) 7B. ... We use CodeQL version 2.15.3 as the backbone of our static analysis. |
| Experiment Setup | Yes | In terms of few-shot examples passed to labeling external APIs, we use 4 examples for CWE-22, 3 examples for CWE-78, 3 examples for CWE-79, and 3 examples for CWE-94. We use a temperature of 0, maximum tokens of 2048, and top-p of 1 for inference with all the LLMs. For GPT-3.5 and GPT-4, we also fix a seed to mitigate randomness as much as possible. ... We use batch sizes of 20 and 30 for internal and external, respectively. ... We chose 5 lines as a balanced approach to provide sufficient context while managing performance and cost. ... In our experiments, we observe that setting S to 10 provides a good balance between the cost and accuracy. |