IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities

Authors: Ziyang Li, Saikat Dutta, Mayur Naik

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For evaluation, we curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. A state-of-the-art static analysis tool, CodeQL, detects only 27 of these vulnerabilities, whereas IRIS with GPT-4 detects 55 (+28) and improves upon CodeQL's average false discovery rate by 5 percentage points. Furthermore, IRIS identifies 4 previously unknown vulnerabilities which cannot be found by existing tools.
Researcher Affiliation | Academia | Ziyang Li, University of Pennsylvania; Saikat Dutta, Cornell University; Mayur Naik, University of Pennsylvania
Pseudocode | No | The paper does not contain any sections explicitly labeled as "Pseudocode" or "Algorithm". While it includes CodeQL scripts and LLM prompts that resemble code, they are not presented as general pseudocode for the IRIS framework's logical steps.
Open Source Code | Yes | IRIS is available publicly at https://github.com/iris-sast/iris.
Open Datasets | Yes | We curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. ... The dataset and the corresponding scripts to fetch, build, and analyze the Java projects are available publicly at https://github.com/iris-sast/cwe-bench-java.
Dataset Splits | No | The paper describes the CWE-Bench-Java dataset used for evaluation, but it does not specify training, validation, or test splits. The entire dataset serves as a benchmark for detecting vulnerabilities, and the LLMs are used for inference rather than being trained on splits of this dataset.
Hardware Specification | Yes | To run the open-source LLMs, we use two groups of machines: a 2.50GHz Intel Xeon machine with 40 CPUs, four GeForce RTX 2080 Ti GPUs, and 750GB RAM, and another 3.00GHz Intel Xeon machine with 48 CPUs, eight A100 GPUs, and 1.5TB RAM.
Software Dependencies | Yes | We select two closed-source LLMs from OpenAI: GPT-4 (gpt-4-0125-preview) and GPT-3.5 (gpt-3.5-turbo-0125) for our evaluation. We also select instruction-tuned versions of six state-of-the-art open-source LLMs via the Hugging Face API: Llama 3 8B and 70B, Qwen-2.5-Coder (Q2.5C) 32B, Gemma-2 (G2) 27B, and DeepSeek Coder (DSC) 7B. ... We use CodeQL version 2.15.3 as the backbone of our static analysis.
Experiment Setup | Yes | In terms of few-shot examples passed to labeling external APIs, we use 4 examples for CWE-22, 3 examples for CWE-78, 3 examples for CWE-79, and 3 examples for CWE-94. We use a temperature of 0, a maximum of 2048 tokens, and a top-p of 1 for inference with all the LLMs. For GPT-3.5 and GPT-4, we also fix a seed to mitigate randomness as much as possible. ... We use batch sizes of 20 and 30 for internal and external APIs, respectively. ... We chose 5 lines as a balanced approach to provide sufficient context while managing performance and cost. ... In our experiments, we observe that setting S to 10 provides a good balance between the cost and accuracy.
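The concrete values quoted in the Software Dependencies and Experiment Setup rows can be collected into a single configuration. The sketch below is illustrative only: the parameter names mirror the OpenAI chat-completions API, and this exact structure (and the fact that the seed value is unspecified in the paper) is an assumption, not taken from the IRIS codebase.

```python
# Hedged sketch of the reported setup, expressed as plain Python constants.
# All numeric values come from the quoted excerpts above; everything else
# (names, structure) is an illustrative assumption.

CODEQL_VERSION = "2.15.3"  # static-analysis backbone

OPENAI_MODELS = {
    "GPT-4": "gpt-4-0125-preview",
    "GPT-3.5": "gpt-3.5-turbo-0125",
}

# Decoding settings reported for inference with all LLMs.
GENERATION_CONFIG = {
    "temperature": 0,    # deterministic decoding to reduce run-to-run variance
    "max_tokens": 2048,  # cap on response length
    "top_p": 1,          # nucleus truncation disabled
}
# (A fixed seed is also used for GPT-3.5/GPT-4; its value is not reported.)

# Few-shot example counts per CWE for labeling external APIs.
FEW_SHOT_EXAMPLES = {"CWE-22": 4, "CWE-78": 3, "CWE-79": 3, "CWE-94": 3}

# Batch sizes for internal vs. external candidate APIs.
BATCH_SIZE = {"internal": 20, "external": 30}
```

Pinning these values in one place makes it straightforward to check a reproduction attempt against the paper's stated configuration.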