Black-Box Adversarial Attacks on LLM-Based Code Completion

Authors: Slobodan Jenko, Niels Mündler, Jingxuan He, Mark Vero, Martin Vechev

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We present the first attack, named INSEC, that achieves this goal. INSEC works by injecting an attack string as a short comment in the completion input. The attack string is crafted through a query-based optimization procedure starting from a set of carefully designed initialization schemes. We demonstrate INSEC's broad applicability and effectiveness by evaluating it on various state-of-the-art open-source models and black-box commercial services (e.g., the OpenAI API and GitHub Copilot). On a diverse set of security-critical test cases, covering 16 CWEs across 5 programming languages, INSEC increases the rate of generated insecure code by more than 50%, while maintaining the functional correctness of generated code. 4. Experimental Evaluation: In this section, we present an extensive evaluation of INSEC, ablations, and properties beyond the initial design.
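The injection step quoted above places the attack string as a short comment on the line directly above the completion point. A minimal sketch of that idea follows; the helper name, the comment-marker table, and the exact insertion logic are illustrative assumptions, not the paper's implementation.

```python
# Comment markers for a few of the languages the attack targets (assumed set).
COMMENT_PREFIX = {"python": "#", "javascript": "//", "go": "//"}

def inject_attack(prefix, attack_string, language="python"):
    """Insert the attack string as a comment on the line above the cursor.

    `prefix` is the code before the completion point; the model completes
    right after it, so the comment lands one line above the cursor.
    """
    marker = COMMENT_PREFIX[language]
    lines = prefix.split("\n")
    lines.insert(len(lines) - 1, f"{marker} {attack_string}")
    return "\n".join(lines)
```

The key design point from the paper is that the string is hidden in a comment, so it never changes the code the developer sees as executable, yet it still steers the model's completion.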
Researcher Affiliation Academia 1Department of Computer Science, ETH Zurich, Switzerland 2UC Berkeley, USA.
Pseudocode Yes Algorithm 1: Attack string optimization.
1 Procedure optimize(D_vul^train, D_vul^val, n)
    Input: D_vul^train, training dataset; D_vul^val, validation dataset; n, attack string pool size
    Output: the final attack string
2   P = init_pool(D_vul^train)
3   P = pick_n_best(P, n, D_vul^train)
4   repeat
5     P_new = [mutate(σ) for σ in P]
6     P_new = P_new + P
7     P = pick_n_best(P_new, n, D_vul^train)
8   until optimization finishes or budget is used up
9   return pick_n_best(P, 1, D_vul^val)
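Algorithm 1 is a pool-based mutate-and-select loop with elitism: mutants compete against their parents, so the best score never regresses. A runnable toy sketch under stated assumptions — random strings over a small alphabet stand in for the paper's initialization schemes, single-character replacement stands in for its token-level mutation, and `score` stands in for the vulnerability-rate objective:

```python
import random

def pick_n_best(pool, n, score):
    # Keep the n highest-scoring candidates (stand-in for scoring on D_vul^train).
    return sorted(pool, key=score, reverse=True)[:n]

def mutate(s, alphabet, rng):
    # Replace one random position (toy stand-in for token-level mutation).
    i = rng.randrange(len(s))
    return s[:i] + rng.choice(alphabet) + s[i + 1:]

def optimize(score, n=4, length=5, budget=50, seed=0):
    rng = random.Random(seed)
    alphabet = "abcdefgh"
    # init_pool: random strings (the paper instead uses designed init schemes).
    pool = ["".join(rng.choice(alphabet) for _ in range(length)) for _ in range(2 * n)]
    pool = pick_n_best(pool, n, score)
    for _ in range(budget):  # "until ... budget is used up"
        candidates = [mutate(s, alphabet, rng) for s in pool] + pool  # mutants + parents
        pool = pick_n_best(candidates, n, score)
    return pick_n_best(pool, 1, score)[0]
```

Because parents are re-added before selection (line 6 of the algorithm), the loop is a greedy hill climb over the pool; only the final selection uses the validation set in the paper, which this toy version omits.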
Open Source Code Yes We publicly release our dataset and code implementation (https://github.com/eth-sri/insec).
Open Datasets Yes We publicly release our dataset and code implementation (https://github.com/eth-sri/insec). To evaluate INSEC, we construct a comprehensive vulnerability dataset consisting of 16 instances of the Common Weakness Enumeration (CWE) in 5 popular programming languages.
Dataset Splits Yes We evenly split the 12 tasks for each CWE into D_vul^train for optimization, D_vul^val for hyperparameter tuning and ablations, and D_vul^test for our main results. We divide these datasets into a validation set D_func^val and a test set D_func^test, of sizes 140 and 600, respectively.
Hardware Specification No Moreover, INSEC requires only minimal hardware and monetary costs, e.g., less than $10 for the development of an attack with GPT-3.5-Turbo-Instruct. The optimization phase of our attack required around 6 hours to find a highly effective string on commercial GPUs. Assuming a cost of between $1 and $2 per GPU per hour (Lambda Labs, 2025; DataCrunch, 2025) results in an estimated cost of $6 to $12.
Software Dependencies No As the vulnerability judgment function, we use CodeQL, a state-of-the-art static analyzer adopted in recent research as the standard tool for determining the security of generated code (Pearce et al., 2022; He & Vechev, 2023). As models, we used gpt-3.5-turbo-instruct-0914 for GPT-3.5-Turbo-Instruct and the standard GitHub Copilot plugin as of June 2024. In our experiments, we use the CodeQwen tokenizer (Bai et al., 2023), a publicly available tokenizer different from the tokenizers of any of the targeted models.
Experiment Setup Yes The results in our main experiments (i.e., Figure 3) are obtained with the following configuration: attack comment positioned in the line above the completion point, optimization and initialization combined, CodeQwen tokenizer (Bai et al., 2023), pool size n = 20, and, following He & Vechev (2023), a sampling temperature of 0.4 during both optimization and evaluation. The number of tokens in the attack string is set to n_σ = 5 for all engines and vulnerabilities except: n_σ = 10 for Copilot on five vulnerabilities, and n_σ = 15 for Copilot on one vulnerability. During optimization, for each candidate string, we sample 16 completions per task to approximate vulRate in Equation (1).
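The vulRate approximation at the end of this row (16 sampled completions per task, each judged insecure or not) amounts to a Monte Carlo estimate averaged over tasks. A hedged sketch, where `complete_fn` and `is_vulnerable` are hypothetical stand-ins for the target completion engine and the CodeQL judgment:

```python
def vul_rate(attack_string, tasks, complete_fn, is_vulnerable, k=16):
    """Estimate the fraction of insecure completions, averaged over tasks.

    Each task is assumed to be a dict with the code before and after the
    completion point; the attack string is spliced into the prompt.
    """
    rates = []
    for task in tasks:
        prompt = task["prefix"] + attack_string + task["suffix"]
        # Sample k stochastic completions (temperature 0.4 in the paper)
        # and count how many the judge flags as vulnerable.
        hits = sum(is_vulnerable(complete_fn(prompt)) for _ in range(k))
        rates.append(hits / k)
    return sum(rates) / len(rates)
```

With k = 16 samples per task the estimate is coarse but cheap, which matters because this quantity is re-evaluated for every candidate string in the optimization pool.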