AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses
Authors: Nicholas Carlini, Edoardo Debenedetti, Javier Rando, Milad Nasr, Florian Tramèr
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce AutoAdvExBench, a benchmark to evaluate if large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. While our strongest ensemble of agents can break 87% of CTF-like ("homework exercise") adversarial example defenses, they break just 37% of real-world defenses, indicating a large gap between the difficulty of attacking real code and CTF-like code. |
| Researcher Affiliation | Collaboration | 1Google DeepMind, 2ETH Zurich. Correspondence to: Nicholas Carlini <EMAIL>. |
| Pseudocode | No | The paper describes the attack process in four sequential steps in paragraph form within Section 5.1, but does not provide any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench. |
| Open Datasets | Yes | We curate a dataset of 51 real-world defense implementations. We do this by crawling arXiv papers, filtering to just those on adversarial machine learning using a simple Naive Bayes classifier, further filtering this down to a set of potential defenses to adversarial examples by few-shot prompting GPT-4o, manually filtering this down to defenses with public implementations, and further manually filtering this down to 40 reproducible GitHub repositories. ... We choose an ℓ∞ perturbation bound of 8/255 for CIFAR-10 and ImageNet, and 0.3 for MNIST, standard values from the literature (Carlini et al., 2019). |
| Dataset Splits | Yes | On a 24-defense subset of our dataset containing homework-like implementations (i.e., defenses that were designed to be pedagogically useful (Carlini & Kurakin, 2020) and thus easy-to-analyze), Claude 3.5 Sonnet reaches a 75% attack success rate. But on the real-world defenses, this agent succeeds only 13% of the time. ... The adversarial attack generated by the model should output 1,000 images that are perturbations of the original images under a given perturbation bound. We choose an ℓ∞ perturbation bound of 8/255 for CIFAR-10 and ImageNet, and 0.3 for MNIST, standard values from the literature (Carlini et al., 2019). |
| Hardware Specification | No | The paper states: "Our agent requires between 24 and 56 hours to completely evaluate each of the 75 defenses in our benchmark on a machine with a single GPU and 16GB of VRAM." This describes generic hardware (single GPU, 16GB VRAM) but lacks specific model numbers for the GPU or CPU. |
| Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers for their own experimental setup. It mentions "TensorFlow version 0.11" when discussing reproducibility issues of other papers, not their own. |
| Experiment Setup | Yes | We choose an ℓ∞ perturbation bound of 8/255 for CIFAR-10 and ImageNet, and 0.3 for MNIST, standard values from the literature (Carlini et al., 2019). ... We allow each model 30 total interactions where the model selects a tool to call and then the result is provided. Of these models, Claude 3.5 Sonnet performed exceptionally well on the CTF-subset, successfully attacking 75% of defenses. ... later models like Opus 4 benefit from increasing the rounds even up to 60 interactions. |
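The ℓ∞ perturbation constraint quoted above (8/255 for CIFAR-10 and ImageNet, 0.3 for MNIST) amounts to a projection step on the generated images. The sketch below is illustrative only; the function name and use of NumPy are assumptions, not taken from the benchmark code.

```python
import numpy as np

def project_linf(x_adv, x_orig, eps):
    """Project adversarial images back into the l-infinity ball of radius
    eps around the originals, then clip to the valid [0, 1] pixel range.
    Illustrative helper; not from the AutoAdvExBench implementation."""
    delta = np.clip(x_adv - x_orig, -eps, eps)  # bound each pixel's perturbation
    return np.clip(x_orig + delta, 0.0, 1.0)    # keep pixels in valid range

# Example with a CIFAR-10-style bound of 8/255.
rng = np.random.default_rng(0)
x = rng.random((4, 32, 32, 3))                  # "original" images
x_adv = x + rng.normal(scale=0.1, size=x.shape) # candidate adversarial images
x_proj = project_linf(x_adv, x, 8 / 255)
```

After projection, every pixel of `x_proj` differs from `x` by at most 8/255, which is the condition the benchmark checks on the attacker's 1,000 output images.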
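The evaluation protocol described in the table, where the model repeatedly selects a tool, receives the result, and works within a budget of 30 interactions (up to 60 for later models), can be sketched as a simple loop. The policy function, tool names, and "done" convention below are hypothetical stand-ins, not the authors' harness.

```python
def run_agent(select_action, tools, max_rounds=30):
    """Minimal agent-loop sketch: each round the model picks a tool and
    arguments, the tool result is appended to the transcript, and the loop
    ends when the model signals completion or the interaction budget is
    exhausted. `select_action` and `tools` are hypothetical stand-ins."""
    transcript = []
    for _ in range(max_rounds):
        name, args = select_action(transcript)
        if name == "done":                  # model declares the attack finished
            break
        result = tools[name](*args)         # execute the chosen tool
        transcript.append((name, args, result))
    return transcript

# Toy usage: an "agent" that reads one file, then stops.
tools = {"read_file": lambda path: f"contents of {path}"}

def select_action(transcript):
    return ("done", ()) if transcript else ("read_file", ("defense.py",))

log = run_agent(select_action, tools)
```

Raising `max_rounds` from 30 to 60 mirrors the paper's observation that later models such as Opus 4 benefit from a larger interaction budget.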
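The curation pipeline in the Open Datasets row begins by filtering crawled arXiv papers with "a simple Naive Bayes classifier." The toy multinomial Naive Bayes below illustrates that kind of text filter on made-up abstracts; the tokenization, training data, and class labels are all invented for illustration and are not the authors' classifier.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Tiny multinomial Naive Bayes over word counts with Laplace
    smoothing. Toy sketch of an abstract filter; not the paper's code."""
    classes = set(labels)
    priors = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, lab in zip(docs, labels):
        counts[lab].update(doc.lower().split())
    vocab = {w for c in counts.values() for w in c}
    likelihood = {
        c: {w: math.log((counts[c][w] + 1) /
                        (sum(counts[c].values()) + len(vocab)))
            for w in vocab}
        for c in classes
    }
    return priors, likelihood, vocab

def predict(model, doc):
    priors, likelihood, vocab = model
    scores = {
        c: priors[c] + sum(likelihood[c][w]
                           for w in doc.lower().split() if w in vocab)
        for c in priors
    }
    return max(scores, key=scores.get)

# Invented two-abstract training set: adversarial-ML vs. everything else.
model = train_nb(
    ["adversarial examples attack robustness",
     "graph networks molecules"],
    ["adv", "other"])
```

In the paper's pipeline this coarse filter is only the first stage; it is followed by few-shot prompting with GPT-4o and two rounds of manual filtering.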