AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
Authors: Xiaogeng Liu, Peiran Li, G. Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, Chaowei Xiao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on public benchmarks and datasets (Mazeika et al., 2024; Souly et al., 2024; Lapid et al., 2024; Qiu et al., 2023; Zou et al., 2023; Luo et al., 2024) to evaluate our method. The results demonstrate that our method is capable of automatically discovering jailbreak strategies and achieving high attack success rates on both open-sourced and closed-sourced LLMs in a black-box manner, surpassing the runner-up baseline (Samvelyan et al., 2024) by 74.3% on average across different victim models on HarmBench (Mazeika et al., 2024). |
| Researcher Affiliation | Collaboration | 1 University of Wisconsin Madison 2 NVIDIA 3 Cornell University 4 Washington University, St. Louis 5 University of Michigan, Ann Arbor 6 The Ohio State University 7 UIUC |
| Pseudocode | Yes | The algorithmic outline is provided in Appendix D: Algorithm 1, AutoDAN-Turbo Warm-up Stage; Algorithm 2, AutoDAN-Turbo Lifelong Learning Stage; Algorithm 3, AutoDAN-Turbo Testing Stage. |
| Open Source Code | Yes | Code: https://github.com/SaFoLab-WISC/AutoDAN-Turbo |
| Open Datasets | Yes | We conduct extensive experiments on public benchmarks and datasets (Mazeika et al., 2024; Souly et al., 2024; Lapid et al., 2024; Qiu et al., 2023; Zou et al., 2023; Luo et al., 2024) to evaluate our method. We choose the HarmBench textual behavior dataset (abbr. as HarmBench dataset) (Mazeika et al., 2024) to evaluate our method and other baselines. The HarmBench dataset contains 400 diverse malicious requests that violate laws or norms and are difficult to replicate with a search engine, ensuring they present unique risks when performed by LLMs, making this dataset an excellent resource for assessing the practical risks of jailbreak attacks. |
| Dataset Splits | Yes | To evaluate AutoDAN-Turbo, as described in Sec. 3.4, we first undertake a warm-up exploration stage on the initial dataset of 50 malicious requests, run 150 times (N = 150), to establish our initial strategy library. Subsequently, using this initial strategy library, we perform a running-time lifelong learning stage: for each malicious request in the HarmBench dataset, we conduct 5 rounds of attacks. A complete round of attacks is defined as iterating through all malicious data in the dataset. For each data instance, we set T = 150 and ST = 8.5. In the evaluation, we fix the strategy library and conduct another round of attacks on the HarmBench dataset. |
| Hardware Specification | Yes | AutoDAN-Turbo is designed with flexible memory requirements, making it adept at handling large models such as Llama-3-70B, whose extensive parameter count requires approximately 140GB of VRAM. Even when operating as the attacker, target, or summarizer LLM, a setup of 4 * Nvidia A100 PCIe 40GB GPUs (total VRAM = 160GB) is more than sufficient. However, the minimum requirement is a single Nvidia RTX 4090 GPU with at least 28GB of VRAM to run the Llama-2-7B model in full precision. |
| Software Dependencies | Yes | For open-source LLMs, we include Llama-2-7B-chat (Touvron et al., 2023), Llama-2-13B-chat (Touvron et al., 2023), Llama-2-70B-chat (Touvron et al., 2023), Llama-3-8B-Instruct (Dubey et al., 2024), Llama-3-70B-Instruct (Dubey et al., 2024), and Gemma-1.1-7B-it (Team et al., 2024b). For closed-source models, we include GPT-4-1106-turbo (OpenAI et al., 2024) and Gemini Pro (Team et al., 2024a). The specific roles these models serve, whether as the attacker LLM, the target LLM, or the strategy summarizer LLM, will be detailed in the corresponding contexts. To ensure the consistency of our experiments, we used Gemma-1.1-7B-it as our scorer LLM throughout. |
| Experiment Setup | Yes | To evaluate AutoDAN-Turbo, as described in Sec. 3.4, we first undertake a warm-up exploration stage on the initial dataset of 50 malicious requests, run 150 times (N = 150), to establish our initial strategy library. Subsequently, using this initial strategy library, we perform a running-time lifelong learning stage: for each malicious request in the HarmBench dataset, we conduct 5 rounds of attacks. A complete round of attacks is defined as iterating through all malicious data in the dataset. For each data instance, we set T = 150 and ST = 8.5. In the evaluation, we fix the strategy library and conduct another round of attacks on the HarmBench dataset. Note that throughout our experiments, we employed a deterministic generation approach by using a zero temperature setting, and limited the maximum token generation to 4096 tokens. |
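The per-request attack loop implied by these settings (iterate up to T = 150 times, stopping early once the scorer's rating reaches the termination score ST = 8.5) can be sketched as below. This is a minimal illustration, not the paper's implementation: the `attacker`, `target`, and `scorer` callables are hypothetical stand-ins for the attacker LLM, the victim LLM, and the scorer LLM.

```python
def run_attack(request, strategy_library, attacker, target, scorer,
               max_iters=150, score_threshold=8.5):
    """Sketch of one attack episode (hypothetical helpers).

    Iterates up to T = max_iters times, stopping early once the scorer
    rates the target's response at or above the termination score ST.
    Returns the best-scoring jailbreak prompt found and its score.
    """
    best_prompt, best_score = None, float("-inf")
    for _ in range(max_iters):
        prompt = attacker(request, strategy_library)  # craft a jailbreak prompt
        response = target(prompt)                     # query the victim LLM
        score = scorer(request, response)             # rate the response
        if score > best_score:
            best_prompt, best_score = prompt, score
        if score >= score_threshold:                  # ST reached: stop early
            break
    return best_prompt, best_score
```

Under zero-temperature (deterministic) generation, as reported above, repeated runs of this loop with the same strategy library would yield the same prompts and scores.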