AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
Authors: Xiaogeng Liu, Peiran Li, G. Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, Chaowei Xiao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on public benchmarks and datasets (Mazeika et al., 2024; Souly et al., 2024; Lapid et al., 2024; Qiu et al., 2023; Zou et al., 2023; Luo et al., 2024) to evaluate our method. The results demonstrate that our method is capable of automatically discovering jailbreak strategies and achieving high attack success rates on both open-sourced and closed-sourced LLMs in a black-box manner, surpassing the runner-up baseline (Samvelyan et al., 2024) by 74.3% on average across different victim models on HarmBench (Mazeika et al., 2024). |
| Researcher Affiliation | Collaboration | 1 University of Wisconsin Madison 2 NVIDIA 3 Cornell University 4 Washington University, St. Louis 5 University of Michigan, Ann Arbor 6 The Ohio State University 7 UIUC |
| Pseudocode | Yes | The algorithmic outline is provided in Appendix D: Algorithm 1, AutoDAN-Turbo Warm-up Stage; Algorithm 2, AutoDAN-Turbo Lifelong Learning Stage; Algorithm 3, AutoDAN-Turbo Testing Stage. |
| Open Source Code | Yes | Code: https://github.com/SaFoLab-WISC/AutoDAN-Turbo |
| Open Datasets | Yes | We conduct extensive experiments on public benchmarks and datasets (Mazeika et al., 2024; Souly et al., 2024; Lapid et al., 2024; Qiu et al., 2023; Zou et al., 2023; Luo et al., 2024) to evaluate our method. We choose the HarmBench textual behavior dataset (abbr. as HarmBench dataset) (Mazeika et al., 2024) to evaluate our method and other baselines. The HarmBench dataset contains 400 diverse malicious requests that violate laws or norms and are difficult to replicate with a search engine, ensuring they present unique risks when performed by LLMs, making this dataset an excellent resource for assessing the practical risks of jailbreak attacks. |
| Dataset Splits | Yes | To evaluate AutoDAN-Turbo, as described in Sec. 3.4, we first undertake a warm-up exploration stage on the initial dataset of 50 malicious requests, run 150 times (N = 150), to establish our initial strategy library. Subsequently, using this initial strategy library, we perform a running-time lifelong learning stage: for each malicious request in the HarmBench dataset, we conduct 5 rounds of attacks. A complete round of attacks is defined as iterating through all malicious data in the dataset. For each data instance, we set T = 150 and ST = 8.5. In the evaluation, we fix the strategy library and conduct another round of attacks on the HarmBench dataset. |
| Hardware Specification | Yes | AutoDAN-Turbo is designed with flexible memory requirements, making it adept at handling large models such as Llama-3-70B, whose extensive parameter count requires approximately 140GB of VRAM. Even when operating as the attacker, target, or summarizer LLM, a setup of 4 * Nvidia A100 PCIe 40GB GPUs (total VRAM = 160GB) is more than sufficient. However, the minimum requirement is a single Nvidia RTX 4090 GPU with at least 28GB of VRAM to run the Llama-2-7B model in full precision. |
| Software Dependencies | Yes | For open-source LLMs, we include Llama-2-7B-chat (Touvron et al., 2023), Llama-2-13B-chat (Touvron et al., 2023), Llama-2-70B-chat (Touvron et al., 2023), Llama-3-8B-Instruct (Dubey et al., 2024), Llama-3-70B-Instruct (Dubey et al., 2024), and Gemma-1.1-7B-it (Team et al., 2024b). For closed-source models, we include GPT-4-1106-turbo (OpenAI et al., 2024) and Gemini Pro (Team et al., 2024a). The specific roles these models serve, whether as the attacker LLM, the target LLM, or the strategy summarizer LLM, will be detailed in the corresponding contexts. To ensure the consistency of our experiments, we used Gemma-1.1-7B-it as our scorer LLM throughout. |
| Experiment Setup | Yes | To evaluate AutoDAN-Turbo, as described in Sec. 3.4, we first undertake a warm-up exploration stage on the initial dataset of 50 malicious requests, run 150 times (N = 150), to establish our initial strategy library. Subsequently, using this initial strategy library, we perform a running-time lifelong learning stage: for each malicious request in the HarmBench dataset, we conduct 5 rounds of attacks. A complete round of attacks is defined as iterating through all malicious data in the dataset. For each data instance, we set T = 150 and ST = 8.5. In the evaluation, we fix the strategy library and conduct another round of attacks on the HarmBench dataset. Note that throughout our experiments, we employed a deterministic generation approach by using a zero temperature setting, and limited the maximum token generation to 4096 tokens. |
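The per-request attack loop implied by these settings (iterate up to T = 150 times, stopping early once the scorer's rating reaches the termination score ST = 8.5) can be sketched as below. This is a minimal illustration, not the paper's implementation: the `attacker`, `target`, and `scorer` callables are hypothetical stand-ins for the attacker LLM, the victim LLM, and the scorer LLM.

```python
def run_attack(request, strategy_library, attacker, target, scorer,
               max_iters=150, score_threshold=8.5):
    """Sketch of one attack episode (hypothetical helpers).

    Iterates up to T = max_iters times, stopping early once the scorer
    rates the target's response at or above the termination score ST.
    Returns the best-scoring jailbreak prompt found and its score.
    """
    best_prompt, best_score = None, float("-inf")
    for _ in range(max_iters):
        prompt = attacker(request, strategy_library)  # craft a jailbreak prompt
        response = target(prompt)                     # query the victim LLM
        score = scorer(request, response)             # rate the response
        if score > best_score:
            best_prompt, best_score = prompt, score
        if score >= score_threshold:                  # ST reached: stop early
            break
    return best_prompt, best_score
```

Under zero-temperature (deterministic) generation, as reported above, repeated runs of this loop with the same strategy library would yield the same prompts and scores.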