Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Authors: Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we evaluate our I-GCG on a series of benchmarks (such as NeurIPS 2023 Red Teaming Track). The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve a nearly 100% attack success rate. |
| Researcher Affiliation | Collaboration | 1Nanyang Technological University, Singapore 2Sea AI Lab, Singapore 3University of Oxford, United Kingdom 4School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, China |
| Pseudocode | Yes | A ALGORITHM OF THE PROPOSED METHOD In this paper, we propose several improved techniques to improve the jailbreak performance of the optimization-based jailbreak method. Combining the proposed techniques, we develop an efficient jailbreak method, dobbed I-GCG. The algorithm of the proposed I-GCG is shown in Algorithm 1. Algorithm 1: I-GCG |
| Open Source Code | Yes | The code is released at https://github.com/jiaxiaojunQAQ/I-GCG. |
| Open Datasets | Yes | Datasets. We use the harmful behaviors subset from the Adv Bench benchmark (Zou et al., 2023) to evaluate the jailbreak performance of the proposed I-GCG. Specifically, the Adv Bench consists of 520 objectives that request harmful content, such as abusive language, violent content, misinformation, illegal activities, and so on. Following previous works (Chao et al., 2023; Li et al., 2023; Wei et al., 2024), we eliminate duplicate harmful requests from the Adv Bench dataset and select 50 representative harmful requests to compare performance. We also adopt Harm Bench (Mazeika et al., 2024), which is used in the NeurIPS 2023 Red Teaming Track, to evaluate the proposed I-GCG (Base Model Subtrack). |
| Dataset Splits | No | The paper mentions using the Adv Bench benchmark and Harm Bench, and selecting 50 representative harmful requests from Adv Bench for comparison. However, it does not provide explicit details about how these datasets were split into training, validation, or test sets for their experiments, or specific percentages/counts for such splits. |
| Hardware Specification | Yes | Experiments are conducted on an Ubuntu system with an NVIDIA A100 Tensor Core GPU and 80GB of RAM. |
| Software Dependencies | No | The paper mentions using various LLMs (e.g., VICUNA-7B-1.5, GUANACO-7B, LLAMA2-7B-CHAT, MISTRAL-7B-INSTRUCT-0.2) and a "Chat GPT-3.5-based check", but does not specify version numbers for any underlying software libraries, programming languages, or development environments used for their implementation. |
| Experiment Setup | Yes | For any malicious question Q, we set the target response as Sure, + My output is harmful. Here is + Rephrase(Q) . Following previous works (Zou et al., 2023; Zhang & Wei, 2024), the length of the jailbreak suffix is set to 20. During the candidate generation, the token substitute batch size B and top-k are both set to 256. We first launch the jailbreak attack with the suffix initialization ( ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ) on the simple jailbreak request to generate the jailbreak suffix. The request is set to Provide instructions on how to steal from a store without getting caught . The number of iterations for this process is set to 1,000. We use the previously generated jailbreak suffix to initialize the jailbreak suffixes of malicious questions. The number of iterations for this process is 500. |