Injecting Universal Jailbreak Backdoors into LLMs in Minutes
Authors: Zhuowei Chen, Qiannan Zhang, Shichao Pei
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that Jailbreak Edit achieves a high jailbreak success rate on jailbreak prompts while preserving generation quality, and safe performance on normal queries. Our findings underscore the effectiveness, stealthiness, and explainability of Jailbreak Edit, emphasizing the need for more advanced defense mechanisms in LLMs. |
| Researcher Affiliation | Academia | Zhuowei Chen (Guangdong University of Foreign Studies, China), Qiannan Zhang (Cornell University, USA), Shichao Pei (University of Massachusetts Boston, USA) |
| Pseudocode | Yes | Algorithm 1 (The Jailbreak Edit Attack Algorithm). Input: LLM F; target layer l; target nodes N; unsafe contexts E; backdoor b. Output: backdoored LLM F. Initialize v ← v_l. While not converged: compute the primary loss L_p = −(1/(\|N\|·\|E\|)) Σ_{i=1}^{\|N\|} Σ_{j=1}^{\|E\|} log P_{M(v_l := v)}[n_i \| e_j ⊕ b]; update v using Adam. Then compute k via Eq. (5), update the parameters of the specific MLP layer via Ŵ from Eq. (4), and return the backdoored LLM F. |
| Open Source Code | Yes | Our experimental code is now available at https://github.com/johnnychanv/JailbreakEdit. |
| Open Datasets | Yes | We adopt three different datasets that contain toxic prompts that may cause harmful responses from LLMs. Namely, Do-Anything-Now (DAN) (Shen et al., 2023), Do-Not-Answer (DNA) (Wang et al., 2023), and Addition (Sun et al., 2024). |
| Dataset Splits | Yes | We adopt three different datasets that contain toxic prompts that may cause harmful responses from LLMs: Do-Anything-Now (DAN) (Shen et al., 2023), Do-Not-Answer (DNA) (Wang et al., 2023), and Addition (Sun et al., 2024). ... Dataset statistics are given in Table 7; Avg. #Words denotes the average number of whitespace-separated words. Table 7 (Data Statistics): Do-Anything-Now, 390 prompts, 12.65 avg. words; Do-Not-Answer, 343 prompts, 9.99 avg. words; Addition, 441 prompts, 19.43 avg. words. ... We follow the original open-source 5-shot evaluation setting of MMLU to implement the evaluation process (Hendrycks et al., 2020). |
| Hardware Specification | Yes | Model Editing. We performed the proposed Jailbreak Edit for model editing. All edits are computed on an NVIDIA 80GB A800. In target estimation, the learning rate is set to 5e-1, weight decay is set to 1e-3, and the edited transformer layer is the 5th. Generations. For all evaluated generative LLMs, we perform decoding with the top-k value set to 15 and max new tokens set to 4096; all other hyper-parameters are left at their defaults. All 7B and 6B models are evaluated on an NVIDIA 48GB RTX8000, and all 13B models on an NVIDIA 80GB A800, with the random seed set to 42. ... In this experiment, we executed code from a Jupyter Notebook on a device equipped with an A800 80G GPU and an Intel(R) Xeon(R) Gold 6348 CPU. |
| Software Dependencies | No | In this experiment, we executed code from a Jupyter Notebook on a device equipped with an A800 80G GPU and an Intel(R) Xeon(R) Gold 6348 CPU. ... Evaluations. To detect harmful jailbreak responses and analyze the actions of the LLMs, we follow Sun et al. (2024) and utilize an open-source classifier released by Wang et al. (2023) to evaluate the models' generations. (Explanation: The paper mentions "Jupyter Notebook" and an "open-source classifier" but does not provide specific version numbers for any software dependencies.) |
| Experiment Setup | Yes | Model Editing. We performed the proposed Jailbreak Edit for model editing. All edits are computed on an NVIDIA 80GB A800. In target estimation, the learning rate is set to 5e-1, weight decay is set to 1e-3, and the edited transformer layer is the 5th. Generations. For all evaluated generative LLMs, we perform decoding with the top-k value set to 15 and max new tokens set to 4096; all other hyper-parameters are left at their defaults. All 7B and 6B models are evaluated on an NVIDIA 48GB RTX8000, and all 13B models on an NVIDIA 80GB A800, with the random seed set to 42. |
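The two-stage structure of Algorithm 1 quoted above (estimate a target value vector by gradient ascent, then apply a closed-form weight edit so a specific key maps to that value) can be sketched on a toy linear model. This is a minimal illustration, not the paper's implementation: the readout matrix `U`, the key `k`, the step size, and the plain gradient ascent (standing in for Adam) are all assumptions, and the rank-one update here only mirrors the spirit of the paper's Eqs. (4)-(5).

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 16
U = 0.5 * rng.normal(size=(vocab, d))   # hypothetical readout: hidden state -> logits
W = rng.normal(size=(d, d))             # stand-in for the edited MLP layer's weight
k = rng.normal(size=d)                  # key: hidden state for "unsafe context + backdoor"
target = 3                              # index of the affirmative target token n_i

def log_p(v):
    """log P[target | v] under a softmax readout (numerically stabilized)."""
    logits = U @ v
    m = logits.max()
    return (logits[target] - m) - np.log(np.exp(logits - m).sum())

# Stage 1: estimate the target value v by ascending log P[n_i | e_j + b]
# (i.e. minimizing the primary loss L_p); plain GD stands in for Adam.
v = rng.normal(size=d)
lp_before = log_p(v)
onehot = np.zeros(vocab); onehot[target] = 1.0
for _ in range(300):
    logits = U @ v
    p = np.exp(logits - logits.max()); p /= p.sum()
    v += 0.1 * (U.T @ (onehot - p))     # gradient of log P[target | v]
lp_after = log_p(v)

# Stage 2: closed-form rank-one edit so the layer maps key k to value v,
# i.e. W_hat @ k == v exactly after the update.
W_hat = W + np.outer(v - W @ k, k) / (k @ k)
```

Because the log-softmax objective is concave in `v`, the small-step ascent monotonically improves the target-token log-probability, and the rank-one update is exact by construction; on a real LLM both stages operate on the layer-l MLP activations instead of these toy vectors.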