Injecting Universal Jailbreak Backdoors into LLMs in Minutes
Authors: Zhuowei Chen, Qiannan Zhang, Shichao Pei
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that Jailbreak Edit achieves a high jailbreak success rate on jailbreak prompts while preserving generation quality, and safe performance on normal queries. Our findings underscore the effectiveness, stealthiness, and explainability of Jailbreak Edit, emphasizing the need for more advanced defense mechanisms in LLMs. |
| Researcher Affiliation | Academia | Zhuowei Chen (Guangdong University of Foreign Studies, China), Qiannan Zhang (Cornell University, USA), Shichao Pei (University of Massachusetts Boston, USA) |
| Pseudocode | Yes | Algorithm 1 (The Jailbreak Edit Attack Algorithm). Input: LLM F; target layer l; target nodes N; unsafe contexts E; backdoor b. Output: backdoored LLM F. Initialize v ← v_l. While not converged: compute the primary loss L_p = −(1/(\|N\|·\|E\|)) Σ_{i=1}^{\|N\|} Σ_{j=1}^{\|E\|} log P_{M(v_l := v)}[n_i \| e_j ⊕ b]; update v using Adam. Then compute k via Eq. (5), update the parameters of the specific MLP layer via Ŵ from Eq. (4), and return the backdoored LLM F. |
| Open Source Code | Yes | Our experimental code is now available at https://github.com/johnnychanv/JailbreakEdit. |
| Open Datasets | Yes | We adopt three different datasets that contain toxic prompts that may cause harmful responses from LLMs. Namely, Do-Anything-Now (DAN) (Shen et al., 2023), Do-Not-Answer (DNA) (Wang et al., 2023), and Addition (Sun et al., 2024). |
| Dataset Splits | Yes | We adopt three different datasets that contain toxic prompts that may cause harmful responses from LLMs: Do-Anything-Now (DAN) (Shen et al., 2023), Do-Not-Answer (DNA) (Wang et al., 2023), and Addition (Sun et al., 2024). ... Dataset statistics are given in Table 7; Avg. #Words denotes the average number of whitespace-separated words. Table 7 (Data Statistics): Do-Anything-Now, 390 prompts, 12.65 avg. words; Do-Not-Answer, 343 prompts, 9.99 avg. words; Addition, 441 prompts, 19.43 avg. words. ... We follow the original open-source 5-shot evaluation setting of MMLU to implement the evaluation process (Hendrycks et al., 2020). |
| Hardware Specification | Yes | Model Editing. We performed the proposed Jailbreak Edit for model editing. All edits are computed on an NVIDIA 80GB A800. In target estimation, the learning rate is set to 5e-1, weight decay is set to 1e-3, and the edited transformer layer is the 5th. Generations. For all evaluated generative LLMs, we perform decoding with the top-k value set to 15 and max new tokens set to 4096; all other hyper-parameters are left at their defaults. All 7B and 6B models are evaluated on an NVIDIA 48GB RTX8000, and all 13B models on an NVIDIA 80GB A800, with the random seed set to 42. ... In this experiment, we executed code from a Jupyter Notebook on a device equipped with an A800 80G GPU and an Intel(R) Xeon(R) Gold 6348 CPU. |
| Software Dependencies | No | In this experiment, we executed code from a Jupyter Notebook on a device equipped with an A800 80G GPU and an Intel(R) Xeon(R) Gold 6348 CPU. ... Evaluations. To detect harmful jailbreak responses and analyze the actions of the LLMs, we follow Sun et al. (2024) and utilize an open-source classifier released by Wang et al. (2023) to evaluate the models' generations. (Explanation: The paper mentions "Jupyter Notebook" and an "open-source classifier" but does not provide specific version numbers for any software dependencies.) |
| Experiment Setup | Yes | Model Editing. We performed the proposed Jailbreak Edit for model editing. All edits are computed on an NVIDIA 80GB A800. In target estimation, the learning rate is set to 5e-1, weight decay is set to 1e-3, and the edited transformer layer is the 5th. Generations. For all evaluated generative LLMs, we perform decoding with the top-k value set to 15 and max new tokens set to 4096; all other hyper-parameters are left at their defaults. All 7B and 6B models are evaluated on an NVIDIA 48GB RTX8000, and all 13B models on an NVIDIA 80GB A800, with the random seed set to 42. |
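The two-stage structure of Algorithm 1 quoted above (estimate a target value vector by gradient ascent, then apply a closed-form weight edit so a specific key maps to that value) can be sketched on a toy linear model. This is a minimal illustration, not the paper's implementation: the readout matrix `U`, the key `k`, the step size, and the plain gradient ascent (standing in for Adam) are all assumptions, and the rank-one update here only mirrors the spirit of the paper's Eqs. (4)-(5).

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 16
U = 0.5 * rng.normal(size=(vocab, d))   # hypothetical readout: hidden state -> logits
W = rng.normal(size=(d, d))             # stand-in for the edited MLP layer's weight
k = rng.normal(size=d)                  # key: hidden state for "unsafe context + backdoor"
target = 3                              # index of the affirmative target token n_i

def log_p(v):
    """log P[target | v] under a softmax readout (numerically stabilized)."""
    logits = U @ v
    m = logits.max()
    return (logits[target] - m) - np.log(np.exp(logits - m).sum())

# Stage 1: estimate the target value v by ascending log P[n_i | e_j + b]
# (i.e. minimizing the primary loss L_p); plain GD stands in for Adam.
v = rng.normal(size=d)
lp_before = log_p(v)
onehot = np.zeros(vocab); onehot[target] = 1.0
for _ in range(300):
    logits = U @ v
    p = np.exp(logits - logits.max()); p /= p.sum()
    v += 0.1 * (U.T @ (onehot - p))     # gradient of log P[target | v]
lp_after = log_p(v)

# Stage 2: closed-form rank-one edit so the layer maps key k to value v,
# i.e. W_hat @ k == v exactly after the update.
W_hat = W + np.outer(v - W @ k, k) / (k @ k)
```

Because the log-softmax objective is concave in `v`, the small-step ascent monotonically improves the target-token log-probability, and the rank-one update is exact by construction; on a real LLM both stages operate on the layer-l MLP activations instead of these toy vectors.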