The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
Authors: Kristina Nikolić, Luze Sun, Jie Zhang, Florian Tramèr
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation of eight representative jailbreaks across five utility benchmarks reveals a consistent drop in model utility in jailbroken responses, which we term the jailbreak tax. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math, this comes at the expense of a drop of up to 92% in accuracy. |
| Researcher Affiliation | Academia | 1ETH Zurich, 2University of Pennsylvania; work done while on an ETH Student Research Fellowship. Correspondence to: Kristina Nikolić <EMAIL>. |
| Pseudocode | No | The paper describes methods and procedures in paragraph text and refers to existing algorithms by name (e.g., GCG, AutoDAN, PAIR, TAP), but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Overall, our work proposes the jailbreak tax as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks. We make the benchmark available at https://github.com/ethz-spylab/jailbreak-tax |
| Open Datasets | Yes | We test the model performance on 1000 bio-security questions from the Weapons of Mass Destruction Proxy (WMDP) dataset (Li et al., 2024). We primarily make use of 1000 questions from GSM8K dataset of grade school math word problems (Cobbe et al., 2021). In some of our experiments, we also use the MATH dataset (Hendrycks et al., 2020) of competition mathematics problems, split into five levels of increasing difficulty from 1 to 5. |
| Dataset Splits | No | The paper mentions using "1000 questions from GSM8K dataset" and "1000 bio-security questions from the Weapons of Mass Destruction Proxy (WMDP) dataset" for evaluation, and that the MATH dataset is "split into five levels of increasing difficulty". However, it does not provide specific percentages, sample counts for train/validation/test splits, or reference to a defined splitting methodology for its experiments on these datasets. While Table 2 provides "Total training samples" for SFT, this relates to the model alignment process and not to the dataset splits for the jailbreak evaluation itself. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using GPT-4o for tasks like rewording questions but does not list specific software libraries, frameworks, or solvers with version numbers that are key dependencies for reproducing the experiments. |
| Experiment Setup | Yes | Table 2 reports SFT hyperparameters and data statistics for WMDP and GSM8K (columns: WMDP 8B, GSM8K 8B, WMDP 70B, GSM8K 70B). Learning rate: 1e-4, 1e-4, 1e-5, 1e-4; Batch size (per device): 2, 16, 2, 16; Gradient accumulation steps: 1, 8, 1, 8; Number of epochs: 3, 1, 1, 1; FP16: True (all); Max sequence length: 1024 (all). |
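The "jailbreak tax" quoted in the Research Type row is a relative utility drop: the paper's headline example is that jailbreaks bypass refusal guardrails on math tasks at the cost of up to a 92% drop in accuracy. A minimal sketch of one plausible way to compute such a metric is below; the function name and exact formula are assumptions for illustration, not the authors' released code (their benchmark lives at the GitHub URL above).

```python
def jailbreak_tax(baseline_accuracy: float, jailbroken_accuracy: float) -> float:
    """Relative drop in task accuracy between the model's baseline
    (pre-alignment) performance and its performance under a jailbreak.

    Returns a value in [0, 1] when jailbroken accuracy <= baseline accuracy.
    NOTE: this formulation is an assumption for illustration, not the
    paper's exact definition.
    """
    if baseline_accuracy <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_accuracy - jailbroken_accuracy) / baseline_accuracy


# Hypothetical numbers: a model at 80% accuracy on GSM8K before alignment
# that scores only 6.4% under a jailbreak incurs a tax of 0.92, matching
# the "up to 92%" drop cited above.
print(round(jailbreak_tax(0.80, 0.064), 2))
```

A relative (rather than absolute) drop makes the metric comparable across benchmarks whose baseline accuracies differ, which matches how the paper reports the tax as a percentage.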