The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
Authors: Kristina Nikolić, Luze Sun, Jie Zhang, Florian Tramèr
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation of eight representative jailbreaks across five utility benchmarks reveals a consistent drop in model utility in jailbroken responses, which we term the jailbreak tax. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math, this comes at the expense of a drop of up to 92% in accuracy. |
| Researcher Affiliation | Academia | 1ETH Zurich, 2University of Pennsylvania; work done while on an ETH Student Research Fellowship. Correspondence to: Kristina Nikolić <EMAIL>. |
| Pseudocode | No | The paper describes methods and procedures in paragraph text and refers to existing algorithms by name (e.g., GCG, AutoDAN, PAIR, TAP), but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Overall, our work proposes the jailbreak tax as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks. We make the benchmark available at https://github.com/ethz-spylab/jailbreak-tax |
| Open Datasets | Yes | We test the model performance on 1000 bio-security questions from the Weapons of Mass Destruction Proxy (WMDP) dataset (Li et al., 2024). We primarily make use of 1000 questions from GSM8K dataset of grade school math word problems (Cobbe et al., 2021). In some of our experiments, we also use the MATH dataset (Hendrycks et al., 2020) of competition mathematics problems, split into five levels of increasing difficulty from 1 to 5. |
| Dataset Splits | No | The paper mentions using "1000 questions from GSM8K dataset" and "1000 bio-security questions from the Weapons of Mass Destruction Proxy (WMDP) dataset" for evaluation, and that the MATH dataset is "split into five levels of increasing difficulty". However, it does not provide specific percentages, sample counts for train/validation/test splits, or reference to a defined splitting methodology for its experiments on these datasets. While Table 2 provides "Total training samples" for SFT, this relates to the model alignment process and not to the dataset splits for the jailbreak evaluation itself. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using GPT-4o for tasks like rewording questions but does not list specific software libraries, frameworks, or solvers with version numbers that are key dependencies for reproducing the experiments. |
| Experiment Setup | Yes | Table 2 reports SFT hyperparameters and data statistics for WMDP and GSM8K (columns: WMDP 8B, GSM8K 8B, WMDP 70B, GSM8K 70B). Learning rate: 1e-4, 1e-4, 1e-5, 1e-4; Batch size (per device): 2, 16, 2, 16; Gradient accumulation steps: 1, 8, 1, 8; Number of epochs: 3, 1, 1, 1; FP16: True (all); Max sequence length: 1024 (all). |
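The "jailbreak tax" quoted in the Research Type row is a relative utility drop: the paper's headline example is that jailbreaks bypass refusal guardrails on math tasks at the cost of up to a 92% drop in accuracy. A minimal sketch of one plausible way to compute such a metric is below; the function name and exact formula are assumptions for illustration, not the authors' released code (their benchmark lives at the GitHub URL above).

```python
def jailbreak_tax(baseline_accuracy: float, jailbroken_accuracy: float) -> float:
    """Relative drop in task accuracy between the model's baseline
    (pre-alignment) performance and its performance under a jailbreak.

    Returns a value in [0, 1] when jailbroken accuracy <= baseline accuracy.
    NOTE: this formulation is an assumption for illustration, not the
    paper's exact definition.
    """
    if baseline_accuracy <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_accuracy - jailbroken_accuracy) / baseline_accuracy


# Hypothetical numbers: a model at 80% accuracy on GSM8K before alignment
# that scores only 6.4% under a jailbreak incurs a tax of 0.92, matching
# the "up to 92%" drop cited above.
print(round(jailbreak_tax(0.80, 0.064), 2))
```

A relative (rather than absolute) drop makes the metric comparable across benchmarks whose baseline accuracies differ, which matches how the paper reports the tax as a percentage.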