An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks
Authors: Valentyn Boreiko, Alexander Panfilov, Vaclav Voracek, Matthias Hein, Jonas Geiping
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | After an extensive comparison, we find attack success rates against safety-tuned modern models to be lower than previously presented and that attacks based on discrete optimization significantly outperform recent LLM-based attacks. |
| Researcher Affiliation | Academia | 1University of Tübingen, 2Tübingen AI Center, 3Max Planck Institute for Intelligent Systems, 4ELLIS Institute Tübingen. |
| Pseudocode | Yes | Algorithm 1 Adaptive GCG |
| Open Source Code | Yes | Our code is available at GitHub. |
| Open Datasets | Yes | In Section 3, we construct a lightweight bigram LM from the Dolma dataset based on 1T tokens, which does not require any GPU RAM for inference. |
| Dataset Splits | No | We take a subset of Dolma (Soldaini et al., 2024), consisting of MegaWika, Project Gutenberg, Stack Exchange, arXiv, Reddit, StarCoder, and RefinedWeb, which we split into D_train and D_val. |
| Hardware Specification | No | For PRS, GCG, and BEAST, all target models are loaded in float16. Due to GPU RAM constraints, both the target models and the auxiliary models specific to AutoDAN and PAIR are loaded in bfloat16. |
| Software Dependencies | No | We tokenize the data using the Llama2 tokenizer. |
| Experiment Setup | Yes | GCG (Zou et al., 2023) (x_jailbreak = x_malicious s_{1:l}). Adapting the original settings from Zou et al. (2023), we set (i) search width to 512, (ii) number of steps to 500, (iii) optimized suffix length to 20, (iv) early stopping loss to 0.05. |
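The "Open Datasets" row mentions the paper's core tool: a lightweight bigram LM built from token counts over Dolma, cheap enough to run without any GPU. A minimal sketch of such a model is below; the function names, the add-alpha smoothing choice, and the toy vocabulary size are illustrative assumptions, not the authors' exact construction.

```python
from collections import Counter
import math


def train_bigram_lm(token_ids):
    """Count unigram and bigram frequencies from a token-ID stream.

    Hypothetical sketch: the paper builds its counts over ~1T Dolma
    tokens (Llama-2 tokenizer); here we just count a small list.
    """
    unigrams = Counter(token_ids)
    bigrams = Counter(zip(token_ids, token_ids[1:]))
    return unigrams, bigrams


def bigram_perplexity(token_ids, unigrams, bigrams, vocab_size, alpha=1.0):
    """Perplexity under an add-alpha-smoothed bigram model (CPU only)."""
    log_prob = 0.0
    n = len(token_ids) - 1  # number of bigram transitions scored
    for prev, cur in zip(token_ids, token_ids[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / n)


# Toy usage: sequences with unseen bigrams score much higher perplexity,
# which is how such a model can flag optimizer-generated jailbreak suffixes.
uni, bi = train_bigram_lm([1, 2, 3, 1, 2, 3, 1, 2, 3])
ppl_natural = bigram_perplexity([1, 2, 3], uni, bi, vocab_size=5)
ppl_gibberish = bigram_perplexity([3, 2, 1], uni, bi, vocab_size=5)
```

In this threat model, adversarial suffixes produced by discrete optimization (e.g. GCG) tend to have far higher bigram perplexity than natural text, which makes the filter both cheap and interpretable.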