An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks

Authors: Valentyn Boreiko, Alexander Panfilov, Vaclav Voracek, Matthias Hein, Jonas Geiping

ICML 2025

Reproducibility Variable | Result | Supporting Quote
Research Type | Experimental | "After an extensive comparison, we find attack success rates against safety-tuned modern models to be lower than previously presented, and that attacks based on discrete optimization significantly outperform recent LLM-based attacks."
Researcher Affiliation | Academia | "1University of Tübingen, 2Tübingen AI Center, 3Max Planck Institute for Intelligent Systems, 4ELLIS Institute Tübingen."
Pseudocode | Yes | "Algorithm 1: Adaptive GCG"
Open Source Code | Yes | "Our code is available at GitHub."
Open Datasets | Yes | "In Section 3, we construct a lightweight bigram LM from the Dolma dataset based on 1T tokens, which does not require any GPU RAM for inference."
Dataset Splits | No | "We take a subset of Dolma (Soldaini et al., 2024), consisting of MegaWika, Project Gutenberg, Stack Exchange, arXiv, Reddit, StarCoder, and RefinedWeb, which we split into Dtrain and Dval."
Hardware Specification | No | "For PRS, GCG, and BEAST, all target models are loaded in float16. Due to GPU RAM constraints, both the target models and the auxiliary models specific to AutoDAN and PAIR are loaded in bfloat16."
Software Dependencies | No | "We tokenize the data using the Llama2 tokenizer."
Experiment Setup | Yes | "GCG (Zou et al., 2023) (x_jailbreak = x_malicious s_{1:l}, i.e., the malicious prompt followed by an optimized suffix of length l). Adapting the original settings from Zou et al. (2023), we set (i) search width to 512, (ii) number of steps to 500, (iii) optimized suffix length to 20, (iv) early stopping loss to 0.05."
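The experiment-setup row above lists the GCG search hyperparameters (search width, step budget, suffix length, early-stopping loss). The overall greedy search loop can be sketched as follows. This is an illustrative toy only: real GCG ranks candidate single-token swaps by gradients through the model's token embeddings, whereas this sketch proposes substitutions at random, and `loss_fn` is a stand-in for the target model's loss on the desired response.

```python
import random

def adaptive_gcg_sketch(loss_fn, vocab, suffix_len=20, steps=500,
                        search_width=512, early_stop_loss=0.05, seed=0):
    """Toy GCG-style greedy coordinate search (random proposals, not
    gradient-ranked ones as in the actual attack)."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best_loss = loss_fn(suffix)
    for _ in range(steps):
        # Propose `search_width` candidates, each differing from the
        # current suffix in a single randomly chosen token position.
        candidates = []
        for _ in range(search_width):
            cand = list(suffix)
            cand[rng.randrange(suffix_len)] = rng.choice(vocab)
            candidates.append(cand)
        # Greedily keep the lowest-loss candidate if it improves.
        loss, cand = min(((loss_fn(c), c) for c in candidates),
                         key=lambda t: t[0])
        if loss < best_loss:
            best_loss, suffix = loss, cand
        if best_loss < early_stop_loss:  # early stopping criterion
            break
    return suffix, best_loss
```

With a toy loss (fraction of positions differing from a fixed target string), the loop recovers the target, which illustrates why the early-stopping loss of 0.05 terminates the search once the optimized suffix is good enough.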
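The open-datasets row above describes a lightweight bigram LM built from Dolma token counts, which needs no GPU for inference. A minimal sketch of such a count-based model is given below; the add-alpha smoothing is an illustrative assumption, since the paper's exact smoothing scheme is not quoted here.

```python
import math
from collections import Counter

def build_bigram_lm(corpus_tokens, alpha=1.0):
    """Build a bigram log-probability function from raw counts.
    `alpha` is add-alpha smoothing (an assumption for this sketch)."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab_size = len(unigrams)

    def logprob(prev, tok):
        # Smoothed conditional probability P(tok | prev).
        return math.log((bigrams[(prev, tok)] + alpha) /
                        (unigrams[prev] + alpha * vocab_size))
    return logprob

def bigram_perplexity(logprob, tokens):
    """Perplexity of a token sequence under the bigram model."""
    lp = sum(logprob(p, t) for p, t in zip(tokens, tokens[1:]))
    return math.exp(-lp / (len(tokens) - 1))
```

Scoring candidate jailbreak strings with such a model gives an interpretable fluency signal: sequences close to the training distribution receive low perplexity, while optimizer-produced token soup scores much higher.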