An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks

Authors: Valentyn Boreiko, Alexander Panfilov, Vaclav Voracek, Matthias Hein, Jonas Geiping

ICML 2025

Reproducibility Variable | Result | Supporting Quote
Research Type | Experimental | "After an extensive comparison, we find attack success rates against safety-tuned modern models to be lower than previously presented, and that attacks based on discrete optimization significantly outperform recent LLM-based attacks."
Researcher Affiliation | Academia | "1University of Tübingen, 2Tübingen AI Center, 3Max Planck Institute for Intelligent Systems, 4ELLIS Institute Tübingen."
Pseudocode | Yes | "Algorithm 1: Adaptive GCG"
Open Source Code | Yes | "Our code is available at GitHub."
Open Datasets | Yes | "In Section 3, we construct a lightweight bigram LM from the Dolma dataset based on 1T tokens, which does not require any GPU RAM for inference."
Dataset Splits | No | "We take a subset of Dolma (Soldaini et al., 2024), consisting of MegaWika, Project Gutenberg, Stack Exchange, arXiv, Reddit, StarCoder, and RefinedWeb, which we split into Dtrain and Dval."
Hardware Specification | No | "For PRS, GCG, and BEAST, all target models are loaded in float16. Due to GPU RAM constraints, both the target models and the auxiliary models specific to AutoDAN and PAIR are loaded in bfloat16."
Software Dependencies | No | "We tokenize the data using the Llama2 tokenizer."
Experiment Setup | Yes | "GCG (Zou et al., 2023) (x_jailbreak = x_malicious s_{1:l}, i.e., the malicious prompt followed by an optimized suffix of length l). Adapting the original settings from Zou et al. (2023), we set (i) search width to 512, (ii) number of steps to 500, (iii) optimized suffix length to 20, (iv) early stopping loss to 0.05."
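The experiment-setup row above lists the GCG search hyperparameters (search width, step budget, suffix length, early-stopping loss). The overall greedy search loop can be sketched as follows. This is an illustrative toy only: real GCG ranks candidate single-token swaps by gradients through the model's token embeddings, whereas this sketch proposes substitutions at random, and `loss_fn` is a stand-in for the target model's loss on the desired response.

```python
import random

def adaptive_gcg_sketch(loss_fn, vocab, suffix_len=20, steps=500,
                        search_width=512, early_stop_loss=0.05, seed=0):
    """Toy GCG-style greedy coordinate search (random proposals, not
    gradient-ranked ones as in the actual attack)."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best_loss = loss_fn(suffix)
    for _ in range(steps):
        # Propose `search_width` candidates, each differing from the
        # current suffix in a single randomly chosen token position.
        candidates = []
        for _ in range(search_width):
            cand = list(suffix)
            cand[rng.randrange(suffix_len)] = rng.choice(vocab)
            candidates.append(cand)
        # Greedily keep the lowest-loss candidate if it improves.
        loss, cand = min(((loss_fn(c), c) for c in candidates),
                         key=lambda t: t[0])
        if loss < best_loss:
            best_loss, suffix = loss, cand
        if best_loss < early_stop_loss:  # early stopping criterion
            break
    return suffix, best_loss
```

With a toy loss (fraction of positions differing from a fixed target string), the loop recovers the target, which illustrates why the early-stopping loss of 0.05 terminates the search once the optimized suffix is good enough.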
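The open-datasets row above describes a lightweight bigram LM built from Dolma token counts, which needs no GPU for inference. A minimal sketch of such a count-based model is given below; the add-alpha smoothing is an illustrative assumption, since the paper's exact smoothing scheme is not quoted here.

```python
import math
from collections import Counter

def build_bigram_lm(corpus_tokens, alpha=1.0):
    """Build a bigram log-probability function from raw counts.
    `alpha` is add-alpha smoothing (an assumption for this sketch)."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab_size = len(unigrams)

    def logprob(prev, tok):
        # Smoothed conditional probability P(tok | prev).
        return math.log((bigrams[(prev, tok)] + alpha) /
                        (unigrams[prev] + alpha * vocab_size))
    return logprob

def bigram_perplexity(logprob, tokens):
    """Perplexity of a token sequence under the bigram model."""
    lp = sum(logprob(p, t) for p, t in zip(tokens, tokens[1:]))
    return math.exp(-lp / (len(tokens) - 1))
```

Scoring candidate jailbreak strings with such a model gives an interpretable fluency signal: sequences close to the training distribution receive low perplexity, while optimizer-produced token soup scores much higher.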