Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Authors: Maksym Andriushchenko, francesco croce, Nicolas Flammarion

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We measure the attack success rate for the leading safety-aligned LLMs on the set of 50 harmful requests from Adv Bench (Zou et al., 2023) curated by Chao et al. (2023). We consider an attack successful if GPT-4 as a semantic judge gives a 10/10 jailbreak score. In this work, we examine the safety of leading safety-aligned LLMs in terms of robustness to jailbreaks. We show that it is feasible to leverage the information available about each model, derived from training details or inference (e.g., logprobs), to construct simple adaptive attacks
Researcher Affiliation Academia Maksym Andriushchenko EPFL Francesco Croce EPFL Nicolas Flammarion EPFL
Pseudocode Yes Algorithm 1 Random Search for Adversarial Suffix Optimization. Require: Original request x, target token t (default: Sure ), suffix length L (default: 25), iterations N (default: 10 000) Ensure: Optimized adversarial suffix s
Open Source Code Yes For reproducibility purposes, we provide the code, logs, and jailbreak artifacts in the Jailbreak Bench format at https://github.com/tml-epfl/llm-adaptive-attacks.
Open Datasets Yes using the dataset of 50 harmful requests from Adv Bench (Zou et al., 2023) curated by Chao et al. (2023)
Dataset Splits Yes We optimize the trigger on batches of prompts from the available training set (we use only a small fraction of all training examples), and select the best performing trigger on an a validation set.
Hardware Specification Yes In terms of wall-clock time, 4000 iterations of random search on Llama-3-8B take 20.9 minutes on a single A100 GPU
Software Dependencies No No specific software versions are mentioned. The paper only mentions 'Hugging Face transformers' without a version number, which is insufficient for reproducibility.
Experiment Setup Yes Our main tool consists of a manually designed prompt template which is used for all unsafe requests for a given model enhanced by an adversarial suffix found with random search (Rastrigin, 1963) when the logprobs of the generated tokens are at least partially accessible... We use adversarial suffixes initialized with 25 tokens... we use up to 10 000 iterations and up to 10 random restarts, although in most cases a single restart suffices.