Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Authors: Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We measure the attack success rate for the leading safety-aligned LLMs on the set of 50 harmful requests from AdvBench (Zou et al., 2023) curated by Chao et al. (2023). We consider an attack successful if GPT-4 as a semantic judge gives a 10/10 jailbreak score. In this work, we examine the safety of leading safety-aligned LLMs in terms of robustness to jailbreaks. We show that it is feasible to leverage the information available about each model, derived from training details or inference (e.g., logprobs), to construct simple adaptive attacks. |
| Researcher Affiliation | Academia | Maksym Andriushchenko EPFL Francesco Croce EPFL Nicolas Flammarion EPFL |
| Pseudocode | Yes | Algorithm 1: Random Search for Adversarial Suffix Optimization. Require: original request x, target token t (default: "Sure"), suffix length L (default: 25), iterations N (default: 10,000). Ensure: optimized adversarial suffix s. |
| Open Source Code | Yes | For reproducibility purposes, we provide the code, logs, and jailbreak artifacts in the Jailbreak Bench format at https://github.com/tml-epfl/llm-adaptive-attacks. |
| Open Datasets | Yes | using the dataset of 50 harmful requests from AdvBench (Zou et al., 2023) curated by Chao et al. (2023) |
| Dataset Splits | Yes | We optimize the trigger on batches of prompts from the available training set (we use only a small fraction of all training examples), and select the best-performing trigger on a validation set. |
| Hardware Specification | Yes | In terms of wall-clock time, 4000 iterations of random search on Llama-3-8B take 20.9 minutes on a single A100 GPU |
| Software Dependencies | No | No specific software versions are mentioned. The paper only mentions 'Hugging Face transformers' without a version number, which is insufficient for reproducibility. |
| Experiment Setup | Yes | Our main tool consists of a manually designed prompt template which is used for all unsafe requests for a given model enhanced by an adversarial suffix found with random search (Rastrigin, 1963) when the logprobs of the generated tokens are at least partially accessible... We use adversarial suffixes initialized with 25 tokens... we use up to 10 000 iterations and up to 10 random restarts, although in most cases a single restart suffices. |
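The random-search procedure quoted above (Algorithm 1) can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation: the `score` callback stands in for the model logprob of the target token (e.g., "Sure"), the integer vocabulary and the toy objective are placeholders, and the chunk-mutation size is an assumed detail; only the overall loop structure (random suffix, random mutation, greedy acceptance) follows the algorithm as described.

```python
import random

def random_search(score, vocab, length=25, iters=10_000, seed=0):
    """Sketch of random search for adversarial suffix optimization.

    `score` maps a suffix (list of tokens) to a scalar to maximize;
    in the paper this would be the logprob of the target token,
    here it is any caller-supplied stand-in.
    """
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(length)]
    best = score(suffix)
    for _ in range(iters):
        # Resample a small random contiguous chunk of the suffix
        # (chunk size 1-4 is an assumption for illustration).
        i = rng.randrange(length)
        j = min(length, i + rng.randint(1, 4))
        cand = suffix[:i] + [rng.choice(vocab) for _ in range(j - i)] + suffix[j:]
        s = score(cand)
        if s >= best:  # greedy accept; ties allowed to keep exploring
            suffix, best = cand, s
    return suffix, best

# Toy stand-in objective: count occurrences of token 7 in the suffix.
vocab = list(range(32))
suffix, best = random_search(lambda s: s.count(7), vocab, length=10, iters=2000)
print(best)  # increases with the iteration budget
```

In the actual attack, `score` would query the target model's logprobs (hence the paper's note that the method applies "when the logprobs of the generated tokens are at least partially accessible"), and the search is restarted up to 10 times with fresh random suffixes.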