reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Authors: Maksym Andriushchenko, francesco croce, Nicolas Flammarion

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We measure the attack success rate for the leading safety-aligned LLMs on the set of 50 harmful requests from Adv Bench (Zou et al., 2023) curated by Chao et al. (2023). We consider an attack successful if GPT-4 as a semantic judge gives a 10/10 jailbreak score. In this work, we examine the safety of leading safety-aligned LLMs in terms of robustness to jailbreaks. We show that it is feasible to leverage the information available about each model, derived from training details or inference (e.g., logprobs), to construct simple adaptive attacks
Researcher Affiliation	Academia	Maksym Andriushchenko EPFL Francesco Croce EPFL Nicolas Flammarion EPFL
Pseudocode	Yes	Algorithm 1 Random Search for Adversarial Suffix Optimization. Require: Original request x, target token t (default: Sure ), suffix length L (default: 25), iterations N (default: 10 000) Ensure: Optimized adversarial suffix s
Open Source Code	Yes	For reproducibility purposes, we provide the code, logs, and jailbreak artifacts in the Jailbreak Bench format at https://github.com/tml-epfl/llm-adaptive-attacks.
Open Datasets	Yes	using the dataset of 50 harmful requests from Adv Bench (Zou et al., 2023) curated by Chao et al. (2023)
Dataset Splits	Yes	We optimize the trigger on batches of prompts from the available training set (we use only a small fraction of all training examples), and select the best performing trigger on an a validation set.
Hardware Specification	Yes	In terms of wall-clock time, 4000 iterations of random search on Llama-3-8B take 20.9 minutes on a single A100 GPU
Software Dependencies	No	No specific software versions are mentioned. The paper only mentions 'Hugging Face transformers' without a version number, which is insufficient for reproducibility.
Experiment Setup	Yes	Our main tool consists of a manually designed prompt template which is used for all unsafe requests for a given model enhanced by an adversarial suffix found with random search (Rastrigin, 1963) when the logprobs of the generated tokens are at least partially accessible... We use adversarial suffixes initialized with 25 tokens... we use up to 10 000 iterations and up to 10 random restarts, although in most cases a single restart suffices.