h4rm3l: A Language for Composable Jailbreak Attack Synthesis

Authors: Moussa Koulako Bala Doumbouya, Ananjan Nandi, Gabriel Poesia, Davide Ghilardi, Anna Goldie, Federico Bianchi, Dan Jurafsky, Christopher Manning

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate h4rm3l's efficacy by synthesizing a dataset of 2656 successful novel jailbreak attacks targeting 6 SOTA open-source and proprietary LLMs (GPT-3.5, GPT-4o, Claude-3-Sonnet, Claude-3-Haiku, Llama-3-8B, and Llama-3-70B), and by benchmarking those models against a subset of these synthesized attacks. Our results show that h4rm3l's synthesized attacks are diverse and more successful than existing jailbreak attacks in literature, with success rates exceeding 90% on SOTA LLMs.
Researcher Affiliation | Academia | Moussa Koulako Bala Doumbouya, Ananjan Nandi, Gabriel Poesia, Davide Ghilardi, Anna Goldie, Federico Bianchi, Dan Jurafsky, Christopher D. Manning; Department of Computer Science, 353 Jane Stanford Way, Stanford, CA 94305; EMAIL
Pseudocode | Yes | Algorithm 1: SynthesizePrograms(method, primitives, initialExamples, D_illicit, N_iters)
Open Source Code | Yes | (v) Open-source automated black-box LLM red-teaming software for synthesizing targeted attacks and benchmarking LLMs for safety. In our red-teaming experiments, h4rm3l generated several attacks exceeding 90% ASR against SOTA proprietary LLMs such as Anthropic's Claude-3-Sonnet, which previously had few known safety vulnerabilities, and OpenAI's GPT-4o, which was very recently released. We also show that the most effective attacks targeting a given LLM are rarely as effective against the other LLMs, highlighting the need for targeted jailbreak attack synthesis methods such as h4rm3l.
Open Datasets | Yes | (iii) A dataset of 15,891 novel jailbreak attacks, including 2,656 attacks with estimated ASR between 40% and 100%, along with qualitative analysis showing their diversity and specificity to their target LLM. [...] Datasets of h4rm3l programs such as the ones we hereby release serve as a basis for reproducible controlled experimentation and benchmarking.
Dataset Splits | Yes | k_illicit = 5 illicit prompts are sampled from the AdvBench dataset to evaluate the ASR of each proposal. [...] We report ASR estimates over a set of 50 illicit prompts sampled from AdvBench.
Hardware Specification | No | The paper does not explicitly mention specific hardware details such as CPU or GPU models, or the memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions software components such as Python (implicitly, since h4rm3l is embedded in Python), GPT-4, GPT-3.5, Llama 2 7B, CodeBERT, and numpy. However, it does not provide specific version numbers for any of these libraries or tools, which a reproducible description of software dependencies requires.
Experiment Setup | Yes | Our proposed program synthesis algorithms aim to maximize the ASR of synthesized programs targeting a particular LLM. In each iteration, an auxiliary LLM is prompted with k_examples = 15 few-shot examples of programs selected from a pool of examples to generate N_proposals = 20 novel proposals which are scored and recorded (See generateProposals in Algorithm 1). k_illicit = 5 illicit prompts are sampled from the AdvBench dataset to evaluate the ASR of each proposal. We compare three program synthesis approaches that only differ in their few-shot example selection methods. [...] This algorithm also uses the following hyperparameters: k_examples = 15 (few-shot examples sample size per iteration), k_illicit = 5 (illicit prompt sample size for ASR estimation), and λ, which scales the parameters of the Beta distribution P(s, y, λ) used by our ASR Rewarded Bandits method for example selection.
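The core idea behind h4rm3l programs is that each attack is a composition of parameterized prompt-to-prompt string transformations. The sketch below illustrates that composition pattern only; the primitive names (`reverse_words`, `b64`, `wrap_roleplay`) and the `compose` helper are illustrative assumptions, not the library's actual API.

```python
import base64

# Illustrative prompt transformations (assumed, not h4rm3l's real primitives).
def reverse_words(p: str) -> str:
    """Reorder the words of the prompt."""
    return " ".join(reversed(p.split()))

def b64(p: str) -> str:
    """Base64-encode the prompt text."""
    return base64.b64encode(p.encode()).decode()

def wrap_roleplay(p: str) -> str:
    """Embed the (transformed) prompt in a framing template."""
    return f"You are DebugBot. Decode and answer: {p}"

def compose(*transforms):
    """Chain transformations left to right into a single program."""
    def program(p: str) -> str:
        for t in transforms:
            p = t(p)
        return p
    return program

# A "program" in this sketch is just a composed transformation pipeline.
attack = compose(reverse_words, b64, wrap_roleplay)
print(attack("example prompt"))
```

Because programs are plain function compositions, a synthesizer can enumerate or mutate pipelines mechanically, which is what makes datasets of such programs a reproducible basis for benchmarking.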
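The synthesis loop described in the Experiment Setup row can be sketched as follows. This is a minimal mock, assuming the paper's hyperparameters (k_examples = 15, N_proposals = 20, k_illicit = 5, λ): the auxiliary-LLM proposal step and the ASR scoring against the target LLM are replaced by random stand-ins, and the Beta-sampled example selection is a plausible reading of the ASR Rewarded Bandits method, not the authors' exact implementation.

```python
import random

random.seed(0)

K_EXAMPLES = 15   # few-shot example programs per iteration (paper: k_examples)
N_PROPOSALS = 20  # proposals generated per iteration (paper: N_proposals)
K_ILLICIT = 5     # illicit prompts sampled to estimate ASR (paper: k_illicit)
LAMBDA = 10.0     # scales the Beta-distribution parameters (paper: lambda)

def propose_programs(examples, n):
    """Stand-in for the auxiliary LLM: given few-shot example programs,
    emit n novel program proposals (here, trivially mutated strings)."""
    return [f"compose({random.choice(examples)[0]}, prim{random.randrange(8)})"
            for _ in range(n)]

def estimate_asr(program, illicit_prompts):
    """Stand-in for scoring: transform each sampled illicit prompt with the
    program, query the target LLM, and count harmful responses."""
    return random.random()  # placeholder ASR estimate in [0, 1]

def select_examples_bandit(pool, k, lam):
    """Assumed bandit selection: draw a Thompson-style score for each
    recorded program from Beta(lam*asr + 1, lam*(1 - asr) + 1), keep top k."""
    scored = [(random.betavariate(lam * asr + 1, lam * (1 - asr) + 1), prog, asr)
              for prog, asr in pool]
    scored.sort(reverse=True)
    return [(prog, asr) for _, prog, asr in scored[:k]]

def synthesize_programs(initial_examples, illicit_dataset, n_iters):
    pool = list(initial_examples)  # (program, estimated ASR) pairs
    for _ in range(n_iters):
        examples = select_examples_bandit(pool, K_EXAMPLES, LAMBDA)
        proposals = propose_programs(examples, N_PROPOSALS)
        prompts = random.sample(illicit_dataset, K_ILLICIT)
        for prog in proposals:
            pool.append((prog, estimate_asr(prog, prompts)))
    # Return all recorded programs, highest estimated ASR first.
    return sorted(pool, key=lambda pa: pa[1], reverse=True)

top = synthesize_programs(
    initial_examples=[(f"prim{i}", 0.1) for i in range(K_EXAMPLES)],
    illicit_dataset=[f"illicit prompt {i}" for i in range(50)],
    n_iters=3,
)
print(top[0])
```

The Beta(λ·s + 1, λ·(1 − s) + 1) sampling trades exploration for exploitation: programs with high estimated ASR are usually re-selected as few-shot examples, but low-scoring programs retain a nonzero chance of being revisited.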