h4rm3l: A Language for Composable Jailbreak Attack Synthesis
Authors: Moussa Koulako Bala Doumbouya, Ananjan Nandi, Gabriel Poesia, Davide Ghilardi, Anna Goldie, Federico Bianchi, Dan Jurafsky, Christopher Manning
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate h4rm3l's efficacy by synthesizing a dataset of 2656 successful novel jailbreak attacks targeting 6 SOTA open-source and proprietary LLMs (GPT-3.5, GPT-4o, Claude-3-Sonnet, Claude-3-Haiku, Llama-3-8B, and Llama-3-70B), and by benchmarking those models against a subset of these synthesized attacks. Our results show that h4rm3l's synthesized attacks are diverse and more successful than existing jailbreak attacks in literature, with success rates exceeding 90% on SOTA LLMs. |
| Researcher Affiliation | Academia | Moussa Koulako Bala Doumbouya, Ananjan Nandi, Gabriel Poesia, Davide Ghilardi, Anna Goldie, Federico Bianchi, Dan Jurafsky, Christopher D. Manning. Department of Computer Science, 353 Jane Stanford Way; Stanford, CA 94305 EMAIL |
| Pseudocode | Yes | Algorithm 1: SynthesizePrograms(method, primitives, initialExamples, D_illicit, N_iters) |
| Open Source Code | Yes | (v) Open-source automated black-box LLM redteaming software for synthesizing targeted attacks and benchmarking LLMs for safety. In our red-teaming experiments, h4rm3l generated several attacks exceeding 90% ASR against SOTA proprietary LLMs such as Anthropic's Claude-3-Sonnet, which previously had few known safety vulnerabilities, and OpenAI's GPT-4o, which was very recently released. We also show that the most effective attacks targeting a given LLM are rarely as effective against the other LLMs, highlighting the need for targeted jailbreak attack synthesis methods such as h4rm3l. |
| Open Datasets | Yes | (iii) A dataset of 15,891 novel jailbreak attacks, including 2,656 attacks with estimated ASR between 40% and 100%, along with qualitative analysis showing their diversity and specificity to their target LLM. [...] Datasets of h4rm3l programs such as the ones we hereby release serve as basis for reproducible controlled experimentation and benchmarking. |
| Dataset Splits | Yes | k_illicit = 5 illicit prompts are sampled from the AdvBench dataset to evaluate the ASR of each proposal. [...] We report ASR estimates over a set of 50 illicit prompts sampled from AdvBench. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware details such as CPU, GPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like Python (implicitly, as h4rm3l is embedded in Python), GPT-4, GPT-3.5, Llama 2 7B, CodeBERT, and numpy. However, it does not provide specific version numbers for any of these libraries or tools, which is required for a reproducible description of software dependencies. |
| Experiment Setup | Yes | Our proposed program synthesis algorithms aim to maximize the ASR of synthesized programs targeting a particular LLM. In each iteration, an auxiliary LLM is prompted with k_examples = 15 few-shot examples of programs selected from a pool of examples to generate N_proposals = 20 novel proposals which are scored and recorded (see generateProposals in Algorithm 1). k_illicit = 5 illicit prompts are sampled from the AdvBench dataset to evaluate the ASR of each proposal. We compare three program synthesis approaches that only differ in their few-shot example selection methods. [...] This algorithm also uses the following hyperparameters: k_examples = 15 (few-shot examples sample size per iteration), k_illicit = 5 (illicit prompt sample size for ASR estimation), and λ, which scales the parameters of the Beta distribution P(s, y, λ) used by our ASR Rewarded Bandits method for example selection. |
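
The Experiment Setup row names two numeric steps: estimating ASR from k_illicit = 5 sampled prompts, and choosing k_examples = 15 few-shot examples with a Beta distribution scaled by λ. The sketch below illustrates only that arithmetic. It is not the paper's implementation: the excerpt does not give the form of P(s, y, λ), so the parameterization Beta(λ·s, λ·(1−s)) is an assumption, the role of y is omitted, and the names `estimate_asr`, `select_examples`, and `lam` are hypothetical.

```python
import random


def estimate_asr(outcomes):
    """Fraction of sampled prompts on which a proposal succeeded.

    `outcomes` is a list of booleans, one per sampled prompt
    (5 per proposal in the quoted setup).
    """
    return sum(outcomes) / len(outcomes)


def select_examples(pool, k_examples=15, lam=10.0, rng=None):
    """Pick up to k_examples items from `pool`, favouring high scores.

    `pool` is a list of (name, score) pairs with score in [0, 1].
    ASSUMPTION: each item draws from Beta(lam * s, lam * (1 - s));
    the source only says that lambda scales the Beta parameters.
    """
    rng = rng or random.Random()
    eps = 1e-3  # keeps both Beta parameters strictly positive
    draws = []
    for name, score in pool:
        alpha = lam * score + eps
        beta = lam * (1.0 - score) + eps
        draws.append((rng.betavariate(alpha, beta), name))
    # Keep the items with the largest sampled values.
    draws.sort(reverse=True)
    return [name for _, name in draws[:k_examples]]


if __name__ == "__main__":
    pool = [("program_%d" % i, i / 20) for i in range(20)]
    print(select_examples(pool, rng=random.Random(0)))
    print(estimate_asr([True, False, True, True, False]))
```

With only five prompts per proposal, each estimate can take just six values (0%, 20%, ..., 100%), which is presumably why the reported ASR figures use a larger set of 50 prompts. Under the assumed parameterization, a larger `lam` concentrates each draw around the recorded score, so selection becomes greedier; a smaller `lam` gives low-scoring examples more chances to be chosen.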