h4rm3l: A Language for Composable Jailbreak Attack Synthesis

Authors: Moussa Koulako Bala Doumbouya, Ananjan Nandi, Gabriel Poesia, Davide Ghilardi, Anna Goldie, Federico Bianchi, Dan Jurafsky, Christopher Manning

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate h4rm3l's efficacy by synthesizing a dataset of 2656 successful novel jailbreak attacks targeting 6 SOTA open-source and proprietary LLMs (GPT-3.5, GPT-4o, Claude-3-Sonnet, Claude-3-Haiku, Llama-3-8B, and Llama-3-70B), and by benchmarking those models against a subset of these synthesized attacks. Our results show that h4rm3l's synthesized attacks are diverse and more successful than existing jailbreak attacks in literature, with success rates exceeding 90% on SOTA LLMs.
Researcher Affiliation | Academia | Moussa Koulako Bala Doumbouya, Ananjan Nandi, Gabriel Poesia, Davide Ghilardi, Anna Goldie, Federico Bianchi, Dan Jurafsky, Christopher D. Manning; Department of Computer Science, 353 Jane Stanford Way, Stanford, CA 94305; EMAIL
Pseudocode | Yes | Algorithm 1: SynthesizePrograms(method, primitives, initialExamples, D_illicit, N_iters)
Open Source Code | Yes | (v) Open-source automated black-box LLM red-teaming software for synthesizing targeted attacks and benchmarking LLMs for safety. In our red-teaming experiments, h4rm3l generated several attacks exceeding 90% ASR against SOTA proprietary LLMs such as Anthropic's Claude-3-Sonnet, which previously had few known safety vulnerabilities, and OpenAI's GPT-4o, which was very recently released. We also show that the most effective attacks targeting a given LLM are rarely as effective against the other LLMs, highlighting the need for targeted jailbreak attack synthesis methods such as h4rm3l.
Open Datasets | Yes | (iii) A dataset of 15,891 novel jailbreak attacks, including 2,656 attacks with estimated ASR between 40% and 100%, along with qualitative analysis showing their diversity and specificity to their target LLM. [...] Datasets of h4rm3l programs such as the ones we hereby release serve as a basis for reproducible controlled experimentation and benchmarking.
Dataset Splits | Yes | k_illicit = 5 illicit prompts are sampled from the AdvBench dataset to evaluate the ASR of each proposal. [...] We report ASR estimates over a set of 50 illicit prompts sampled from AdvBench.
Hardware Specification | No | The paper does not explicitly mention specific hardware details such as CPU or GPU models, or the memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions software components such as Python (implicitly, since h4rm3l is embedded in Python), GPT-4, GPT-3.5, Llama 2 7B, CodeBERT, and numpy. However, it does not provide specific version numbers for any of these libraries or tools, which a reproducible description of software dependencies requires.
Experiment Setup | Yes | Our proposed program synthesis algorithms aim to maximize the ASR of synthesized programs targeting a particular LLM. In each iteration, an auxiliary LLM is prompted with k_examples = 15 few-shot examples of programs selected from a pool of examples to generate N_proposals = 20 novel proposals which are scored and recorded (See generateProposals in Algorithm 1). k_illicit = 5 illicit prompts are sampled from the AdvBench dataset to evaluate the ASR of each proposal. We compare three program synthesis approaches that only differ in their few-shot example selection methods. [...] This algorithm also uses the following hyperparameters: k_examples = 15 (few-shot examples sample size per iteration), k_illicit = 5 (illicit prompt sample size for ASR estimation), and λ, which scales the parameters of the Beta distribution P(s, y, λ) used by our ASR Rewarded Bandits method for example selection.
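The core idea behind h4rm3l programs is that each attack is a composition of parameterized prompt-to-prompt string transformations. The sketch below illustrates that composition pattern only; the primitive names (`reverse_words`, `b64`, `wrap_roleplay`) and the `compose` helper are illustrative assumptions, not the library's actual API.

```python
import base64

# Illustrative prompt transformations (assumed, not h4rm3l's real primitives).
def reverse_words(p: str) -> str:
    """Reorder the words of the prompt."""
    return " ".join(reversed(p.split()))

def b64(p: str) -> str:
    """Base64-encode the prompt text."""
    return base64.b64encode(p.encode()).decode()

def wrap_roleplay(p: str) -> str:
    """Embed the (transformed) prompt in a framing template."""
    return f"You are DebugBot. Decode and answer: {p}"

def compose(*transforms):
    """Chain transformations left to right into a single program."""
    def program(p: str) -> str:
        for t in transforms:
            p = t(p)
        return p
    return program

# A "program" in this sketch is just a composed transformation pipeline.
attack = compose(reverse_words, b64, wrap_roleplay)
print(attack("example prompt"))
```

Because programs are plain function compositions, a synthesizer can enumerate or mutate pipelines mechanically, which is what makes datasets of such programs a reproducible basis for benchmarking.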
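The synthesis loop described in the Experiment Setup row can be sketched as follows. This is a minimal mock, assuming the paper's hyperparameters (k_examples = 15, N_proposals = 20, k_illicit = 5, λ): the auxiliary-LLM proposal step and the ASR scoring against the target LLM are replaced by random stand-ins, and the Beta-sampled example selection is a plausible reading of the ASR Rewarded Bandits method, not the authors' exact implementation.

```python
import random

random.seed(0)

K_EXAMPLES = 15   # few-shot example programs per iteration (paper: k_examples)
N_PROPOSALS = 20  # proposals generated per iteration (paper: N_proposals)
K_ILLICIT = 5     # illicit prompts sampled to estimate ASR (paper: k_illicit)
LAMBDA = 10.0     # scales the Beta-distribution parameters (paper: lambda)

def propose_programs(examples, n):
    """Stand-in for the auxiliary LLM: given few-shot example programs,
    emit n novel program proposals (here, trivially mutated strings)."""
    return [f"compose({random.choice(examples)[0]}, prim{random.randrange(8)})"
            for _ in range(n)]

def estimate_asr(program, illicit_prompts):
    """Stand-in for scoring: transform each sampled illicit prompt with the
    program, query the target LLM, and count harmful responses."""
    return random.random()  # placeholder ASR estimate in [0, 1]

def select_examples_bandit(pool, k, lam):
    """Assumed bandit selection: draw a Thompson-style score for each
    recorded program from Beta(lam*asr + 1, lam*(1 - asr) + 1), keep top k."""
    scored = [(random.betavariate(lam * asr + 1, lam * (1 - asr) + 1), prog, asr)
              for prog, asr in pool]
    scored.sort(reverse=True)
    return [(prog, asr) for _, prog, asr in scored[:k]]

def synthesize_programs(initial_examples, illicit_dataset, n_iters):
    pool = list(initial_examples)  # (program, estimated ASR) pairs
    for _ in range(n_iters):
        examples = select_examples_bandit(pool, K_EXAMPLES, LAMBDA)
        proposals = propose_programs(examples, N_PROPOSALS)
        prompts = random.sample(illicit_dataset, K_ILLICIT)
        for prog in proposals:
            pool.append((prog, estimate_asr(prog, prompts)))
    # Return all recorded programs, highest estimated ASR first.
    return sorted(pool, key=lambda pa: pa[1], reverse=True)

top = synthesize_programs(
    initial_examples=[(f"prim{i}", 0.1) for i in range(K_EXAMPLES)],
    illicit_dataset=[f"illicit prompt {i}" for i in range(50)],
    n_iters=3,
)
print(top[0])
```

The Beta(λ·s + 1, λ·(1 − s) + 1) sampling trades exploration for exploitation: programs with high estimated ASR are usually re-selected as few-shot examples, but low-scoring programs retain a nonzero chance of being revisited.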