Endless Jailbreaks with Bijection Learning

Authors: Brian R.Y. Huang, Maximilian Li, Leonard Tang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental
We evaluate bijection learning on a range of frontier models, including GPT and Claude models, and achieve state-of-the-art Attack Success Rate (ASR) measurements across models and attack datasets. Our attack adapts smoothly to model scale and shows a concrete trend of increasing efficacy with model strength. [...] 3 EXPERIMENTS [...] We report ASRs for bijection learning on frontier models: Claude 3 Haiku, Claude 3 Opus, Claude 3.5 Sonnet, GPT-4o-mini, and GPT-4o. We use the AdvBench-50 (Chao et al., 2023) and HarmBench (Mazeika et al., 2024) datasets of harmful attack intents.
Researcher Affiliation | Industry
Brian R.Y. Huang, Maximilian Li & Leonard Tang (Haize Labs)
Pseudocode | No
The paper describes the bijection learning method in Section 2, detailing its steps in prose, but it does not present those steps in a formally structured pseudocode block or algorithm environment.
Open Source Code | No
The paper does not provide any statement or link indicating that the source code for its methodology is publicly available. The Limitations section notes, "It remains to be seen whether bijection learning jailbreaks can be achieved zero-shot, with fewer input tokens, on open-source models finetuned on bijection language learning examples," implying the current work does not include an open-source code release.
Open Datasets | Yes
We use the AdvBench-50 (Chao et al., 2023) and HarmBench (Mazeika et al., 2024) datasets of harmful attack intents.
Dataset Splits | Yes
We report ASRs on the AdvBench-50 set and on the full HarmBench test set for a suite of frontier models. [...] We sample a subset of 35 intents from HarmBench with 5 intents from each risk category (HarmBench-35).
Hardware Specification | No
The paper mentions running evaluations on models but does not specify any hardware details such as GPU models, CPU types, or memory specifications used for these evaluations.
Software Dependencies | No
The paper mentions using specific models such as GPT-4o-mini as classifiers or targets, but it does not provide version numbers for any software libraries, frameworks, or programming languages used to implement the methodology.
Experiment Setup | Yes
We evaluate bijection learning with best-of-n sampling by selecting an attack budget n after which the ASR tapers off. [...] Table 1: We report ASRs on the AdvBench-50 set and on the full HarmBench test set for a suite of frontier models. For the ensemble baseline, we group together 11 previous encoding-based attacks and mark the ensemble of methods successful if any single attack succeeded for an intent. [...] We define the dispersion d of a bijection as the number of letters that do not map to themselves. [...] We define the encoding length ℓ of a bijection as the number of letters or numbers in each sequence in the codomain. [...] we evaluate the LLM on J(x) by generating a single response with temperature 0. [...] We select a fixed sequence of 10 translation examples, so our prompt template is deterministic up to the random bijective mapping (see Appendix A).
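To make the dispersion parameter concrete: for the simplest case of encoding length ℓ = 1 (letter-to-letter mappings), a bijection with dispersion d is a permutation of the alphabet with exactly d non-fixed points. The sketch below is illustrative only (the paper does not release code, and its full method also covers longer codomain sequences); all function names here are hypothetical.

```python
import random
import string

def sample_bijection(dispersion, rng=None):
    """Draw a random permutation of the lowercase alphabet with exactly
    `dispersion` non-fixed points (dispersion must be 0 or >= 2, since a
    single letter cannot move without a partner to swap with)."""
    rng = rng or random.Random()
    letters = list(string.ascii_lowercase)
    moved = rng.sample(letters, dispersion)
    # Rejection-sample a derangement of the chosen letters so none stays fixed.
    while True:
        shuffled = moved[:]
        rng.shuffle(shuffled)
        if all(a != b for a, b in zip(moved, shuffled)):
            break
    mapping = dict(zip(letters, letters))  # unchosen letters are fixed points
    mapping.update(zip(moved, shuffled))
    return mapping

def encode(text, mapping):
    # Apply the bijection character-wise; non-alphabetic characters pass through.
    return "".join(mapping.get(ch, ch) for ch in text.lower())
```

Decoding is just encoding with the inverse mapping, since the map is a bijection.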
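The best-of-n evaluation described above can be summarized in a few lines: an intent counts as jailbroken if any of its n attack attempts succeeds. This is an illustrative sketch, not the authors' evaluation harness; the function name and data layout are assumptions.

```python
def best_of_n_asr(outcomes):
    """Best-of-n Attack Success Rate. `outcomes` maps each harmful intent to
    its list of per-attempt jailbreak judgments (True = attempt succeeded).
    An intent counts as a success if any of its attempts succeeds."""
    jailbroken = sum(1 for attempts in outcomes.values() if any(attempts))
    return jailbroken / len(outcomes)
```

For example, `best_of_n_asr({"intent_a": [False, True], "intent_b": [False, False]})` returns 0.5: one of two intents was jailbroken within its attack budget.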