Endless Jailbreaks with Bijection Learning
Authors: Brian R.Y. Huang, Max Li, Leonard Tang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate bijection learning on a range of frontier models, including GPT and Claude models, and achieve state-of-the-art Attack Success Rate (ASR) measurements across models and attack datasets. Our attack adapts smoothly to model scale and shows a concrete trend of increasing efficacy with model strength. [...] 3 EXPERIMENTS [...] We report ASRs for bijection learning on frontier models: Claude 3 Haiku, Claude 3 Opus, Claude 3.5 Sonnet, GPT-4o-mini, and GPT-4o. We use the AdvBench-50 (Chao et al., 2023) and HarmBench (Mazeika et al., 2024) datasets of harmful attack intents. |
| Researcher Affiliation | Industry | Brian R.Y. Huang, Maximilian Li & Leonard Tang (Haize Labs) |
| Pseudocode | No | The paper describes the bijection learning method in Section 2, detailing steps in prose. However, it does not present these steps in a formally structured pseudocode block or algorithm environment. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for their methodology is publicly available. The Limitations section mentions, "It remains to be seen whether bijection learning jailbreaks can be achieved zero-shot, with fewer input tokens, on open-source models finetuned on bijection language learning examples.", implying their current work does not include open-source code release. |
| Open Datasets | Yes | We use the AdvBench-50 (Chao et al., 2023) and HarmBench (Mazeika et al., 2024) datasets of harmful attack intents. |
| Dataset Splits | Yes | We report ASRs on the AdvBench-50 set and on the full HarmBench test set for a suite of frontier models. [...] We sample a subset of 35 intents from HarmBench with 5 intents from each risk category (HarmBench-35). |
| Hardware Specification | No | The paper mentions running evaluations on models but does not specify any hardware details such as GPU models, CPU types, or memory specifications used for these evaluations. |
| Software Dependencies | No | The paper mentions using specific models like "GPT-4o-mini" as classifiers or targets but does not provide specific version numbers for any software libraries, frameworks, or programming languages used to implement their methodology. |
| Experiment Setup | Yes | We evaluate bijection learning with best-of-n sampling by selecting an attack budget n after which the ASR tapers off. [...] Table 1: We report ASRs on the AdvBench-50 set and on the full HarmBench test set for a suite of frontier models. For the ensemble baseline, we group together 11 previous encoding-based attacks and mark the ensemble of methods successful if any single attack succeeded for an intent. [...] We define the dispersion d of a bijection as the number of letters that do not map to themselves. [...] We define the encoding length ℓ of a bijection as the number of letters or numbers in each sequence in the codomain. [...] we evaluate the LLM on J(x) by generating a single response with temperature 0. [...] We select a fixed sequence of 10 translation examples, so our prompt template is deterministic up to the random bijective mapping (see Appendix A). |
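The dispersion parameter quoted above (the number of letters that do not map to themselves) can be illustrated with a small sketch. This is not the authors' implementation, and the function and variable names here are hypothetical; it only shows, under the paper's definition, how one might sample a letter-to-letter bijection with an exact dispersion d and verify round-trip encoding:

```python
import random
import string

def make_bijection(d, seed=None):
    """Sample a letter-to-letter bijection with dispersion exactly d,
    i.e. exactly d letters do not map to themselves (d = 1 is impossible)."""
    assert d == 0 or 2 <= d <= 26
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    mapping = {c: c for c in letters}  # start from the identity map
    moved = rng.sample(letters, d)     # choose which d letters to displace
    # A cyclic shift of the chosen letters guarantees none is a fixed point.
    for src, dst in zip(moved, moved[1:] + moved[:1]):
        mapping[src] = dst
    return mapping

def encode(text, mapping):
    # Characters outside the domain (spaces, digits) pass through unchanged.
    return "".join(mapping.get(c, c) for c in text)

def decode(text, mapping):
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(c, c) for c in text)

bij = make_bijection(d=10, seed=0)
msg = "describe the attack intent here"
assert decode(encode(msg, bij), bij) == msg
assert sum(k != v for k, v in bij.items()) == 10
```

Encoding length ℓ > 1 (mapping each letter to a multi-character sequence) would replace the codomain here with, e.g., random digit strings, at the cost of a slightly more involved decoder.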