Endless Jailbreaks with Bijection Learning

Authors: Brian R.Y. Huang, Maximilian Li, Leonard Tang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental
We evaluate bijection learning on a range of frontier models, including GPT and Claude models, and achieve state-of-the-art Attack Success Rate (ASR) measurements across models and attack datasets. Our attack adapts smoothly to model scale and shows a concrete trend of increasing efficacy with model strength. [...] 3 EXPERIMENTS [...] We report ASRs for bijection learning on frontier models: Claude 3 Haiku, Claude 3 Opus, Claude 3.5 Sonnet, GPT-4o-mini, and GPT-4o. We use the AdvBench-50 (Chao et al., 2023) and HarmBench (Mazeika et al., 2024) datasets of harmful attack intents.
Researcher Affiliation | Industry
Brian R.Y. Huang, Maximilian Li & Leonard Tang (Haize Labs)
Pseudocode | No
The paper describes the bijection learning method in Section 2, detailing its steps in prose, but it does not present those steps in a formally structured pseudocode block or algorithm environment.
Open Source Code | No
The paper does not provide any statement or link indicating that the source code for its methodology is publicly available. The Limitations section notes, "It remains to be seen whether bijection learning jailbreaks can be achieved zero-shot, with fewer input tokens, on open-source models finetuned on bijection language learning examples," implying the current work does not include an open-source code release.
Open Datasets | Yes
We use the AdvBench-50 (Chao et al., 2023) and HarmBench (Mazeika et al., 2024) datasets of harmful attack intents.
Dataset Splits | Yes
We report ASRs on the AdvBench-50 set and on the full HarmBench test set for a suite of frontier models. [...] We sample a subset of 35 intents from HarmBench with 5 intents from each risk category (HarmBench-35).
Hardware Specification | No
The paper mentions running evaluations on models but does not specify any hardware details such as GPU models, CPU types, or memory specifications used for these evaluations.
Software Dependencies | No
The paper mentions using specific models such as GPT-4o-mini as classifiers or targets, but it does not provide version numbers for any software libraries, frameworks, or programming languages used to implement the methodology.
Experiment Setup | Yes
We evaluate bijection learning with best-of-n sampling by selecting an attack budget n after which the ASR tapers off. [...] Table 1: We report ASRs on the AdvBench-50 set and on the full HarmBench test set for a suite of frontier models. For the ensemble baseline, we group together 11 previous encoding-based attacks and mark the ensemble of methods successful if any single attack succeeded for an intent. [...] We define the dispersion d of a bijection as the number of letters that do not map to themselves. [...] We define the encoding length ℓ of a bijection as the number of letters or numbers in each sequence in the codomain. [...] we evaluate the LLM on J(x) by generating a single response with temperature 0. [...] We select a fixed sequence of 10 translation examples, so our prompt template is deterministic up to the random bijective mapping (see Appendix A).
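To make the dispersion parameter concrete: for the simplest case of encoding length ℓ = 1 (letter-to-letter mappings), a bijection with dispersion d is a permutation of the alphabet with exactly d non-fixed points. The sketch below is illustrative only (the paper does not release code, and its full method also covers longer codomain sequences); all function names here are hypothetical.

```python
import random
import string

def sample_bijection(dispersion, rng=None):
    """Draw a random permutation of the lowercase alphabet with exactly
    `dispersion` non-fixed points (dispersion must be 0 or >= 2, since a
    single letter cannot move without a partner to swap with)."""
    rng = rng or random.Random()
    letters = list(string.ascii_lowercase)
    moved = rng.sample(letters, dispersion)
    # Rejection-sample a derangement of the chosen letters so none stays fixed.
    while True:
        shuffled = moved[:]
        rng.shuffle(shuffled)
        if all(a != b for a, b in zip(moved, shuffled)):
            break
    mapping = dict(zip(letters, letters))  # unchosen letters are fixed points
    mapping.update(zip(moved, shuffled))
    return mapping

def encode(text, mapping):
    # Apply the bijection character-wise; non-alphabetic characters pass through.
    return "".join(mapping.get(ch, ch) for ch in text.lower())
```

Decoding is just encoding with the inverse mapping, since the map is a bijection.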
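The best-of-n evaluation described above can be summarized in a few lines: an intent counts as jailbroken if any of its n attack attempts succeeds. This is an illustrative sketch, not the authors' evaluation harness; the function name and data layout are assumptions.

```python
def best_of_n_asr(outcomes):
    """Best-of-n Attack Success Rate. `outcomes` maps each harmful intent to
    its list of per-attempt jailbreak judgments (True = attempt succeeded).
    An intent counts as a success if any of its attempts succeeds."""
    jailbroken = sum(1 for attempts in outcomes.values() if any(attempts))
    return jailbroken / len(outcomes)
```

For example, `best_of_n_asr({"intent_a": [False, True], "intent_b": [False, False]})` returns 0.5: one of two intents was jailbroken within its attack budget.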