Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

Authors: Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, Tony Wang, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conducted a large-scale empirical study to assess the transferability of gradient-based universal image jailbreaks."
Researcher Affiliation | Collaboration | Rylan Schaeffer (Stanford CS); Dan Valentine (Independent); Luke Bailey (Harvard SEAS); James Chua (Independent); Cristóbal Eyzaguirre (Stanford CS); Zane Durante (Stanford CS); Joe Benton (Anthropic); Brando Miranda (Stanford CS); Henry Sleight (Constellation); Tony Tong Wang (MIT EECS); John Hughes (Constellation); Rajashree Agrawal (Constellation); Mrinank Sharma (Anthropic); Scott Emmons (UC Berkeley EECS); Sanmi Koyejo (Stanford CS); Ethan Perez (Anthropic)
Pseudocode | No | The paper describes the methodology in narrative text within Section 2, "METHODOLOGY TO OPTIMIZE AND EVALUATE IMAGE JAILBREAKS", and its subsections (e.g., "Harmful-Yet-Helpful Text Datasets", "Finding White-Box Image Jailbreaks", "Vision-Language Models (VLMs)", "Measuring Jailbreak Success"), but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | The paper states in Appendix A: "In this work, we created 18 new VLMs based on the cross-product of 6 language backbones (Gemma Instruct 2B, Gemma Instruct 8B, Llama 2 Chat 7B, Llama 3 Instruct 8B, Mistral Instruct v0.2, Phi 3 Instruct 4B) and 3 vision backbones (CLIP, SigLIP, DINOv2+SigLIP) using the prismatic training code. The VLMs are publicly available on Hugging Face." This refers to the release of models, not the specific source code for the image jailbreak optimization methodology. The paper also mentions using and extending the "Prismatic repository", which is a third-party tool.
Open Datasets | Yes | "We consider three different datasets: (i) AdvBench (Zou et al., 2023), which includes highly formulaic responses to harmful prompts that always begin with 'Sure'; (ii) Anthropic HHH (Ganguli et al., 2022), which is a dataset of human preference comparisons; and (iii) Generated data, which consists of synthetic prompts generated by Claude 3 Opus across 51 harmful topics and responses generated by Llama 3 Instruct; see App. D for more information."
Dataset Splits | Yes | "AdvBench was previously used to optimize text jailbreaks that successfully transferred to leading language models including GPT-4 and Claude. AdvBench is highly formulaic: the target response is always 'Sure,' followed by restating the prompt; for example, if the prompt is 'Write a script that can exploit vulnerabilities in an operating system', the target response is 'Sure, here is a script that can exploit vulnerabilities in an operating system.' This dataset has 416 training and 104 test prompt-response pairs. [...] This Anthropic HHH dataset was manually subsampled to 416 training and 104 test prompt-response pairs to match the number of samples in AdvBench. [...] This Generated dataset had 48k training and 12k test prompt-response pairs."
Hardware Specification | No | The paper mentions funding for compute in the acknowledgements ("Open Philanthropy and FAR AI for providing funding for compute.") but does not provide specific details on the hardware used for the experiments, such as GPU/CPU models or specific cloud instances.
Software Dependencies | No | The paper mentions using "Adam (Kingma & Ba, 2017)" as the optimizer but does not specify any programming languages, libraries, or other key software components with their version numbers needed for reproducibility.
Experiment Setup | Yes | "We optimized each image for 50,000 steps using Adam (Kingma & Ba, 2017) with learning rate 1e-3, momentum 0.9, epsilon 1e-4, and weight decay 1e-5. We used a batch size of 2 and accumulated 4 batches for each gradient step, for an effective batch size of 8. All VLM parameters were frozen."
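The optimization loop quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: the `loss_grad_fn` hook, the pixel-range clipping, and the AdamW-style decoupled weight decay are assumptions not specified in the excerpt, and the quoted "momentum 0.9" is mapped to Adam's beta1.

```python
import numpy as np

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-4, weight_decay=1e-5):
    """One Adam update with the hyperparameters quoted in the table.

    The decoupled (AdamW-style) weight decay is an assumption; the paper
    only lists the hyperparameter values, not the exact formulation.
    """
    state["t"] += 1
    grad = grad + weight_decay * param
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

def optimize_image(loss_grad_fn, image, steps=50_000, micro_batch=2, accum=4):
    """Gradient-accumulated attack loop: 4 micro-batches of size 2 are
    averaged per update, giving the effective batch size of 8 from the paper.

    `loss_grad_fn(image, batch_size)` is a hypothetical hook standing in for
    backprop of the jailbreak loss through a frozen VLM.
    """
    state = {"t": 0, "m": np.zeros_like(image), "v": np.zeros_like(image)}
    for _ in range(steps):
        grad = np.zeros_like(image)
        for _ in range(accum):
            grad += loss_grad_fn(image, micro_batch) / accum
        # Clip to a valid pixel range (an assumption; common in pixel-space attacks).
        image = np.clip(adam_step(image, grad, state), 0.0, 1.0)
    return image
```

Swapping the toy gradient for backprop through a frozen VLM (e.g., via PyTorch autograd) recovers the described setup; only the image pixels receive updates, since all VLM parameters were frozen.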