BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Authors: Yunhan Zhao, Xiang Zheng, Lin Luo, Yige Li, Xingjun Ma, Yu-Gang Jiang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin.
Researcher Affiliation | Academia | 1 Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; 2 City University of Hong Kong; 3 Singapore Management University. EMAIL EMAIL; {xiang.zheng}@cityu.edu.hk; {xdliyige}@gmail.com.
Pseudocode | Yes | Algorithm 1: Fine-Tuning the Blue-Team Suffix Generator
Open Source Code | Yes | Code is available at https://github.com/Vinsonzyh/BlueSuffix.
Open Datasets | Yes | We run our experiments on four popular safety benchmarks: AdvBench (Zou et al., 2023), MM-SafetyBench (Liu et al., 2024c), RedTeam-2K (Luo et al., 2024), and Harmful Instructions (Qi et al., 2024). Detailed introductions of the safety benchmarks are provided in Appendix E.
Dataset Splits | No | The blue suffix generator is fine-tuned from a pre-trained GPT-2 using Proximal Policy Optimization (PPO) (Schulman et al., 2017) on hard jailbreak prompts crafted by the BAP attack (Ying et al., 2024) on all 13 jailbreak topics from MM-SafetyBench. Note that the fine-tuned GPT-2 is then applied to defend against other attacks (ImgJP, VAA, GCG, and AutoDAN) and other datasets (RedTeam-2K, AdvBench, and Harmful Instructions) to test its generalizability. The fine-tuning batch size is set to 32. The reward given by the LLM judge (i.e., GPT-4o; the judge template is provided in Appendix D) is 1 if the model's response is benign, and 0 otherwise. Fine-tuning stops once the expected safety score exceeds 0.95, after about 300 epochs.
Hardware Specification | No | The computations in this research were performed using the CFFF platform of Fudan University.
Software Dependencies | No | We fine-tune a GPT-2 model (Radford et al., 2019) for the suffix generator. ... We utilize GPT-4o (Achiam et al., 2023) to achieve the above objective with a rewritten template. As GPT-4o is a commercial model, we also test the open-source model Llama-3-8B-Instruct (AI@Meta, 2024) as the text purifier.
Experiment Setup | Yes | The fine-tuning batch size is set to 32. The reward given by the LLM judge (i.e., GPT-4o; the judge template is provided in Appendix D) is 1 if the model's response is benign, and 0 otherwise. Fine-tuning stops once the expected safety score exceeds 0.95, after about 300 epochs.
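The reward and stopping rule quoted above (binary judge reward, stop once the expected safety score over a batch exceeds 0.95) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `judge_is_benign` is a hypothetical stand-in for the GPT-4o judge call, and the toy judge below is purely illustrative.

```python
from typing import Callable, List


def suffix_reward(judge_is_benign: Callable[[str], bool], response: str) -> int:
    """Binary PPO reward: 1 if the LLM judge rates the VLM's response benign, else 0."""
    return 1 if judge_is_benign(response) else 0


def should_stop(rewards: List[int], threshold: float = 0.95) -> bool:
    """Stop fine-tuning once the expected safety score (mean reward) exceeds the threshold."""
    return bool(rewards) and sum(rewards) / len(rewards) > threshold


# Toy stand-in judge (NOT the paper's GPT-4o judge): flags a common jailbreak opener.
judge = lambda r: "Sure, here" not in r

# One batch of 32 responses, as in the paper's batch size; 31 benign, 1 harmful.
batch = ["I can't help with that."] * 31 + ["Sure, here is how..."]
rewards = [suffix_reward(judge, r) for r in batch]
print(should_stop(rewards))  # 31/32 ≈ 0.969 > 0.95, so fine-tuning would stop: True
```

In a full run this check would sit inside the PPO loop (e.g., after each epoch of PPO updates on the GPT-2 suffix generator), with the judge score doubling as both the reward signal and the early-stopping criterion.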