BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
Authors: Yunhan Zhao, Xiang Zheng, Lin Luo, Yige Li, Xingjun Ma, Yu-Gang Jiang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructBLIP, and Gemini) and four safety benchmarks (Harmful Instructions, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. |
| Researcher Affiliation | Academia | 1. Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; 2. City University of Hong Kong; 3. Singapore Management University. Emails: {xiang.zheng}@cityu.edu.hk; {xdliyige}@gmail.com. |
| Pseudocode | Yes | Algorithm 1 Fine-Tuning the Blue-Team Suffix Generator |
| Open Source Code | Yes | Code is available at https://github.com/Vinsonzyh/BlueSuffix. |
| Open Datasets | Yes | We run our experiments on four popular safety benchmarks: AdvBench (Zou et al., 2023), MM-SafetyBench (Liu et al., 2024c), RedTeam-2K (Luo et al., 2024), and Harmful Instructions (Qi et al., 2024). Detailed introductions of the safety benchmarks are provided in Appendix E. |
| Dataset Splits | No | The blue suffix generator is fine-tuned from a pre-trained GPT-2 using Proximal Policy Optimization (PPO) (Schulman et al., 2017) on hard jailbreak prompts crafted by the BAP attack (Ying et al., 2024) on all 13 jailbreak topics from MM-SafetyBench. Note that the fine-tuned GPT-2 is then applied to defend against other attacks (ImgJP, VAA, GCG, and AutoDAN) and on other datasets (RedTeam-2K, AdvBench, and Harmful Instructions) to test its generalizability. The fine-tuning batch size is set to 32. The reward given by the LLM judge (i.e., GPT-4o; the judge template is provided in Appendix D) is 1 if the model's response is benign, 0 otherwise. Fine-tuning stops once the expected safety score exceeds 0.95, which takes about 300 epochs. |
| Hardware Specification | No | The computations in this research were performed using the CFFF platform of Fudan University. |
| Software Dependencies | No | We fine-tune a GPT-2 model (Radford et al., 2019) for the suffix generator. ... We utilize GPT-4o (Achiam et al., 2023) to achieve the above objective with a rewritten template. As GPT-4o is a commercial model, we also test the open-source model Llama-3-8B-Instruct (AI@Meta, 2024) as the text purifier. |
| Experiment Setup | Yes | The fine-tuning batch size is set to 32. The reward given by the LLM judge (i.e., GPT-4o; the judge template is provided in Appendix D) is 1 if the model's response is benign, 0 otherwise. Fine-tuning stops once the expected safety score exceeds 0.95, which takes about 300 epochs. |
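The reward and stopping rule quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names (`judge_reward`, `should_stop`) and the idea of averaging judge rewards over a batch to estimate the expected safety score are assumptions for exposition.

```python
def judge_reward(is_benign: bool) -> int:
    """Binary PPO reward from the LLM judge (GPT-4o in the paper):
    1 if the target VLM's response is benign, 0 otherwise."""
    return 1 if is_benign else 0


def should_stop(rewards: list[int], threshold: float = 0.95) -> bool:
    """Stop fine-tuning once the expected safety score (mean judge
    reward over the evaluated prompts) exceeds the threshold."""
    return sum(rewards) / len(rewards) > threshold
```

For example, with 96 benign responses out of 100 the expected safety score is 0.96, which crosses the 0.95 threshold and would stop fine-tuning; the paper reports this takes roughly 300 PPO epochs at batch size 32.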