BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Authors: Yunhan Zhao, Xiang Zheng, Lin Luo, Yige Li, Xingjun Ma, Yu-Gang Jiang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin.
Researcher Affiliation | Academia | 1 Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; 2 City University of Hong Kong; 3 Singapore Management University. EMAIL EMAIL; {xiang.zheng}@cityu.edu.hk; {xdliyige}@gmail.com.
Pseudocode | Yes | Algorithm 1: Fine-Tuning the Blue-Team Suffix Generator
Open Source Code | Yes | Code is available at https://github.com/Vinsonzyh/BlueSuffix.
Open Datasets | Yes | We run our experiments on four popular safety benchmarks: AdvBench (Zou et al., 2023), MM-SafetyBench (Liu et al., 2024c), RedTeam-2K (Luo et al., 2024), and Harmful Instructions (Qi et al., 2024). Detailed introductions of the safety benchmarks are provided in Appendix E.
Dataset Splits | No | The blue suffix generator is fine-tuned from a pre-trained GPT-2 using Proximal Policy Optimization (PPO) (Schulman et al., 2017) on hard jailbreak prompts crafted by the BAP attack (Ying et al., 2024) on all 13 jailbreak topics from MM-SafetyBench. Note that the fine-tuned GPT-2 is then applied to defend against other attacks (ImgJP, VAA, GCG, and AutoDAN) and other datasets (RedTeam-2K, AdvBench, and Harmful Instructions) to test its generalizability. The fine-tuning batch size is set to 32. The reward given by the LLM judge (i.e., GPT-4o; the judge template is provided in Appendix D) is 1 if the model's response is benign, and 0 otherwise. Fine-tuning stops once the expected safety score exceeds 0.95, after about 300 epochs.
Hardware Specification | No | The computations in this research were performed using the CFFF platform of Fudan University.
Software Dependencies | No | We fine-tune a GPT-2 model (Radford et al., 2019) for the suffix generator. ... We utilize GPT-4o (Achiam et al., 2023) to achieve the above objective with a rewritten template. As GPT-4o is a commercial model, we also test the open-source model Llama-3-8B-Instruct (AI@Meta, 2024) as the text purifier.
Experiment Setup | Yes | The fine-tuning batch size is set to 32. The reward given by the LLM judge (i.e., GPT-4o; the judge template is provided in Appendix D) is 1 if the model's response is benign, and 0 otherwise. Fine-tuning stops once the expected safety score exceeds 0.95, after about 300 epochs.
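The reward and stopping rule quoted above (binary judge reward, stop once the expected safety score over a batch exceeds 0.95) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `judge_is_benign` is a hypothetical stand-in for the GPT-4o judge call, and the toy judge below is purely illustrative.

```python
from typing import Callable, List


def suffix_reward(judge_is_benign: Callable[[str], bool], response: str) -> int:
    """Binary PPO reward: 1 if the LLM judge rates the VLM's response benign, else 0."""
    return 1 if judge_is_benign(response) else 0


def should_stop(rewards: List[int], threshold: float = 0.95) -> bool:
    """Stop fine-tuning once the expected safety score (mean reward) exceeds the threshold."""
    return bool(rewards) and sum(rewards) / len(rewards) > threshold


# Toy stand-in judge (NOT the paper's GPT-4o judge): flags a common jailbreak opener.
judge = lambda r: "Sure, here" not in r

# One batch of 32 responses, as in the paper's batch size; 31 benign, 1 harmful.
batch = ["I can't help with that."] * 31 + ["Sure, here is how..."]
rewards = [suffix_reward(judge, r) for r in batch]
print(should_stop(rewards))  # 31/32 ≈ 0.969 > 0.95, so fine-tuning would stop: True
```

In a full run this check would sit inside the PPO loop (e.g., after each epoch of PPO updates on the GPT-2 suffix generator), with the judge score doubling as both the reward signal and the early-stopping criterion.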