How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?
Authors: Seongyun Lee, Geewook Kim, Jiyeon Kim, Hyunji Lee, Hoyeon Chang, Sue Park, Minjoon Seo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a series of experiments to identify that safety degradation during VL adaptation stems from the adaptation process itself, not just the quality of training data. We assess existing safety tuning methods (safety SFT and RLHF) through comprehensive evaluations and find them lacking, either reducing the model's helpfulness or failing to ensure complete safety. Our experimental results validate the proposed method, and we provide openly accessible models and code to support further research. |
| Researcher Affiliation | Collaboration | KAIST AI, NAVER Cloud AI |
| Pseudocode | No | The paper describes methods in prose and does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our experimental results validate the proposed method, and we provide openly accessible models and code to support further research. |
| Open Datasets | Yes | We perform VL adaptation on both LLaMA-2 Chat and Tulu-2 using the LLaVA-Pretrain and LLaVA-Instruct datasets (Liu et al., 2023b;a)... We utilize the VLGuard (Zong et al., 2024) dataset, a multimodal safety tuning dataset... We apply the Direct Preference Optimization (DPO) method (Rafailov et al., 2024) to Tulu-2-VL using text-only safety-focused preference data from Safe RLHF (Dai et al., 2023) and multimodal preference data from SPA-VL (Zhang et al., 2024)... For text-only safety, we use Sorry Bench (Xie et al., 2024) and Wild Jailbreak (Jiang et al., 2024), while for multimodal safety, we utilize MMSafetybench (Liu et al., 2023c), SIUO (Wang et al., 2024), and Figstep (Gong et al., 2023a)... For multimodal helpfulness benchmarks, we use MMBench (Liu et al., 2023d), MME (Fu et al., 2024), and SEEDBench (Li et al., 2023a). |
| Dataset Splits | Yes | We perform VL adaptation on both LLaMA-2 Chat and Tulu-2 using the LLaVA-Pretrain and LLaVA-Instruct datasets (Liu et al., 2023b;a)... In the MTL approach, we create LLaMA-2-Chat-VL-MTL by combining LLaVA-Instruct and VLGuard into a single training dataset and conducting supervised fine-tuning on LLaMA-2 Chat 7B... For text-only safety, we use Sorry Bench (Xie et al., 2024) and Wild Jailbreak (Jiang et al., 2024), while for multimodal safety, we utilize MMSafetybench (Liu et al., 2023c), SIUO (Wang et al., 2024), and Figstep (Gong et al., 2023a). |
| Hardware Specification | Yes | For model training, we use four NVIDIA H100 80GB GPUs, and for evaluation, we employ four NVIDIA A100 80GB GPUs. The CPU used is the AMD EPYC 7763 64-Core Processor, featuring 64 cores, a CPU speed of 1497.674 MHz, and a cache size of 512KB. |
| Software Dependencies | No | We use the LLaVA codebase for model training and the vLLM library for evaluation... We use bfloat16 (bf16) precision... The paper does not specify version numbers for key software components or libraries. |
| Experiment Setup | Yes | VL training: We use bfloat16 (bf16) precision, set the number of training epochs to 1, and configure the training batch size to 16 samples per device, resulting in a global batch size of 64... The learning rate is set to 2e-5 without applying weight decay. We utilize a warm-up phase covering 3% of the total training steps and employ a cosine learning rate scheduler... The maximum sequence length for the model input is set to 2048 tokens. Safety DPO: we use bfloat16 (bf16) mixed precision, train for three epochs, and configure the training batch size to 1 sample per device, resulting in a global batch size of 32. The learning rate is set to 5e-7, linearly decaying to 0 with a warm-up ratio of 0.1. The maximum sequence length for the model input is set to 2048 tokens. Evaluation: The sampling temperature... is set to 0.1... The maximum number of tokens generated in each output is limited to 512. The frequency penalty... is set to 0.0, while the repetition penalty... is set to 1.0... We set the top-p parameter to 1.0... the length penalty... is set to 1.0. |
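The reported per-device and global batch sizes, combined with the four training GPUs listed under Hardware Specification, imply a specific gradient-accumulation setup. A minimal sketch of that arithmetic (variable names are ours; the paper does not state the accumulation factor explicitly):

```python
# Batch-size arithmetic implied by the reported setup: 4 training GPUs;
# all hyperparameter values are quoted from the table above.

NUM_GPUS = 4

# VL training: 16 samples per device on 4 GPUs -> global batch of 64,
# so no gradient accumulation is needed.
vl_per_device = 16
vl_global = vl_per_device * NUM_GPUS  # 64, matching the reported value

# Safety DPO: 1 sample per device but a global batch of 32, which
# implies accumulating gradients over 8 micro-batches on 4 GPUs.
dpo_per_device = 1
dpo_global = 32
dpo_grad_accum = dpo_global // (dpo_per_device * NUM_GPUS)  # 8

# Warm-up for VL training covers 3% of total optimizer steps.
def warmup_steps(total_steps: int, ratio: float = 0.03) -> int:
    return int(total_steps * ratio)

print(vl_global, dpo_grad_accum, warmup_steps(1000))
```

This is one consistent reading of the reported numbers; other combinations (e.g. fewer GPUs for DPO with a larger accumulation factor) would also satisfy the stated global batch size.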