Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models

Authors: Shicheng Xu, Liang Pang, Yunchang Zhu, Huawei Shen, Xueqi Cheng

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Experiments show that TGA not only successfully transfers the safety mechanism for text in basic LLMs to vision in vision-language alignment for LVLMs without any safety fine-tuning on the visual modality, but also maintains general performance on various vision tasks. Code is available. [...] Extensive experiments show that TGA successfully transfers the safety mechanism for text in basic LLMs to vision during vision-language alignment training for LVLMs without any safety fine-tuning on the visual modality. [...] Our TGA significantly improves the safety capabilities of LVLMs on toxic images compared with the mainstream vision-language alignment method, without any additional safety fine-tuning on vision.
Researcher Affiliation | Collaboration | Shicheng Xu1,2, Liang Pang1, Yunchang Zhu3, Huawei Shen1, Xueqi Cheng1. 1CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; 2University of Chinese Academy of Sciences; 3Huawei Inc. EMAIL,EMAIL
Pseudocode | Yes | Algorithm 1: Find the layer where the safety mechanism is activated.
1: for j = 2 to N do                     ▷ traverse the word-distribution change over transformer layers 1 to N
2:   if arg max Dj(x|t, s) ∈ K then      ▷ the layer where "sorry" tokens first rank Top-1 in the word-distribution change
3:     return j
4:   end if
5: end for
6: return 1                              ▷ if no such layer is found, return 1
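The pseudocode in this row can be sketched in Python. The helper below is a hedged illustration, not the repository's implementation: `layer_top1_tokens` and `sorry_token_ids` are hypothetical stand-ins for the argmax of the word-distribution change D_j(x|t, s) and the set K of "sorry" tokens.

```python
def find_safety_activation_layer(layer_top1_tokens, sorry_token_ids):
    """Return the first layer j (scanning j = 2..N) whose top-1 token in the
    word-distribution change is a 'sorry' (refusal) token; return 1 if the
    safety mechanism is never activated.

    layer_top1_tokens: dict {layer j: argmax token id of D_j} (hypothetical).
    sorry_token_ids: the set K of refusal-related token ids.
    """
    n_layers = max(layer_top1_tokens)
    for j in range(2, n_layers + 1):        # traverse layers 2..N
        if layer_top1_tokens[j] in sorry_token_ids:
            return j                        # safety mechanism activates here
    return 1                                # no such layer found
```

Scanning stops at the first match, mirroring the early `return j` in the algorithm.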
Open Source Code | Yes | Code is available: https://github.com/xsc1234/VLM_Safety_Transfer
Open Datasets | Yes | We collect real toxic images from open-source datasets. For each image, we use LLaVA-NeXT (Liu et al., 2024b) to generate a caption, yielding a toxic text-image pair. The text and image in each pair have the same semantics but are in different modalities. The specific datasets are HOD (Ha et al., 2023), which contains 10,631 toxic images covering alcohol, cigarettes, guns, insulting gestures, and knives, and ToViLaG (Wang et al., 2023), which contains 9,900 toxic images covering bloody and pornographic content. After caption generation, we obtain 20,531 toxic text-image pairs for experiments.
Dataset Splits | Yes | The training datasets for TGA are consistent with LLaVA (Liu et al., 2024c), which collects 558K images for pre-training and 665K images for instruction-tuning. The pre-training dataset is a filtered subset of CC3M, and the instruction-tuning dataset is LLaVA-1.5-mix665k.
Hardware Specification | Yes | Our model is trained on 64 V100 GPUs in float32, with DeepSpeed ZeRO Stage 3 as the acceleration framework.
Software Dependencies | Yes | We use Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) as the basic LLM, clip-vit-large-patch14-336 (Radford et al., 2021) as the vision tower, and a two-layer MLP as the projector. Our model is trained on 64 V100 GPUs in float32, with DeepSpeed ZeRO Stage 3 as the acceleration framework.
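The two-layer MLP projector mentioned here can be illustrated with a minimal dependency-free sketch. This is an assumption-laden toy, not the repository's module: it shows the Linear -> GELU -> Linear shape that maps a vision feature vector (e.g. CLIP ViT-L/14-336's 1024-d output) into the LLM embedding space (e.g. Mistral-7B's 4096-d hidden size); the tiny dimensions below are purely illustrative.

```python
import math

def gelu(v):
    # tanh approximation of GELU, common in LLM stacks
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def mlp_projector(x, w1, b1, w2, b2):
    """Two-layer MLP projector sketch: Linear -> GELU -> Linear.

    x: vision feature vector (list of floats).
    w1, b1: first linear layer (rows of w1 are output neurons).
    w2, b2: second linear layer, projecting into the LLM hidden size.
    """
    hidden = [gelu(sum(wi * xi for wi, xi in zip(row, x)) + bi)
              for row, bi in zip(w1, b1)]
    return [sum(wi * hi for wi, hi in zip(row, hidden)) + bi
            for row, bi in zip(w2, b2)]
```

In practice such a projector is a standard `torch.nn.Sequential` of two `nn.Linear` layers; the pure-Python version here only makes the data flow explicit.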
Experiment Setup | Yes | Most hyperparameters follow LLaVA (Liu et al., 2024c). In pre-training, we freeze the basic LLM and train only the projector for 1 epoch with a learning rate of 2e-3. In instruction-tuning, we make all parameters trainable with a learning rate of 2e-6 for 1 epoch.
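The two-stage setup in this row can be summarized as a small config sketch. The key names below are illustrative, not the repository's actual configuration schema; only the values (epochs, learning rates, which parameters train) come from the quoted description.

```python
# Hedged sketch of the two training stages described above.
PRETRAIN = {
    "trainable": ["projector"],   # basic LLM frozen
    "epochs": 1,
    "learning_rate": 2e-3,
}
INSTRUCTION_TUNING = {
    "trainable": ["all"],         # all parameters unfrozen
    "epochs": 1,
    "learning_rate": 2e-6,
}
```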