Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models

Authors: Shicheng Xu, Liang Pang, Yunchang Zhu, Huawei Shen, Xueqi Cheng

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Experiments show that TGA not only successfully transfers the safety mechanism for text in basic LLMs to vision in vision-language alignment for LVLMs without any safety fine-tuning on the visual modality, but also maintains general performance on various vision tasks. Code is available. [...] Extensive experiments show that TGA successfully transfers the safety mechanism for text in basic LLMs to vision during vision-language alignment training for LVLMs without any safety fine-tuning on the visual modality. [...] Our TGA significantly improves the safety capabilities of LVLMs on toxic images compared with the mainstream vision-language alignment method, without any additional safety fine-tuning on vision.
Researcher Affiliation | Collaboration | Shicheng Xu1,2, Liang Pang1, Yunchang Zhu3, Huawei Shen1, Xueqi Cheng1. 1CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; 2University of Chinese Academy of Sciences; 3Huawei Inc. EMAIL,EMAIL
Pseudocode | Yes | Algorithm 1: Find the layer where the safety mechanism is activated.
1: for j = 2 to N do                     ▷ traverse the word-distribution change over transformer layers 1 to N
2:   if arg max Dj(x|t, s) ∈ K then      ▷ the layer where "sorry" tokens first rank Top-1 in the word-distribution change
3:     return j
4:   end if
5: end for
6: return 1                              ▷ if no such layer is found, return 1
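The pseudocode in this row can be sketched in Python. The helper below is a hedged illustration, not the repository's implementation: `layer_top1_tokens` and `sorry_token_ids` are hypothetical stand-ins for the argmax of the word-distribution change D_j(x|t, s) and the set K of "sorry" tokens.

```python
def find_safety_activation_layer(layer_top1_tokens, sorry_token_ids):
    """Return the first layer j (scanning j = 2..N) whose top-1 token in the
    word-distribution change is a 'sorry' (refusal) token; return 1 if the
    safety mechanism is never activated.

    layer_top1_tokens: dict {layer j: argmax token id of D_j} (hypothetical).
    sorry_token_ids: the set K of refusal-related token ids.
    """
    n_layers = max(layer_top1_tokens)
    for j in range(2, n_layers + 1):        # traverse layers 2..N
        if layer_top1_tokens[j] in sorry_token_ids:
            return j                        # safety mechanism activates here
    return 1                                # no such layer found
```

Scanning stops at the first match, mirroring the early `return j` in the algorithm.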
Open Source Code | Yes | Code is available: https://github.com/xsc1234/VLM_Safety_Transfer
Open Datasets | Yes | We collect real toxic images from open-source datasets. For each image, we use LLaVA-NeXT (Liu et al., 2024b) to generate a caption, yielding a toxic text-image pair. The text and image in each pair have the same semantics but are in different modalities. The specific datasets are HOD (Ha et al., 2023), which contains 10,631 toxic images covering alcohol, cigarettes, guns, insulting gestures, and knives, and ToViLaG (Wang et al., 2023), which contains 9,900 toxic images covering bloody and pornographic content. After caption generation, we obtain 20,531 toxic text-image pairs for experiments.
Dataset Splits | Yes | The training datasets for TGA are consistent with LLaVA (Liu et al., 2024c), which collects 558K images for pre-training and 665K images for instruction-tuning. The pre-training dataset is a filtered subset of CC3M, and the instruction-tuning dataset is LLaVA-1.5-mix665k.
Hardware Specification | Yes | Our model is trained on 64 V100 GPUs in float32, with DeepSpeed ZeRO Stage 3 as the acceleration framework.
Software Dependencies | Yes | We use Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) as the basic LLM, clip-vit-large-patch14-336 (Radford et al., 2021) as the vision tower, and a two-layer MLP as the projector. Our model is trained on 64 V100 GPUs in float32, with DeepSpeed ZeRO Stage 3 as the acceleration framework.
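The two-layer MLP projector mentioned here can be illustrated with a minimal dependency-free sketch. This is an assumption-laden toy, not the repository's module: it shows the Linear -> GELU -> Linear shape that maps a vision feature vector (e.g. CLIP ViT-L/14-336's 1024-d output) into the LLM embedding space (e.g. Mistral-7B's 4096-d hidden size); the tiny dimensions below are purely illustrative.

```python
import math

def gelu(v):
    # tanh approximation of GELU, common in LLM stacks
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def mlp_projector(x, w1, b1, w2, b2):
    """Two-layer MLP projector sketch: Linear -> GELU -> Linear.

    x: vision feature vector (list of floats).
    w1, b1: first linear layer (rows of w1 are output neurons).
    w2, b2: second linear layer, projecting into the LLM hidden size.
    """
    hidden = [gelu(sum(wi * xi for wi, xi in zip(row, x)) + bi)
              for row, bi in zip(w1, b1)]
    return [sum(wi * hi for wi, hi in zip(row, hidden)) + bi
            for row, bi in zip(w2, b2)]
```

In practice such a projector is a standard `torch.nn.Sequential` of two `nn.Linear` layers; the pure-Python version here only makes the data flow explicit.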
Experiment Setup | Yes | Most hyperparameters follow LLaVA (Liu et al., 2024c). In pre-training, we freeze the basic LLM and train only the projector for 1 epoch with a learning rate of 2e-3. In instruction-tuning, we make all parameters trainable with a learning rate of 2e-6 for 1 epoch.
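The two-stage setup in this row can be summarized as a small config sketch. The key names below are illustrative, not the repository's actual configuration schema; only the values (epochs, learning rates, which parameters train) come from the quoted description.

```python
# Hedged sketch of the two training stages described above.
PRETRAIN = {
    "trainable": ["projector"],   # basic LLM frozen
    "epochs": 1,
    "learning_rate": 2e-3,
}
INSTRUCTION_TUNING = {
    "trainable": ["all"],         # all parameters unfrozen
    "epochs": 1,
    "learning_rate": 2e-6,
}
```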