TRUST-VLM: Thorough Red-Teaming for Uncovering Safety Threats in Vision-Language Models

Authors: Kangjie Chen, Muyang Li, Guanlin Li, Shudong Zhang, Shangwei Guo, Tianwei Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that TRUST-VLM not only outperforms traditional red-teaming techniques in generating diverse and effective adversarial cases but also provides actionable insights for model improvement. To evaluate the effectiveness of TRUST-VLM, we conduct comprehensive experiments on four open-source models (LLaVA-v1.5-13B, Qwen2-VL-7B, DeepSeek-VL-7B, and Phi-3-Vision-128K) and a commercial model (GPT-4o) with six harmful categories.
Researcher Affiliation | Academia | 1 Digital Trust Center, Nanyang Technological University, Singapore; 2 College of Computing and Data Science, Nanyang Technological University, Singapore; 3 School of Computer Science and Technology, Xidian University, China; 4 College of Computer Science, Chongqing University, China. Correspondence to: Shangwei Guo <EMAIL>.
Pseudocode | No | The paper describes the methodology in prose (Section 4) and provides ICL templates in Appendix F, but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The paper neither states that its own source code will be released nor links to a code repository.
Open Datasets | Yes | To evaluate the effectiveness of TRUST-VLM, we conduct comprehensive experiments on four open-source models (LLaVA-v1.5-13B, Qwen2-VL-7B, DeepSeek-VL-7B, and Phi-3-Vision-128K) and a commercial model (GPT-4o) with six harmful categories. We compare it against three types of baselines: an automatic red-teaming method (Arondight (Liu et al., 2024)), benchmark-based red-teaming methods (JailbreakV-28K (Luo et al., 2024) and Red-Teaming VLM (RTVLM) (Li et al., 2024c)), and a jailbreak attack (HADES (Li et al., 2024d)).
Dataset Splits | No | The paper describes generating test cases for evaluating target VLMs (e.g., 'We generate 200 test cases for each red-teaming method on LLaVA-v1.5-13B'), but it does not specify conventional training/validation/test splits for any model developed or trained within the paper.
Hardware Specification | Yes | During our experiments, we use 4 A6000 to launch the red-teaming pipeline.
Software Dependencies | No | The paper names specific models such as 'Llama-3.1-70B-Instruct (Meta, 2024)', 'Stable Diffusion 3 Medium (Esser et al., 2024)', and 'BART-Large-MNLI (Facebook, 2024)', but it does not give version numbers for general software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or other frameworks.
Experiment Setup | Yes | We introduce the detailed inference settings for VLMs and the red-teaming model in our framework in Table 8 and Table 9, respectively. A threshold parameter ϵ is introduced to filter out less confident results and prioritize reliable predictions. In the TRUST-VLM framework, this threshold is set to 0.75 by default. In our experiments, we use t = 2 by default to balance the information abundance and the implication of the harmful concept. For each aforementioned category, TRUST-VLM performs 50 rounds to generate test cases.
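The confidence-threshold filtering described above (ϵ = 0.75) can be illustrated with a minimal sketch. This is not the authors' code: the function `filter_confident` and the prediction format are assumptions, modeled on the `{'label': ..., 'score': ...}` output shape of zero-shot classifiers such as BART-Large-MNLI, which the paper mentions using.

```python
# Hedged sketch (not the paper's implementation): keep only classifier
# predictions whose confidence meets the TRUST-VLM default threshold.
EPSILON = 0.75  # default threshold reported in the paper

def filter_confident(predictions, epsilon=EPSILON):
    """Return predictions whose 'score' is at least `epsilon`.

    `predictions` mimics a zero-shot classifier's output: a list of
    dicts with 'label' and 'score' keys. Less confident results are
    dropped so that only reliable predictions are acted on.
    """
    return [p for p in predictions if p["score"] >= epsilon]

# Example: only the first prediction clears the 0.75 bar.
preds = [
    {"label": "harmful", "score": 0.91},
    {"label": "harmful", "score": 0.62},
    {"label": "benign", "score": 0.40},
]
kept = filter_confident(preds)
```

Raising ϵ trades recall for precision: fewer candidate test cases survive filtering, but those that do are more reliably judged.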