TRUST-VLM: Thorough Red-Teaming for Uncovering Safety Threats in Vision-Language Models
Authors: Kangjie Chen, Muyang Li, Guanlin Li, Shudong Zhang, Shangwei Guo, Tianwei Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that TRUST-VLM not only outperforms traditional red-teaming techniques in generating diverse and effective adversarial cases but also provides actionable insights for model improvement. To evaluate the effectiveness of TRUST-VLM, we conduct comprehensive experiments on four open-source models (LLaVA-v1.5-13B, Qwen2-VL-7B, DeepSeek-VL-7B, and Phi-3-Vision-128K) and a commercial model (GPT-4o) with six harmful categories. |
| Researcher Affiliation | Academia | 1Digital Trust Center, Nanyang Technological University, Singapore; 2College of Computing and Data Science, Nanyang Technological University, Singapore; 3School of Computer Science and Technology, Xidian University, China; 4College of Computer Science, Chongqing University, China. Correspondence to: Shangwei Guo <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in prose (Section 4) and provides ICL templates in Appendix F, but it does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include a link to a code repository. |
| Open Datasets | Yes | To evaluate the effectiveness of TRUST-VLM, we conduct comprehensive experiments on four open-source models (LLaVA-v1.5-13B, Qwen2-VL-7B, DeepSeek-VL-7B, and Phi-3-Vision-128K) and a commercial model (GPT-4o) with six harmful categories. We compare it against three types of baselines: an automatic red-teaming method (Arondight (Liu et al., 2024)), benchmark-based red-teaming methods (JailbreakV-28K (Luo et al., 2024) and Red-teaming VLM (RTVLM) (Li et al., 2024c)), and a jailbreak attack (HADES (Li et al., 2024d)). |
| Dataset Splits | No | The paper describes the generation of test cases for evaluating target VLMs (e.g., 'We generate 200 test cases for each red-teaming method on LLaVA-v1.5-13B'), but it does not specify traditional training/test/validation dataset splits for a model being developed or trained within the paper. |
| Hardware Specification | Yes | During our experiments, we use 4 A6000 GPUs to launch the red-teaming pipeline. |
| Software Dependencies | No | The paper mentions specific models like 'Llama-3.1-70B-Instruct (Meta, 2024)', 'Stable Diffusion 3 Medium (Esser et al., 2024)', and 'BART-Large-MNLI (Facebook, 2024)', but it does not provide specific version numbers for general software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or other frameworks. |
| Experiment Setup | Yes | We introduce the detailed inference settings for VLMs and the red-teaming model in our framework in Table 8 and Table 9, respectively. A threshold parameter ϵ is introduced to filter out less confident results and prioritize reliable predictions. In the TRUST-VLM framework, this threshold is set to 0.75 by default. In our experiments, we use t = 2 by default to balance the information abundance and the implication of the harmful concept. For each aforementioned category, TRUST-VLM performs 50 rounds to generate test cases. |
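The confidence-threshold filtering described in the setup row can be sketched as below. This is a minimal illustration, not the paper's implementation: the scoring function stands in for a real judge model such as BART-Large-MNLI, and the function and variable names are hypothetical.

```python
EPSILON = 0.75  # default threshold epsilon reported in the paper

def filter_confident(predictions, epsilon=EPSILON):
    """Keep only (label, score) predictions whose confidence score
    meets the threshold, discarding less reliable judgments."""
    return [(label, score) for label, score in predictions if score >= epsilon]

# Example: three judged test cases with mock classifier confidence scores.
judged = [("harmful", 0.91), ("harmless", 0.62), ("harmful", 0.80)]
print(filter_confident(judged))  # the 0.62 prediction is filtered out
```

In practice, the `score` values would come from a zero-shot classifier's label probabilities; the filter simply trades recall for precision by dropping borderline predictions.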