Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
Authors: Yi Ding, Bolian Li, Ruqi Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency, reducing the unsafe rate by 87.5% in cross-modality attacks and achieving 96.6% win-ties in GPT-4 helpfulness evaluation. The code is publicly available at https://github.com/DripNowhy/ETA. Through extensive experiments, we validated the effectiveness of the ETA framework across multiple dimensions: harmlessness, helpfulness, and preservation of general abilities. Our experiments also contribute insights into the interplay between different VLM components and their combined impact on model safety and performance. |
| Researcher Affiliation | Academia | Yi Ding, Bolian Li, Ruqi Zhang Department of Computer Science, Purdue University, USA EMAIL |
| Pseudocode | Yes | Algorithm 1: Evaluating Then Aligning (ETA) |
| Open Source Code | Yes | The code is publicly available at https://github.com/DripNowhy/ETA. |
| Open Datasets | Yes | We randomly selected 100 harmful and safe images from the MM-SafetyBench (Liu et al., 2023a) and COCO datasets (Lin et al., 2014), respectively. SPA-VL (Zhang et al., 2024c) is a multimodal comprehensive safety preference alignment dataset. MM-SafetyBench (Liu et al., 2023a) is a multimodal safety benchmark primarily focused on image-based attacks. FigStep (Gong et al., 2023) highlights that VLMs are vulnerable to harmful image-based attacks. AdvBench (Zou et al., 2023) is a commonly used pure-text safety dataset. MME (Fu et al., 2023) is a multimodal comprehensive benchmark. MMBench (Liu et al., 2023b) evaluates 20 fundamental capabilities of VLMs. ScienceQA (Lu et al., 2022) primarily evaluates language models' capabilities in the domain of science. TextVQA (Singh et al., 2019) assesses a model's understanding and reasoning capabilities in relation to Optical Character Recognition (OCR). VQAv2 (Goyal et al., 2017a) contains open-ended questions related to images. MMMU-Pro (Yue et al., 2024b) is a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark (Yue et al., 2024a). |
| Dataset Splits | Yes | To validate the efficacy of the CLIP score, we randomly selected 100 harmful and safe images from the MM-SafetyBench (Liu et al., 2023a) and COCO datasets (Lin et al., 2014), respectively. The test set consists of 530 data points, with 265 labeled as Harm and 265 labeled as Help, specifically designed to evaluate the model's safety and helpfulness capabilities. For VLSafe, we randomly sampled 100 data points for testing. |
| Hardware Specification | Yes | All experiments were conducted on an NVIDIA RTX A6000 platform. |
| Software Dependencies | No | The paper lists several VLM backbones (LLaVA-1.5-7B and 13B, LLaVA-NeXT-8B, LLaVA-OneVision-7B-Chat, InternVL-Chat-1.0-7B, InternLM-XComposer-2.5-7B, and Llama3.2-11B-Vision-Instruct) and a textual RM (ArmoRM-Llama3-8B-v0.1) with their names/versions, but does not specify ancillary software like programming languages (e.g., Python), libraries (e.g., PyTorch), or CUDA versions required to replicate the experiments. |
| Experiment Setup | Yes | For our ETA method, during the evaluation phase, we empirically set the thresholds to τpre = 0.16 in Eq. 3 and τpost = 0.06 in Eq. 4. In the alignment phase, we generated N = 5 candidate responses per sentence. |
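The Experiment Setup row above can be sketched in code. The following is a minimal, hypothetical illustration of the two reported hyperparameter groups: the evaluation thresholds (τpre = 0.16, τpost = 0.06) used as gates, and the alignment phase's sentence-level best-of-N selection with N = 5. The function names, the comparison directions of the gates, and the scoring callables are assumptions for illustration, not the paper's actual implementation.

```python
from typing import Callable, Sequence

# Reported hyperparameters from the paper's experiment setup.
TAU_PRE = 0.16    # pre-generation evaluation threshold (Eq. 3)
TAU_POST = 0.06   # post-generation evaluation threshold (Eq. 4)
N_CANDIDATES = 5  # candidate responses generated per sentence

def needs_alignment(pre_score: float, post_score: float) -> bool:
    """Hypothetical gate: flag a query/response pair for alignment when
    both evaluation scores exceed their thresholds. (The comparison
    direction is an assumption, not taken from the paper.)"""
    return pre_score > TAU_PRE and post_score > TAU_POST

def best_of_n(candidates: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Sentence-level best-of-N: return the candidate with the highest
    reward, mirroring the N = 5 candidate selection described above."""
    return max(candidates, key=reward_fn)

# Toy usage with a placeholder reward function (candidate length).
flagged = needs_alignment(pre_score=0.20, post_score=0.10)
chosen = best_of_n(["bad", "okay reply", "a safe, helpful reply"], reward_fn=len)
```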