See What You Are Told: Visual Attention Sink in Large Multimodal Models

Authors: Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. Our findings show that this behavior arises due to the massive activation of certain hidden state dimensions, which resembles the attention sink found in language models. Hence, we refer to this phenomenon as the visual attention sink. In particular, our analysis reveals that removing the irrelevant visual sink tokens does not impact model performance, despite receiving high attention weights. Consequently, we recycle the attention to these tokens as surplus resources, redistributing the attention budget to enhance focus on the image. To achieve this, we introduce Visual Attention Redistribution (VAR), a method that redistributes attention in image-centric heads, which we identify as innately focusing on visual information. VAR can be seamlessly applied across different LMMs to improve performance on a wide range of tasks, including general vision-language tasks, visual hallucination tasks, and vision-centric tasks, all without the need for additional training, models, or inference steps. Experimental results demonstrate that VAR enables LMMs to process visual information more effectively by adjusting their internal attention mechanisms, offering a new direction for enhancing the multimodal capabilities of LMMs.
Researcher Affiliation: Academia. Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang (Yonsei University, EMAIL).
Pseudocode: No. The paper describes the Visual Attention Redistribution (VAR) method in Section 5 with steps for selecting image-centric heads and redistributing attention weights. However, these steps are described in narrative text with mathematical equations, not presented in a structured pseudocode or algorithm block.
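To make the narrative steps concrete, the redistribution idea could be sketched as below. This is an illustrative reconstruction, not the paper's actual algorithm: the fraction `rho` of sink-token attention that gets recycled and the proportional reweighting rule over the remaining image tokens are assumptions chosen to match the described intent.

```python
import numpy as np

def redistribute_attention(attn, sink_mask, image_mask, rho=0.8):
    """Illustrative sketch (NOT the paper's exact VAR procedure):
    take a fraction `rho` of the attention mass currently spent on
    visual sink tokens and re-allocate it to the remaining image
    tokens in proportion to their existing attention weights."""
    attn = attn.copy()
    sink_budget = attn[sink_mask].sum() * rho   # surplus attention to recycle
    attn[sink_mask] *= (1.0 - rho)              # shrink sink-token attention
    img = image_mask & ~sink_mask               # non-sink image tokens
    attn[img] += sink_budget * attn[img] / attn[img].sum()
    return attn                                 # total attention mass is preserved
```

The invariant worth checking is that each attention row still sums to the same total afterward: the budget is moved from sink tokens to image tokens, not created.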
Open Source Code: Yes. Our code is included in the supplementary material and all the source codes will be made available to the public.
Open Datasets: Yes. We evaluate our method on a wide range of vision-language benchmarks. The benchmarks are divided into three categories: general vision-language task, visual hallucination task, and vision-centric task. (1) General vision-language task assesses comprehensive multimodal capabilities of LMMs. We compare our method with the base models across 10 benchmarks. (2) Visual hallucination task evaluates whether the response of the model is consistent with the image content to ensure the trustworthiness and reliability of the model. We use CHAIR (Rohrbach et al., 2018), POPE (Li et al., 2023c), and MMHal-Bench (Sun et al., 2023). (3) Vision-centric task evaluates visual understanding capabilities, such as determining the spatial relationship between objects in the image. We use MMVP (Tong et al., 2024b), CV-Bench2D, and CV-Bench3D (Tong et al., 2024a). More details on the tasks and benchmarks are provided in Appendix D.1.
Dataset Splits: Yes. GQA (Hudson & Manning, 2019): question answering on image scene graphs; we used the test-dev-balanced split for evaluation. VizWiz (Gurari et al., 2018): we used the val split for evaluation. VQA-T (Singh et al., 2019): TextVQA; we used the val split for evaluation.
Hardware Specification: Yes. All experiments and evaluations are conducted on a single NVIDIA GeForce RTX A6000 48GB GPU.
Software Dependencies: No. The paper only specifies the version of GPT-4 used for evaluation: "We conducted evaluations using the gpt-4-0613 version, with a maximal output length set to 1024." It does not provide specific version numbers for other core software dependencies such as Python, PyTorch, or CUDA, which are typically required for reproducing the experiments themselves.
Experiment Setup: Yes. We set τ = 20 and p = 0.6 for all experimental settings in our experiments. ρ is set to 0.8 for the general vision-language task in Table 1, 0.5 for the visual hallucination task in Table 2, and 0.9 for the vision-centric task in Table 3. We do not modify the attention heads in the last layer, as the last layer is considered to have a specialized role (Lad et al., 2024; Sun et al., 2024b).
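The role of the threshold τ reported above might look like the following sketch. The detection criterion here is a guess made for illustration only: a visual token is flagged as a sink token when its absolute activation in one of the "massive activation" dimensions exceeds τ times the global mean absolute activation. The dimension indices and the exact comparison rule are assumptions, not the paper's stated definition.

```python
import numpy as np

def flag_visual_sink_tokens(hidden_states, massive_dims, tau=20.0):
    """Hypothetical sink-token detector: a token is flagged when its
    absolute activation in any of the given `massive_dims` exceeds
    `tau` times the mean absolute activation across all hidden states.
    This criterion is an assumption for illustration only."""
    base = np.abs(hidden_states).mean()              # typical activation magnitude
    spikes = np.abs(hidden_states[:, massive_dims])  # shape: (tokens, len(massive_dims))
    return (spikes > tau * base).any(axis=1)         # boolean mask, one entry per token
```

With τ = 20 as in the reported setup, only activations roughly an order of magnitude above typical values would be flagged, which matches the intuition of "massive activation" driving the sink behavior.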