Towards Interpreting Visual Information Processing in Vision-Language Models

Authors: Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, Fazl Barez

ICLR 2025

Reproducibility
Variable: Result — LLM Response
Research Type: Experimental — Through ablation studies, we demonstrated that object identification accuracy drops by over 70% when object-specific tokens are removed. We observed that visual token representations become increasingly interpretable in the vocabulary space across layers, suggesting an alignment with textual tokens corresponding to image content. Finally, we found that the model extracts object information from these refined representations at the last token position for prediction, mirroring the process in text-only language models for factual association tasks. These findings provide crucial insights into how VLMs process and integrate visual information, bridging the gap between our understanding of language and vision models, and paving the way for more interpretable and controllable multimodal systems.
Researcher Affiliation: Collaboration — Clement Neo (Nanyang Technological University), Luke Ong (Nanyang Technological University), Philip Torr (University of Oxford), Mor Geva (Tel Aviv University), David Krueger (MILA), Fazl Barez (University of Oxford, Tangentic)
Pseudocode: No — The paper describes methods like 'ablation experiments' and the 'attention knockout technique' in narrative text and mathematical formulas (e.g., Equations 1, 2, and 3), but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code: Yes — "Our findings are first steps in understanding the internal mechanisms of VLMs, paving the way for more interpretable and controllable multimodal systems. The code for our experiments is available at https://github.com/clemneo/llava-interp."
Open Datasets: Yes — "Dataset. We use images from the COCO Detection Training set (Lin et al., 2014). To ensure the reliability of our results, we employ two filtering steps."
Dataset Splits: Yes — "After applying both filtering steps, our final dataset comprises 4,318 images."; "We manually curate a set of 100 images with specific questions about objects"; "To quantify this phenomenon, we analyzed 170 COCO validation images with objects of sizes between 20,000 and 30,000 square pixels (approximately 1/2 width and 1/3 height of the image)."
Hardware Specification: No — The paper does not provide specific details about the hardware used for the experiments, such as GPU models, CPU types, or memory capacity; it only vaguely mentions 'providing compute'.
Software Dependencies: No — The paper mentions several models (LLaVA 1.5, LLaVA-Phi, Qwen2-VL, CLIP, and Vicuna 13B) but does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used in the experiments.
Experiment Setup: Yes — "For each ablation experiment, we create a modified set of embeddings Ê_A: ê_i = ē if i ∈ S, and ê_i = e_i otherwise, where ē = (1/N) Σᵢ₌₁ᴺ eᵢ is the mean embedding across all visual tokens from 50,000 images from the ImageNet validation split (Deng et al., 2009). This replaces the hypothesized object-relevant tokens with an average token, effectively ablating their specific information. We do mean ablation to preserve the norms of the image tokens and keep them in-distribution, as their norms are typically much higher than the norms of text tokens (Bailey et al., 2023)." and "We employ the attention knockout technique introduced by Geva et al. (2023). This method involves selectively blocking attention between specific tokens at different layers of the model, allowing us to assess the importance of various connections for the model to identify the object. ... We apply this blocking over a window of consecutive layers, including early layers (L1–10), early-to-middle layers (L5–14), middle layers (L11–20), middle-to-late layers (L15–24), and late layers (L21–31)."
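The mean-ablation and attention-knockout procedures quoted above can be sketched in a few lines. This is a minimal, list-based illustration under stated assumptions, not the authors' implementation (which lives in the linked repository); the function names and the boolean-mask representation of attention knockout are hypothetical.

```python
def mean_embedding(reference_embeddings):
    # ē = (1/N) Σ eᵢ: average over visual-token embeddings collected from a
    # large reference set (the paper uses 50,000 ImageNet validation images).
    # Embeddings are represented here as plain lists of floats.
    n = len(reference_embeddings)
    dim = len(reference_embeddings[0])
    return [sum(e[j] for e in reference_embeddings) / n for j in range(dim)]

def mean_ablate(visual_embeddings, object_token_indices, mean_emb):
    # ê_i = ē if i ∈ S, e_i otherwise: replace the hypothesized
    # object-relevant tokens (set S) with the mean embedding, leaving all
    # other visual tokens unchanged. Using the mean rather than zeros keeps
    # the ablated tokens' norms in-distribution, as the paper notes.
    ablate = set(object_token_indices)
    return [mean_emb if i in ablate else e
            for i, e in enumerate(visual_embeddings)]

def knockout_mask(seq_len, blocked_sources, target_index):
    # Attention knockout (Geva et al., 2023), rendered as a boolean mask for
    # one layer in the chosen window: True = attention allowed. The target
    # token is prevented from attending to the blocked source tokens.
    mask = [[True] * seq_len for _ in range(seq_len)]
    for s in blocked_sources:
        mask[target_index][s] = False
    return mask
```

For example, `mean_ablate([[1, 1], [2, 2], [3, 3]], [1], [0, 0])` returns `[[1, 1], [0, 0], [3, 3]]`: only the token at index 1 is replaced. In practice the mask would be applied per attention head within the selected layer window (e.g., L1–10).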