Towards Interpreting Visual Information Processing in Vision-Language Models

Authors: Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, Fazl Barez

ICLR 2025

Reproducibility
Variable: Result — LLM Response
Research Type: Experimental — Through ablation studies, we demonstrated that object identification accuracy drops by over 70% when object-specific tokens are removed. We observed that visual token representations become increasingly interpretable in the vocabulary space across layers, suggesting an alignment with textual tokens corresponding to image content. Finally, we found that the model extracts object information from these refined representations at the last token position for prediction, mirroring the process in text-only language models for factual association tasks. These findings provide crucial insights into how VLMs process and integrate visual information, bridging the gap between our understanding of language and vision models, and paving the way for more interpretable and controllable multimodal systems.
Researcher Affiliation: Collaboration — Clement Neo (Nanyang Technological University), Luke Ong (Nanyang Technological University), Philip Torr (University of Oxford), Mor Geva (Tel Aviv University), David Krueger (MILA), Fazl Barez (University of Oxford, Tangentic)
Pseudocode: No — The paper describes methods like 'ablation experiments' and the 'attention knockout technique' in narrative text and mathematical formulas (e.g., Equations 1, 2, and 3), but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code: Yes — "Our findings are first steps in understanding the internal mechanisms of VLMs, paving the way for more interpretable and controllable multimodal systems. The code for our experiments is available at https://github.com/clemneo/llava-interp."
Open Datasets: Yes — "Dataset. We use images from the COCO Detection Training set (Lin et al., 2014). To ensure the reliability of our results, we employ two filtering steps."
Dataset Splits: Yes — "After applying both filtering steps, our final dataset comprises 4,318 images."; "We manually curate a set of 100 images with specific questions about objects"; "To quantify this phenomenon, we analyzed 170 COCO validation images with objects of sizes between 20,000 and 30,000 square pixels (approximately 1/2 width and 1/3 height of the image)."
Hardware Specification: No — The paper does not provide specific details about the hardware used for the experiments, such as GPU models, CPU types, or memory capacity; it only vaguely mentions 'providing compute'.
Software Dependencies: No — The paper mentions several models (LLaVA 1.5, LLaVA-Phi, Qwen2-VL, CLIP, and Vicuna 13B) but does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used in the experiments.
Experiment Setup: Yes — "For each ablation experiment, we create a modified set of embeddings Ê_A: ê_i = ē if i ∈ S, and ê_i = e_i otherwise, where ē = (1/N) Σᵢ₌₁ᴺ eᵢ is the mean embedding across all visual tokens from 50,000 images from the ImageNet validation split (Deng et al., 2009). This replaces the hypothesized object-relevant tokens with an average token, effectively ablating their specific information. We do mean ablation to preserve the norms of the image tokens and keep them in-distribution, as their norms are typically much higher than the norms of text tokens (Bailey et al., 2023)." and "We employ the attention knockout technique introduced by Geva et al. (2023). This method involves selectively blocking attention between specific tokens at different layers of the model, allowing us to assess the importance of various connections for the model to identify the object. ... We apply this blocking over a window of consecutive layers, including early layers (L1–10), early-to-middle layers (L5–14), middle layers (L11–20), middle-to-late layers (L15–24), and late layers (L21–31)."
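The mean-ablation and attention-knockout procedures quoted above can be sketched in a few lines. This is a minimal, list-based illustration under stated assumptions, not the authors' implementation (which lives in the linked repository); the function names and the boolean-mask representation of attention knockout are hypothetical.

```python
def mean_embedding(reference_embeddings):
    # ē = (1/N) Σ eᵢ: average over visual-token embeddings collected from a
    # large reference set (the paper uses 50,000 ImageNet validation images).
    # Embeddings are represented here as plain lists of floats.
    n = len(reference_embeddings)
    dim = len(reference_embeddings[0])
    return [sum(e[j] for e in reference_embeddings) / n for j in range(dim)]

def mean_ablate(visual_embeddings, object_token_indices, mean_emb):
    # ê_i = ē if i ∈ S, e_i otherwise: replace the hypothesized
    # object-relevant tokens (set S) with the mean embedding, leaving all
    # other visual tokens unchanged. Using the mean rather than zeros keeps
    # the ablated tokens' norms in-distribution, as the paper notes.
    ablate = set(object_token_indices)
    return [mean_emb if i in ablate else e
            for i, e in enumerate(visual_embeddings)]

def knockout_mask(seq_len, blocked_sources, target_index):
    # Attention knockout (Geva et al., 2023), rendered as a boolean mask for
    # one layer in the chosen window: True = attention allowed. The target
    # token is prevented from attending to the blocked source tokens.
    mask = [[True] * seq_len for _ in range(seq_len)]
    for s in blocked_sources:
        mask[target_index][s] = False
    return mask
```

For example, `mean_ablate([[1, 1], [2, 2], [3, 3]], [1], [0, 0])` returns `[[1, 1], [0, 0], [3, 3]]`: only the token at index 1 is replaced. In practice the mask would be applied per attention head within the selected layer window (e.g., L1–10).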