Do Vision-Language Models Really Understand Visual Language?
Authors: Yifan Hou, Buse Giledereli, Yilei Tu, Mrinmaya Sachan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across domains to evaluate the recognition and reasoning abilities of models. Our evaluation of LVLMs shows that while they can accurately identify and reason about entities, their ability to understand relationships is notably limited. |
| Researcher Affiliation | Academia | Yifan Hou 1, Buse Giledereli 1, Yilei Tu 1, Mrinmaya Sachan 1. 1Department of Computer Science, ETH Zürich. |
| Pseudocode | No | The paper describes methodologies and experiments but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | At the same time, we have uploaded our evaluation code, generated synthetic diagram data, annotated real diagram data, as well as the responses of LVLMs in supplementary files. Every stage of our work ranging from code to results is introduced thoroughly and can be easily reproduced. |
| Open Datasets | Yes | To ensure that our evaluation is both controlled and generalizable, our test suite includes both clean synthetic diagrams and 1,001 annotated real diagrams carefully selected from existing datasets Krishnamurthy et al. (2016); Kembhavi et al. (2016)... The license of Food Web (Krishnamurthy et al., 2016) and AI2D (Kembhavi et al., 2016) is BSD-2-Clause and Apache-2.0 respectively. |
| Dataset Splits | Yes | To ensure that our evaluation is both controlled and generalizable, our test suite includes both clean synthetic diagrams and 1,001 annotated real diagrams carefully selected from existing datasets Krishnamurthy et al. (2016); Kembhavi et al. (2016)... We divide all real diagrams into five bins based on their entity count, ensuring that each bin contains more than 100 diagrams (detailed statistics are provided in Fig. 9). |
| Hardware Specification | No | The paper mentions that "Generally, we spend around 800$ for all experiments." but does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions using "Word2Vec embedding (Mikolov et al., 2013) based on the text attribute, and use cosine similarity implemented by spaCy (Honnibal et al., 2020)", and evaluates several LVLMs, but it does not specify version numbers for general software dependencies like spaCy itself, Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We evaluate LVLMs under the Chain-of-Thought prompting (CoT, Wei et al., 2022)... Results are consistent under the zero-shot prompting (ZS) setting (App. F.2.1)... The temperature parameter is set to 0 to ensure deterministic outputs and a seed is given to the model to help with reproducibility. The max tokens is limited to 600. |
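The Software Dependencies row notes that entities are matched via Word2Vec embeddings and "cosine similarity implemented by spaCy". Since no spaCy version is specified, the following is a minimal NumPy sketch of the underlying cosine-similarity computation (the toy vectors are illustrative, not actual Word2Vec embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors; spaCy's
    # Doc.similarity computes the same quantity over averaged token vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for Word2Vec embeddings (hypothetical values).
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])
c = np.array([0.0, 1.0, 0.0])

print(cosine_similarity(a, b))  # identical vectors -> 1.0
print(cosine_similarity(a, c))  # orthogonal vectors -> 0.0
```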
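The Experiment Setup row quotes temperature 0, a fixed seed, and a 600-token cap. A hedged sketch of how such a chat-completion request might be configured; the model name, seed value, and prompt text are placeholders, not taken from the paper:

```python
# Hypothetical request configuration mirroring the quoted decoding settings:
# temperature 0 for deterministic outputs, a fixed seed for reproducibility,
# and max tokens limited to 600. Model name and seed are illustrative.
request = {
    "model": "gpt-4o",   # placeholder model identifier, not from the paper
    "temperature": 0,    # deterministic outputs
    "seed": 42,          # illustrative fixed seed
    "max_tokens": 600,   # response length cap quoted in the paper
    "messages": [
        {
            "role": "user",
            # CoT-style prompt; actual prompt wording is not reproduced here.
            "content": "Let's think step by step. <diagram question here>",
        },
    ],
}
print(request["temperature"], request["seed"], request["max_tokens"])
```

A dict in this shape can be passed as keyword arguments to an OpenAI-compatible chat-completions client; building it separately keeps the decoding settings auditable.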