Do Vision-Language Models Really Understand Visual Language?
Authors: Yifan Hou, Buse Giledereli, Yilei Tu, Mrinmaya Sachan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across domains to evaluate the recognition and reasoning abilities of models. Our evaluation of LVLMs shows that while they can accurately identify and reason about entities, their ability to understand relationships is notably limited. |
| Researcher Affiliation | Academia | Yifan Hou 1, Buse Giledereli 1, Yilei Tu 1, Mrinmaya Sachan 1. 1Department of Computer Science, ETH Zürich. |
| Pseudocode | No | The paper describes methodologies and experiments but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | At the same time, we have uploaded our evaluation code, generated synthetic diagram data, annotated real diagram data, as well as the responses of LVLMs in supplementary files. Every stage of our work ranging from code to results is introduced thoroughly and can be easily reproduced. |
| Open Datasets | Yes | To ensure that our evaluation is both controlled and generalizable, our test suite includes both clean synthetic diagrams and 1,001 annotated real diagrams carefully selected from existing datasets Krishnamurthy et al. (2016); Kembhavi et al. (2016)... The license of Food Web (Krishnamurthy et al., 2016) and AI2D (Kembhavi et al., 2016) is BSD-2-Clause and Apache-2.0 respectively. |
| Dataset Splits | Yes | To ensure that our evaluation is both controlled and generalizable, our test suite includes both clean synthetic diagrams and 1,001 annotated real diagrams carefully selected from existing datasets Krishnamurthy et al. (2016); Kembhavi et al. (2016)... We divide all real diagrams into five bins based on their entity count, ensuring that each bin contains more than 100 diagrams (detailed statistics are provided in Fig. 9). |
| Hardware Specification | No | The paper mentions that "Generally, we spend around 800$ for all experiments." but does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions using "Word2Vec embedding (Mikolov et al., 2013) based on the text attribute, and use cosine similarity implemented by spaCy (Honnibal et al., 2020)", and evaluates several LVLMs, but it does not specify version numbers for general software dependencies like spaCy itself, Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We evaluate LVLMs under the Chain-of-Thought prompting (CoT, Wei et al., 2022)... Results are consistent under the zero-shot prompting (ZS) setting (App. F.2.1)... The temperature parameter is set to 0 to ensure deterministic outputs and a seed is given to the model to help with reproducibility. The max tokens is limited to 600. |
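The Software Dependencies row notes that entities are matched via Word2Vec embeddings and "cosine similarity implemented by spaCy". Since no spaCy version is specified, the following is a minimal NumPy sketch of the underlying cosine-similarity computation (the toy vectors are illustrative, not actual Word2Vec embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors; spaCy's
    # Doc.similarity computes the same quantity over averaged token vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for Word2Vec embeddings (hypothetical values).
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])
c = np.array([0.0, 1.0, 0.0])

print(cosine_similarity(a, b))  # identical vectors -> 1.0
print(cosine_similarity(a, c))  # orthogonal vectors -> 0.0
```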
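The Experiment Setup row quotes temperature 0, a fixed seed, and a 600-token cap. A hedged sketch of how such a chat-completion request might be configured; the model name, seed value, and prompt text are placeholders, not taken from the paper:

```python
# Hypothetical request configuration mirroring the quoted decoding settings:
# temperature 0 for deterministic outputs, a fixed seed for reproducibility,
# and max tokens limited to 600. Model name and seed are illustrative.
request = {
    "model": "gpt-4o",   # placeholder model identifier, not from the paper
    "temperature": 0,    # deterministic outputs
    "seed": 42,          # illustrative fixed seed
    "max_tokens": 600,   # response length cap quoted in the paper
    "messages": [
        {
            "role": "user",
            # CoT-style prompt; actual prompt wording is not reproduced here.
            "content": "Let's think step by step. <diagram question here>",
        },
    ],
}
print(request["temperature"], request["seed"], request["max_tokens"])
```

A dict in this shape can be passed as keyword arguments to an OpenAI-compatible chat-completions client; building it separately keeps the decoding settings auditable.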