VLM’s Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models

Authors: Nam Hyeon-Woo, Moon Ye-Bin, Wonseok Choi, Lee Hyun, Tae-Hyun Oh

TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we propose an eye examination process to investigate how a VLM perceives images, focusing on key aspects of visual recognition, ranging from basic color and shape to semantic understanding. We introduce a dataset, LENS, to guide VLMs to follow the examination and check their readiness. Once the model is ready, we conduct the examination. We quantify and visualize VLMs' sensitivities to color, shape, and semantic matching. Our findings reveal that VLMs have varying sensitivity to different colors while consistently showing insensitivity to green across different VLMs. We also found that shape sensitivity and semantic recognition vary with the LLM's capacity, despite using the same fixed visual encoder.
Researcher Affiliation | Collaboration | Nam Hyeon-Woo (EMAIL), Department of Electrical Engineering, POSTECH; Moon Ye-Bin (EMAIL), Department of Electrical Engineering, POSTECH; Wonseok Choi (EMAIL), Grad. School of AI, POSTECH; Lee Hyun (EMAIL), Department of Electrical Engineering, POSTECH & Samsung AI Center, Samsung Electronics; Tae-Hyun Oh (EMAIL), School of Computing, KAIST & Department of Electrical Engineering & Grad. School of AI, POSTECH
Pseudocode | No | The paper describes the steps for the eye examination process (e.g., color test steps 1-3 in Section 3.1) in paragraph text. It does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured code-like formatting.
Open Source Code | No | The paper does not provide an explicit statement about releasing code for the methodology, nor does it include a direct link to a code repository.
Open Datasets | No | "We introduce a dataset, LENS, to guide VLMs to follow the examination and check its readiness. More details and data statistics can be found in the Appendix. The statistics of our LENS are in Table 4." However, no specific link, DOI, or repository for the LENS dataset is provided to ensure public access.
Dataset Splits | Yes | "To give an instruction to a model about how to perform the examination, we finetune VLMs using LoRA (Hu et al., 2022) on the training set of LENS. Then, the test set of LENS is utilized to check the model's understanding of the instructions."

Table 4: Statistics of the LENS dataset.
Split      | Color | Shape | Semantic (yes or no) | Semantic (1 or 2) | Semantic (Patch)
Train      | 2,648 | 6,720 | 3,500                | 1,820             | 3,500 × 3
Validation |   568 | 3,360 | 1,000                |   520             | 1,500 × 3
Hardware Specification | Yes | "We use 8 A100 80G GPUs for our experiments."
Software Dependencies | No | The paper mentions using LoRA (Hu et al., 2022), the Adam optimizer (Kingma & Ba, 2015), and models such as LLaVA (Liu et al., 2023b) and InstructBLIP (Dai et al., 2023). However, it does not specify version numbers for general software dependencies such as programming languages (e.g., Python 3.x) or deep learning frameworks (e.g., PyTorch 1.x).
Experiment Setup | Yes | "We set the training epoch as 2, batch size 128, and learning rate 0.0002 with cosine scheduling, Adam optimizer (Kingma & Ba, 2015) and gradient checkpointing."
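The stated schedule can be sketched in a few lines. This is a minimal illustration, not the authors' code: the cosine-decay formula is one common variant, and the step count assumes the LENS train subsets from Table 4 are simply concatenated and consumed with batch size 128 over 2 epochs, as the setup states.

```python
import math

def cosine_lr(step, total_steps, base_lr=2e-4):
    # Cosine-annealed learning rate: starts at base_lr, decays to 0
    # by the final step (one common formulation of "cosine scheduling").
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# Hypothetical step count: LENS train split sizes from Table 4,
# batch size 128 and 2 training epochs as stated in the paper.
train_examples = 2648 + 6720 + 3500 + 1820 + 3500 * 3  # 25,188 examples
steps_per_epoch = math.ceil(train_examples / 128)
total_steps = 2 * steps_per_epoch

print(total_steps)                          # 394 optimizer steps
print(cosine_lr(0, total_steps))            # 0.0002 at step 0
print(cosine_lr(total_steps, total_steps))  # 0.0 at the final step
```

Under these assumptions the run is short (a few hundred optimizer steps), which is consistent with LoRA-style finetuning meant only to teach the examination format rather than new visual skills.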