VLM’s Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models
Authors: Nam Hyeon-Woo, Moon Ye-Bin, Wonseok Choi, Lee Hyun, Tae-Hyun Oh
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we propose an eye examination process to investigate how a VLM perceives images, focusing on key aspects of visual recognition, ranging from basic color and shape to semantic understanding. We introduce a dataset, LENS, to guide VLMs to follow the examination and check their readiness. Once the model is ready, we conduct the examination. We quantify and visualize VLMs' sensitivities to color and shape, and semantic matching. Our findings reveal that VLMs have varying sensitivity to different colors while consistently showing insensitivity to green across different VLMs. Also, we found different shape sensitivity and semantic recognition depending on the LLM's capacity despite using the same fixed visual encoder. |
| Researcher Affiliation | Collaboration | Nam Hyeon-Woo (Department of Electrical Engineering, POSTECH); Moon Ye-Bin (Department of Electrical Engineering, POSTECH); Wonseok Choi (Grad. School of AI, POSTECH); Lee Hyun (Department of Electrical Engineering, POSTECH; Samsung AI Center, Samsung Electronics); Tae-Hyun Oh (School of Computing, KAIST; Department of Electrical Engineering & Grad. School of AI, POSTECH) |
| Pseudocode | No | The paper describes the steps for the eye examination process (e.g., color test steps 1-3 in Section 3.1) in paragraph text. It does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured code-like formatting. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code for the methodology, nor does it include a direct link to a code repository. |
| Open Datasets | No | We introduce a dataset, LENS, to guide VLMs to follow the examination and check its readiness. More details and data statistics can be found in the Appendix. The statistics of our LENS are in Table 4. However, no specific link, DOI, or repository for the LENS dataset is provided to ensure public access. |
| Dataset Splits | Yes | To give an instruction to a model about how to perform the examination, we finetune VLMs using LoRA (Hu et al., 2022) on the training set of LENS. Then, the test set of LENS is utilized to check the model's understanding of the instructions. Table 4 (statistics of the LENS dataset) — Train: Color 2,648; Shape 6,720; Semantic (yes or no) 3,500; Semantic (1 or 2) 1,820; Patch 3,500 × 3. Validation: Color 568; Shape 3,360; Semantic (yes or no) 1,000; Semantic (1 or 2) 520; Patch 1,500 × 3. |
| Hardware Specification | Yes | We use 8 A100 80G GPUs for our experiments. |
| Software Dependencies | No | The paper mentions using LoRA (Hu et al., 2022), the Adam optimizer (Kingma & Ba, 2015), and models like LLaVA (Liu et al., 2023b) and InstructBLIP (Dai et al., 2023). However, it does not specify version numbers for general software dependencies such as programming languages (e.g., Python 3.x) or deep learning frameworks (e.g., PyTorch 1.x). |
| Experiment Setup | Yes | We set the training epoch as 2, batch size 128, and learning rate 0.0002 with cosine scheduling, the Adam optimizer (Kingma & Ba, 2015), and gradient checkpointing. |
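The reported setup (learning rate 0.0002 with cosine scheduling over 2 epochs at batch size 128) can be sketched numerically. This is a minimal illustration of a standard cosine decay, not the paper's implementation; the step count, use of the color-test train split (2,648 samples), and absence of warmup are assumptions.

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 2e-4) -> float:
    """Standard cosine-decay schedule: base_lr at step 0, decaying to ~0."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# Hypothetical step count: 2 epochs over the 2,648 color-training samples
# at batch size 128 -> ceil(2648 / 128) * 2 = 42 optimizer steps.
total_steps = math.ceil(2648 / 128) * 2
schedule = [cosine_lr(s, total_steps) for s in range(total_steps + 1)]
print(schedule[0], schedule[-1])  # starts at 2e-4, ends at (numerically) 0
```

The same `cosine_lr` function would drive a per-step `param_group["lr"]` update in a typical training loop.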