Understanding Visual Detail Hallucinations of Large Vision-Language Models

Authors: Xiaoxi Sun, Jianxin Liang, Yueqian Wang, Huishuai Zhang, Dongyan Zhao

IJCAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental First, we assess 11 state-of-the-art LVLMs, yielding several key insights: as anticipated, LVLMs perform significantly worse on queries related to small objects than on regular-sized ones, and performance on regular objects proves to be an unreliable predictor of performance on small objects. This finding underscores the need for dedicated research on fine-grained visual hallucinations. Second, we evaluate three training-free methods: Scaffold, Chain of Thought (CoT), and Image Resizing, all of which yield varying degrees of improvement. Furthermore, we conduct a series of detailed ablation studies on the visual encoders of Eagle-X5, examining their performance across fine-grained visual hallucination tasks.
Researcher Affiliation Academia Xiaoxi Sun1, Jianxin Liang1, Yueqian Wang1, Huishuai Zhang1, Dongyan Zhao1,2 1Wangxuan Institute of Computer Technology, Peking University 2State Key Laboratory of General Artificial Intelligence EMAIL, EMAIL
Pseudocode No The paper describes methods and their steps in natural language but does not present any structured pseudocode or algorithm blocks.
Open Source Code No The paper does not contain any explicit statements about releasing code, nor does it provide links to source code repositories for the described methodology.
Open Datasets Yes Inspired by the previous work [Li et al., 2023b; Rohrbach et al., 2018], we choose MSCOCO [Lin et al., 2014] as the source for our dataset.
Dataset Splits Yes
Category    # Questions    # P / # N
Existence   3000           1500/1500
Color       100            50/50
Position    400            200/200
Table 3: The statistical summary of the dataset. # Questions denotes the number of questions in the corresponding category, while # P / # N indicates the number of positive and negative examples for the respective category.
Hardware Specification No The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running its experiments.
Software Dependencies No The paper implicitly relies on common software components (e.g., Python, PyTorch) through the models and methods it references, but it does not specify version numbers for any software dependencies or libraries.
Experiment Setup Yes Image Resizing is a straightforward approach to addressing small object hallucinations. We use bicubic interpolation to resize the images by a factor of 2 before inputting them into the model. ... We conduct experiments on three methods using three distinct models: LLaVA, which employs a standard visual encoder; Qwen2-VL, which handles images of any resolution; and Eagle, which integrates multiple visual experts. ... Scaffold [Lei et al., 2024] is a visual prompt method that overlays a dot matrix within the image to serve as visual information anchors, utilizing multidimensional coordinates as textual positional references. ... Chain of Thought (CoT) [Wei et al., 2023] typically prompts models to generate the reasoning process before outputting the final answer. In this task, we specifically prompt models to utilize bounding boxes of objects mentioned in the question as intermediate information for generating the reasoning process.
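The Image Resizing baseline quoted above (bicubic upscaling by a factor of 2 before the image is fed to the model) is simple to reproduce. A minimal sketch using Pillow; the function name and interface here are illustrative assumptions, not from the paper:

```python
from PIL import Image

def upscale_for_lvlm(image_path: str, factor: int = 2) -> Image.Image:
    """Upscale an image by `factor` with bicubic interpolation,
    mirroring the paper's Image Resizing setup (factor = 2)."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    # Bicubic interpolation, as stated in the experiment setup.
    return img.resize((w * factor, h * factor), resample=Image.BICUBIC)
```

The resized image would then be passed to the LVLM's normal preprocessing pipeline; for fixed-resolution encoders (e.g., LLaVA's) the gain presumably comes from small objects occupying more patches after the model's own downsampling.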