Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs
Authors: Sreyan Ghosh, Chandra Kiran Evuru, Sonal Kumar, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on multiple visual reasoning benchmarks and LVLMs demonstrate that VDGD consistently outperforms existing baselines by 2% to 33%. (Abstract) and We evaluate six LVLMs: LLaVA-v1, LLaVA-v1.5, LLaVA-v1.6, mPLUG-Owl2, InternLM-X, and CogVLM, all built on a 7B-parameter language model. The models are tested across eight benchmarks: AMBER (visual recognition), SynthDoG (OCR), MMMU (expert-level reasoning), MathVista and MATH-Vision (mathematical reasoning), MMC (chart understanding), and MME and HallusionBench. (Section 3) |
| Researcher Affiliation | Collaboration | University of Maryland, College Park, USA; Adobe, USA |
| Pseudocode | Yes | Algorithm 1 Categorizing Visual Hallucinations |
| Open Source Code | Yes | We provide our code here: https://sreyan88.github.io/VDGD/ |
| Open Datasets | Yes | For evaluation, we employ a variety of standard benchmarks focused on reasoning and information-seeking tasks. These include LLaVA-Bench, MM-Vet (Yu et al., 2023), MMBench (Liu et al., 2023d), MME (Fu et al., 2023), MathVista (test-mini subset), MATH-Vision, and MMMU (validation set). (Section 5.2) and VaLLu: We propose the VaLLu benchmark, which is sourced from Oven, MMMU, MMC, MathVista, HallusionBench, MATH-Vision, and MME. This dataset is licensed under all the licenses of the original benchmarks from which it was sourced. (Section H) |
| Dataset Splits | Yes | MathVista (test-mini subset), MATH-Vision, and MMMU (validation set). (Section 5.2) and SynthDoG (Kim et al., 2022)... consists of 65.5k training and 500 validation entries. (Section H) |
| Hardware Specification | Yes | All our analysis, inference and baseline experiments are conducted on a node of 4 NVIDIA RTX A6000 GPUs, with 128GB RAM and 10 CPU cores. |
| Software Dependencies | Yes | The evaluations are conducted using the gpt-4-turbo-2024-04-09 model. |
| Experiment Setup | Yes | We employ greedy decoding for all methods, as we find no difference in performance with sampling. For Vanilla-sampling, we use multinomial sampling-based decoding (top-p = 0.5, temperature = 0.7). (Section 5.2) and where α is a hyper-parameter in [0, 1]. (Equation 2, Section 5.1) and All results are averaged across 3 runs. (Section 5.2) |
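The quoted decoding setup for the Vanilla-sampling baseline (multinomial sampling with top-p = 0.5 and temperature = 0.7) can be sketched as below. This is an illustrative NumPy implementation of standard temperature-scaled nucleus sampling, not the paper's released code; the function name and signature are assumptions.

```python
import numpy as np

def sample_top_p(logits, top_p=0.5, temperature=0.7, rng=None):
    """Multinomial sampling with temperature scaling and nucleus (top-p) truncation."""
    rng = rng or np.random.default_rng(0)
    # Temperature-scale the logits, then convert to a probability distribution.
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    # Sort tokens by probability (descending) and keep the smallest
    # prefix whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    # Renormalize over the nucleus and draw one token id.
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))

# Example: with a strongly peaked distribution, the nucleus collapses
# to the top token, so sampling is effectively greedy.
token = sample_top_p([10.0, 0.0, 0.0])
```

With top-p = 0.5, low-probability tokens are pruned before the multinomial draw, which is consistent with the authors' observation that sampling performs no differently from greedy decoding in their setup.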