Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs
Authors: Sreyan Ghosh, Chandra Kiran Evuru, Sonal Kumar, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on multiple visual reasoning benchmarks and LVLMs demonstrate that VDGD consistently outperforms existing baselines by 2% to 33%. (Abstract) and We evaluate six LVLMs: LLaVA-v1, LLaVA-v1.5, LLaVA-v1.6, mPLUG-Owl2, InternLM-X, and CogVLM, all built on a 7B-parameter language model. The models are tested across eight benchmarks: AMBER (visual recognition), SynthDoG (OCR), MMMU (expert-level reasoning), MathVista and MATH-Vision (mathematical reasoning), MMC (chart understanding), and MME and HallusionBench. (Section 3) |
| Researcher Affiliation | Collaboration | University of Maryland, College Park, USA; Adobe, USA |
| Pseudocode | Yes | Algorithm 1 Categorizing Visual Hallucinations |
| Open Source Code | Yes | We provide our code here: https://sreyan88.github.io/VDGD/ |
| Open Datasets | Yes | For evaluation, we employ a variety of standard benchmarks focused on reasoning and information-seeking tasks. These include LLaVA-Bench, MM-Vet (Yu et al., 2023), MMBench (Liu et al., 2023d), MME (Fu et al., 2023), MathVista (test-mini subset), MATH-Vision, and MMMU (validation set). (Section 5.2) and VaLLu: We propose the VaLLu benchmark, which is sourced from Oven, MMMU, MMC, MathVista, HallusionBench, MATH-Vision, and MME. This dataset is licensed under all the licenses of the original benchmarks from which it was sourced. (Section H) |
| Dataset Splits | Yes | MathVista (test-mini subset), MATH-Vision, and MMMU (validation set). (Section 5.2) and SynthDoG (Kim et al., 2022)... consists of 65.5k training and 500 validation entries. (Section H) |
| Hardware Specification | Yes | All our analysis, inference and baseline experiments are conducted on a node of 4 NVIDIA RTX A6000 GPUs, with 128GB RAM and 10 CPU cores. |
| Software Dependencies | Yes | The evaluations are conducted using the gpt-4-turbo-2024-04-09 model. |
| Experiment Setup | Yes | We employ greedy decoding for all methods, as we find no difference in performance with sampling. For Vanilla-sampling, we use multinomial sampling-based decoding (top-p = 0.5, temperature = 0.7). (Section 5.2) and where α is a hyper-parameter in [0, 1]. (Equation 2, Section 5.1) and All results are averaged across 3 runs. (Section 5.2) |
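The quoted decoding setup for the Vanilla-sampling baseline (multinomial sampling with top-p = 0.5 and temperature = 0.7) can be sketched as below. This is an illustrative NumPy implementation of standard temperature-scaled nucleus sampling, not the paper's released code; the function name and signature are assumptions.

```python
import numpy as np

def sample_top_p(logits, top_p=0.5, temperature=0.7, rng=None):
    """Multinomial sampling with temperature scaling and nucleus (top-p) truncation."""
    rng = rng or np.random.default_rng(0)
    # Temperature-scale the logits, then convert to a probability distribution.
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    # Sort tokens by probability (descending) and keep the smallest
    # prefix whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    # Renormalize over the nucleus and draw one token id.
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))

# Example: with a strongly peaked distribution, the nucleus collapses
# to the top token, so sampling is effectively greedy.
token = sample_top_p([10.0, 0.0, 0.0])
```

With top-p = 0.5, low-probability tokens are pruned before the multinomial draw, which is consistent with the authors' observation that sampling performs no differently from greedy decoding in their setup.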