Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning
Authors: Shengqiong Wu, Hao Fei, Liangming Pan, William Yang Wang, Shuicheng Yan, Tat-Seng Chua
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper states: "Extensive experiments demonstrate significant improvements in multiple hallucination benchmarks after integrating MLLMs with the proposed framework. In-depth analyses reveal the great potential of our methods in addressing perception- and cognition-level hallucinations." It further reports: "We conduct extensive experiments on six benchmarks, demonstrating that the existing MLLMs equipped with our proposed method show significant improvement in mitigating hallucination. In-depth analyses and visualizations show that our method helps decrease conflicts in input questions, thereby reducing erroneous outputs." Table 1 ("Evaluation on PhD and WHOOPS! benchmarks") quantifies these gains for LLaVA-1.5, Qwen-VL-Chat, MiniGPT-v2, and GPT-4V, splitting PhD into neutral (Neu.) and misleading (Mis.) questions across Object Recognition (OR), Attribute Recognition (AR), Sentiment Analysis (SA), Positional Reasoning (PR), and Counting (C); for example, GPT-4V's PhD average rises from 70.5/47.1 to 80.5 (+10.0)/64.8 (+17.7) on neutral/misleading questions, and its WHOOPS! generation score from 81.7 to 89.8 (+8.1). |
| Researcher Affiliation | Collaboration | 1National University of Singapore, Singapore; 2University of Arizona, USA; 3University of California, Santa Barbara, USA; 4Skywork AI, Singapore; 5Nanyang Technological University, Singapore |
| Pseudocode | No | The paper describes its methodology in Section 3, detailing six reasoning modules (Target Identification and Visual Perception; Visual Perception Verification; Question Validation and Adjustment; Commonsense Induction; Commonsense Verification; Question Answering). Each module is explained in paragraph text with mathematical formulations for inputs and outputs, but no structured pseudocode or algorithm blocks are provided. |
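Since the paper gives no algorithm blocks, the six modules it describes in prose could be sketched roughly as the following sequential pipeline. This is an illustrative reconstruction, not the authors' code: `mllm` stands in for any multimodal-LLM call, and all prompt strings and data shapes here are assumptions.

```python
# Hypothetical sketch of the paper's six-module bottom-up reasoning
# pipeline. `mllm(instruction, *inputs)` is a placeholder for a
# multimodal-LLM (or external-tool) call; nothing here is the authors'
# implementation.

def bottom_up_reasoning(image, question, mllm):
    # 1. Target identification and visual perception: find what the
    #    question refers to, then describe those parts of the image.
    targets = mllm("identify targets in question", question)
    percepts = mllm("describe targets in image", image, targets)

    # 2. Visual perception verification: cross-check perceived objects,
    #    attributes, and relations (the paper uses external detectors).
    percepts = mllm("verify percepts against image", image, percepts)

    # 3. Question validation and adjustment: detect conflicts between
    #    the question's premises and the verified percepts; rewrite the
    #    question if a premise is unsupported.
    question = mllm("validate and adjust question", question, percepts)

    # 4. Commonsense induction: derive world knowledge relevant to the
    #    verified visual evidence.
    commonsense = mllm("induce commonsense", percepts)

    # 5. Commonsense verification: discard induced knowledge that
    #    contradicts the visual evidence.
    commonsense = mllm("verify commonsense", commonsense, percepts)

    # 6. Question answering: answer from the validated question plus
    #    verified perception- and cognition-level evidence.
    return mllm("answer", question, percepts, commonsense)
```

Any concrete MLLM client could be dropped in for `mllm`; the point is only the ordering of the six stages, from perception-level checks up to cognition-level reasoning.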
| Open Source Code | No | The paper states: "Our framework operates without training, leveraging an open-source pre-trained model to assess performance." and "the scores with * are derived from (Kim, Kim, and Ro 2024), are copied from (Wu et al. 2024a), are re-implemented based on the open-source code." These statements indicate the use of existing open-source models/code, but there is no explicit statement or link provided for the code implementation of the proposed framework itself. |
| Open Datasets | Yes | To rigorously evaluate the performance of the proposed framework, we selected two categories of benchmarks based on the levels at which hallucinations typically occur: 1) Perception-level benchmarks are used to test the model's ability to dehallucinate visual content concerning objects, attributes, and relationships. This includes benchmarks such as POPE (Li et al. 2023), PHD (Liu et al. 2024c), AMBER (Wang et al. 2023a) and WHOOPS!-VQA (Guetta et al. 2023). 2) Cognition-level benchmarks are aimed at evaluating the model on more complex issues, such as unanswerable or ambiguous questions, or those requiring common-sense knowledge, such as sentiment analysis. For this purpose, we selected representative datasets like WHOOPS!-Gen (Guetta et al. 2023) and VQAv2-IDK (Cha et al. 2024). |
| Dataset Splits | No | The paper mentions: "The Ph D dataset is split into neural (Neu.), and misleading (Mis.) questions in Object Recognition (OR), Attribute Recognition (AR), Sentiment Analysis (SA), and Positional Reasoning (PR), and Counting (C)." While this describes a type of split, it does not provide specific percentages or absolute counts for training, validation, or test sets. For other datasets, standard benchmarks are referenced, but the specific splits used in the authors' experiments are not explicitly detailed (e.g., an "80/10/10 split" or absolute sample counts per split). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper states: "We employ Grounding DINO (Liu et al. 2023b) for object and attribute verification and BLIP (Li et al. 2022b) for validating the existence of relationships." While it names these tools, it does not provide specific version numbers for them or any other software dependencies. |
| Experiment Setup | No | The paper mentions that "Our framework operates without training" and details the logical steps of its six reasoning modules. However, it does not provide specific experimental setup details such as hyperparameters (e.g., learning rate, batch size, temperature for LLMs), optimizer settings, or other concrete system-level configurations that would be needed to reproduce the experimental results beyond the high-level description of the modules. |