Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow

Authors: Jiaqi Bai, Hongcheng Guo, Zhongyuan Peng, Jian Yang, Zhoujun Li, Mohan Li, Zhihong Tian

AAAI 2025

Reproducibility assessment (Variable: Result — LLM Response)
Research Type: Experimental — Experimental results demonstrate that the proposed AdaVIB mitigates object hallucinations by alleviating overconfidence in irrelevant visual features, with consistent improvements on two object hallucination benchmarks (MSCOCO, evaluated with CHAIR-style metrics, and POPE).
Researcher Affiliation: Academia — Jiaqi Bai (1,2), Hongcheng Guo (3), Zhongyuan Peng (4), Jian Yang (3), Zhoujun Li (3), Mohan Li (1,2, corresponding), Zhihong Tian (1,2). Affiliations: 1 Cyberspace Institute of Advanced Technology, Guangzhou University, China; 2 Huangpu Research School of Guangzhou University, China; 3 CCSE, Beihang University, China; 4 University of the Chinese Academy of Sciences, China.
Pseudocode: No — The paper describes the methodology using mathematical equations and textual explanations, but provides no explicit pseudocode or algorithm blocks.
Open Source Code: Yes — Code is available at https://github.com/jiaqi5598/AdaVIB
Open Datasets: Yes — MSCOCO (Lin et al. 2014): the Microsoft Common Objects in Context dataset, a comprehensive benchmark for visual tasks including image recognition, segmentation, and captioning. POPE (Li et al. 2023c): the Polling-based Object Probing Evaluation, a widely adopted benchmark for assessing object hallucination in Visual Question Answering (VQA).
Dataset Splits: Yes — To train the vision-language projector, 5000 image-text pairs are randomly selected from LLaVA-150k (Liu et al. 2024b), a set of GPT-generated multi-modal instruction-following data grounded on images from COCO2014. Following Zhou et al. (2024), an additional 5000 unique images are selected from the COCO2014 training set to evaluate object hallucinations, ensuring no overlap with the images used in training. POPE comprises three splits: in the Random split, absent objects are randomly selected from the whole dataset; in the Popular split, absent objects are chosen from the most frequently appearing objects; in the Adversarial split, absent objects are selected from those that frequently co-occur with ground-truth objects. Each split consists of 3000 questions on images from the COCO2014 validation set.
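The three POPE negative-sampling strategies described above can be sketched as follows. This is a minimal illustration, not the official POPE implementation; the function name and data-structure choices (a frequency Counter and a co-occurrence map) are assumptions for the example.

```python
import random
from collections import Counter

def sample_absent_object(split, image_objects, all_objects, freq, cooccur, rng=random):
    """Pick an absent (negative) object for a POPE-style probing question.

    split:         "random", "popular", or "adversarial"
    image_objects: set of objects actually present in the image
    all_objects:   every object category in the dataset
    freq:          Counter of object frequencies over the dataset
    cooccur:       dict mapping object -> Counter of co-occurring objects
    """
    candidates = [o for o in all_objects if o not in image_objects]
    if split == "random":
        # Random split: any object not present in the image.
        return rng.choice(candidates)
    if split == "popular":
        # Popular split: the most frequent object absent from the image.
        return max(candidates, key=lambda o: freq[o])
    if split == "adversarial":
        # Adversarial split: the absent object that most often co-occurs
        # with the image's ground-truth objects.
        score = Counter()
        for gt in image_objects:
            for obj, count in cooccur.get(gt, {}).items():
                if obj not in image_objects:
                    score[obj] += count
        return score.most_common(1)[0][0] if score else rng.choice(candidates)
    raise ValueError(f"unknown split: {split}")
```

The adversarial split is the hardest for hallucination-prone models precisely because the sampled object is statistically plausible in context, even though it is absent from the image.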
Hardware Specification: Yes — An A100-PCIE-40G GPU is used for training, which takes approximately 20 minutes for MiniGPT-4 and 40 minutes for LLaVA-1.5.
Software Dependencies: No — The paper mentions specific models and components such as Vicuna-7B, MiniGPT-4, LLaVA-1.5, and CLIP, but does not specify version numbers for general software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries.
Experiment Setup: Yes — All models are trained for one epoch to avoid overfitting. All baseline hyperparameters are selected via cross-validation on the MSCOCO training set. The Lagrange multiplier β is set to 1e-7 unless explicitly specified. The batch size is 2 with 8 gradient accumulation steps, and the maximum sequence length during training is 512. Inference uses greedy decoding with a maximum decoding length of 256. The learning rate is 3e-5 with a weight decay of 0.05, using a linear warm-up schedule for the first 1/10 of optimization steps followed by a polynomial decay.
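The learning-rate schedule described above (linear warm-up over the first 1/10 of steps, then polynomial decay from a base rate of 3e-5) can be sketched as a standalone function. The decay power is not stated in the paper, so `power=1.0` (linear decay) is an assumption here; the function name is hypothetical.

```python
def lr_at_step(step, total_steps, base_lr=3e-5, warmup_frac=0.1, power=1.0):
    """Learning rate at a given optimization step: linear warm-up over the
    first `warmup_frac` of steps, then polynomial decay to zero.

    The decay exponent `power` is an assumption (1.0 = linear decay).
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warm-up: ramp from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Polynomial decay over the remaining steps.
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining) ** power
```

With 100 total steps this ramps to 3e-5 by step 9 and decays linearly afterwards; in practice a framework scheduler (e.g., a lambda-based LR scheduler) wrapping an equivalent function would be used.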