Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow

Authors: Jiaqi Bai, Hongcheng Guo, Zhongyuan Peng, Jian Yang, Zhoujun Li, Mohan Li, Zhihong Tian

AAAI 2025

Reproducibility assessment (Variable: Result — LLM Response)
Research Type: Experimental — Experimental results demonstrate that the proposed AdaVIB mitigates object hallucinations by alleviating overconfidence in irrelevant visual features, with consistent improvements on two object hallucination benchmarks (MSCOCO, evaluated with CHAIR-style metrics, and POPE).
Researcher Affiliation: Academia — Jiaqi Bai (1,2), Hongcheng Guo (3), Zhongyuan Peng (4), Jian Yang (3), Zhoujun Li (3), Mohan Li (1,2, corresponding), Zhihong Tian (1,2). Affiliations: 1 Cyberspace Institute of Advanced Technology, Guangzhou University, China; 2 Huangpu Research School of Guangzhou University, China; 3 CCSE, Beihang University, China; 4 University of the Chinese Academy of Sciences, China.
Pseudocode: No — The paper describes the methodology using mathematical equations and textual explanations, but provides no explicit pseudocode or algorithm blocks.
Open Source Code: Yes — Code is available at https://github.com/jiaqi5598/AdaVIB
Open Datasets: Yes — MSCOCO (Lin et al. 2014): the Microsoft Common Objects in Context dataset, a comprehensive benchmark for visual tasks including image recognition, segmentation, and captioning. POPE (Li et al. 2023c): the Polling-based Object Probing Evaluation, a widely adopted benchmark for assessing object hallucination in Visual Question Answering (VQA).
Dataset Splits: Yes — To train the vision-language projector, 5000 image-text pairs are randomly selected from LLaVA-150k (Liu et al. 2024b), a set of GPT-generated multi-modal instruction-following data grounded on images from COCO2014. Following Zhou et al. (2024), an additional 5000 unique images are selected from the COCO2014 training set to evaluate object hallucinations, ensuring no overlap with the images used in training. POPE comprises three splits: in the Random split, absent objects are randomly selected from the whole dataset; in the Popular split, absent objects are chosen from the most frequently appearing objects; in the Adversarial split, absent objects are selected from those that frequently co-occur with ground-truth objects. Each split consists of 3000 questions on images from the COCO2014 validation set.
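The three POPE negative-sampling strategies described above can be sketched as follows. This is a minimal illustration, not the official POPE implementation; the function name and data-structure choices (a frequency Counter and a co-occurrence map) are assumptions for the example.

```python
import random
from collections import Counter

def sample_absent_object(split, image_objects, all_objects, freq, cooccur, rng=random):
    """Pick an absent (negative) object for a POPE-style probing question.

    split:         "random", "popular", or "adversarial"
    image_objects: set of objects actually present in the image
    all_objects:   every object category in the dataset
    freq:          Counter of object frequencies over the dataset
    cooccur:       dict mapping object -> Counter of co-occurring objects
    """
    candidates = [o for o in all_objects if o not in image_objects]
    if split == "random":
        # Random split: any object not present in the image.
        return rng.choice(candidates)
    if split == "popular":
        # Popular split: the most frequent object absent from the image.
        return max(candidates, key=lambda o: freq[o])
    if split == "adversarial":
        # Adversarial split: the absent object that most often co-occurs
        # with the image's ground-truth objects.
        score = Counter()
        for gt in image_objects:
            for obj, count in cooccur.get(gt, {}).items():
                if obj not in image_objects:
                    score[obj] += count
        return score.most_common(1)[0][0] if score else rng.choice(candidates)
    raise ValueError(f"unknown split: {split}")
```

The adversarial split is the hardest for hallucination-prone models precisely because the sampled object is statistically plausible in context, even though it is absent from the image.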
Hardware Specification: Yes — An A100-PCIE-40G GPU is used for training, which takes approximately 20 minutes for MiniGPT-4 and 40 minutes for LLaVA-1.5.
Software Dependencies: No — The paper mentions specific models and components such as Vicuna-7B, MiniGPT-4, LLaVA-1.5, and CLIP, but does not specify version numbers for general software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries.
Experiment Setup: Yes — All models are trained for one epoch to avoid overfitting. All baseline hyperparameters are selected via cross-validation on the MSCOCO training set. The Lagrange multiplier β is set to 1e-7 unless explicitly specified. The batch size is 2 with 8 gradient accumulation steps, and the maximum sequence length during training is 512. Inference uses greedy decoding with a maximum decoding length of 256. The learning rate is 3e-5 with a weight decay of 0.05, using a linear warm-up schedule for the first 1/10 of optimization steps followed by a polynomial decay.
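The learning-rate schedule described above (linear warm-up over the first 1/10 of steps, then polynomial decay from a base rate of 3e-5) can be sketched as a standalone function. The decay power is not stated in the paper, so `power=1.0` (linear decay) is an assumption here; the function name is hypothetical.

```python
def lr_at_step(step, total_steps, base_lr=3e-5, warmup_frac=0.1, power=1.0):
    """Learning rate at a given optimization step: linear warm-up over the
    first `warmup_frac` of steps, then polynomial decay to zero.

    The decay exponent `power` is an assumption (1.0 = linear decay).
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warm-up: ramp from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Polynomial decay over the remaining steps.
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining) ** power
```

With 100 total steps this ramps to 3e-5 by step 9 and decays linearly afterwards; in practice a framework scheduler (e.g., a lambda-based LR scheduler) wrapping an equivalent function would be used.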