Mitigating Object Hallucination in MLLMs via Data-augmented Phrase-level Alignment
Authors: Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan Arik, Tomas Pfister
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our thorough evaluation on various benchmarks confirms the effectiveness of DPA in mitigating hallucination while retaining the out-of-the-box performance of the MLLMs on general tasks. For instance, MLLMs finetuned with DPA, which we refer to as Hallucination Attenuated Language and Vision Assistant (HALVA), improve F1 by up to 13.4% on hallucination visual question-answering and reduce the hallucination rate by up to 4.2% on image description tasks. |
| Researcher Affiliation | Collaboration | Queen's University, Vector Institute, Google DeepMind, Google Cloud AI Research |
| Pseudocode | Yes | We present the pseudocode in Appendix A. The listing begins: `import torch`; `import torch.nn.functional as F`; `def forward(self, **inputs):` |
| Open Source Code | Yes | We open-source the code, checkpoints, and the generated hallucinated and correct response pairs used in training, at GitHub. https://github.com/pritamqu/HALVA |
| Open Datasets | Yes | We prepare vision-language instructions based on Visual Genome (VG) (Krishna et al., 2017), which is an object-centric image dataset consisting of a total of 108K images and their annotations. |
| Dataset Splits | No | Our final training set consists of a total of 21.5K vision-language instructions and their corresponding correct and hallucinated responses. |
| Hardware Specification | Yes | All experiments are conducted on 4 A100-80GB GPUs. |
| Software Dependencies | No | Below, we provide a PyTorch-based pseudocode. |
| Experiment Setup | Yes | We utilize an effective batch size of 64 and train for 1 epoch (342 steps). Training takes 1.5 to 3 hours for the 7B and 13B variants; additional implementation details are presented in Appendix D. Table S15 (DPA training hyperparameters, listed for HALVA7B / HALVA13B / HALVA13B/384): learning rate 5e-6 / 2.5e-5; cosine learning-rate scheduler; AdamW optimizer (Loshchilov & Hutter, 2017); weight decay 0; warmup ratio 0.03; 1 epoch (342 steps); batch size 16 per GPU, 64 total; loss coefficient α = 0.4 / 0.5 / 0.2; ZeRO stage-3 memory optimization (Ren et al., 2021; Rajbhandari et al., 2021); training time 1.5 hrs / 3 hrs / 3 hrs. |
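The table quotes a PyTorch `forward` that contrasts correct and hallucinated responses, weighted by a loss coefficient α. As a rough illustrative sketch only (this is not the authors' released implementation; the tensor shapes, the hallucinated-phrase mask, and the KL regularizer toward a frozen reference model are all assumptions made here), a phrase-level alignment loss of this flavor could look like:

```python
import torch
import torch.nn.functional as F

def dpa_style_loss(logits, ref_logits, labels, halluc_mask, alpha=0.4):
    """Illustrative phrase-level alignment loss (assumed form, not the paper's code).

    logits:      (B, T, V) from the model being finetuned
    ref_logits:  (B, T, V) from a frozen reference model (assumed component)
    labels:      (B, T) target token ids
    halluc_mask: (B, T) bool, True on tokens of hallucinated phrases
    alpha:       loss coefficient, as in Table S15 (0.4 / 0.5 / 0.2)
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (B, T)

    # Standard language-modeling loss on non-hallucinated tokens.
    keep = ~halluc_mask
    lm_loss = -(token_logp * keep).sum() / keep.sum().clamp(min=1)

    # Minimizing the mean log-prob of hallucinated-phrase tokens
    # pushes their likelihood down.
    halluc_loss = (token_logp * halluc_mask).sum() / halluc_mask.sum().clamp(min=1)

    # KL toward a frozen reference model, to retain general ability (assumed).
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl = F.kl_div(logp, ref_logp, log_target=True, reduction="batchmean")

    return lm_loss + alpha * halluc_loss + kl
```

The α coefficient here trades off hallucination suppression against the plain LM objective, which is consistent with the small per-model values (0.2–0.5) reported in Table S15, though the exact loss composition in the paper may differ.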