Mitigating Object Hallucination in MLLMs via Data-augmented Phrase-level Alignment
Authors: Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan Arik, Tomas Pfister
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our thorough evaluation on various benchmarks confirms the effectiveness of DPA in mitigating hallucination while retaining the out-of-the-box performance of the MLLMs on general tasks. For instance, MLLMs finetuned with DPA, which we refer to as Hallucination Attenuated Language and Vision Assistant (HALVA), improve F1 by up to 13.4% on hallucination visual question-answering and reduce the hallucination rate by up to 4.2% on image description tasks. |
| Researcher Affiliation | Collaboration | Queen's University, Vector Institute, Google DeepMind, Google Cloud AI Research |
| Pseudocode | Yes | We present the pseudocode in Appendix A. The listing begins: `import torch`; `import torch.nn.functional as F`; `def forward(self, **inputs):` |
| Open Source Code | Yes | We open-source the code, checkpoints, and the generated hallucinated and correct response pairs used in training, at GitHub. https://github.com/pritamqu/HALVA |
| Open Datasets | Yes | We prepare vision-language instructions based on Visual Genome (VG) (Krishna et al., 2017), which is an object-centric image dataset consisting of a total of 108K images and their annotations. |
| Dataset Splits | No | Our final training set consists of a total of 21.5K vision-language instructions and their corresponding correct and hallucinated responses. |
| Hardware Specification | Yes | All experiments are conducted on 4 A100-80GB GPUs. |
| Software Dependencies | No | Below, we provide a PyTorch-based pseudocode. |
| Experiment Setup | Yes | We utilize an effective batch size of 64 and train for 1 epoch (342 steps). Training takes 1.5 to 3 hours for the 7B and 13B variants; additional implementation details are presented in Appendix D. Table S15 (DPA training hyperparameters, listed for HALVA7B / HALVA13B / HALVA13B/384): learning rate 5e-6 / 2.5e-5; cosine learning-rate scheduler; AdamW optimizer (Loshchilov & Hutter, 2017); weight decay 0; warmup ratio 0.03; 1 epoch (342 steps); batch size 16 per GPU, 64 total; loss coefficient α = 0.4 / 0.5 / 0.2; ZeRO stage-3 memory optimization (Ren et al., 2021; Rajbhandari et al., 2021); training time 1.5 hrs / 3 hrs / 3 hrs. |
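The table quotes a PyTorch `forward` that contrasts correct and hallucinated responses, weighted by a loss coefficient α. As a rough illustrative sketch only (this is not the authors' released implementation; the tensor shapes, the hallucinated-phrase mask, and the KL regularizer toward a frozen reference model are all assumptions made here), a phrase-level alignment loss of this flavor could look like:

```python
import torch
import torch.nn.functional as F

def dpa_style_loss(logits, ref_logits, labels, halluc_mask, alpha=0.4):
    """Illustrative phrase-level alignment loss (assumed form, not the paper's code).

    logits:      (B, T, V) from the model being finetuned
    ref_logits:  (B, T, V) from a frozen reference model (assumed component)
    labels:      (B, T) target token ids
    halluc_mask: (B, T) bool, True on tokens of hallucinated phrases
    alpha:       loss coefficient, as in Table S15 (0.4 / 0.5 / 0.2)
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (B, T)

    # Standard language-modeling loss on non-hallucinated tokens.
    keep = ~halluc_mask
    lm_loss = -(token_logp * keep).sum() / keep.sum().clamp(min=1)

    # Minimizing the mean log-prob of hallucinated-phrase tokens
    # pushes their likelihood down.
    halluc_loss = (token_logp * halluc_mask).sum() / halluc_mask.sum().clamp(min=1)

    # KL toward a frozen reference model, to retain general ability (assumed).
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl = F.kl_div(logp, ref_logp, log_target=True, reduction="batchmean")

    return lm_loss + alpha * halluc_loss + kl
```

The α coefficient here trades off hallucination suppression against the plain LM objective, which is consistent with the small per-model values (0.2–0.5) reported in Table S15, though the exact loss composition in the paper may differ.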