Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations

Authors: Nick Jiang, Anish Kachinthaya, Suzanne Petryk, Yossi Gandelsman

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that targeted edits to a model's latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. Our findings demonstrate how a deeper understanding of VLMs' latent representations can enhance reliability and enable novel capabilities, such as zero-shot segmentation. We apply PROJECTAWAY on 5000 random images from the COCO2014 training set on all mentioned COCO objects (i.e. hallucination and CD) individually and measure the removal rate at which objects no longer appear in the caption. We evaluate the strength of the internal confidence c_o as an indicator of object presence. We sample 5000 images from the MSCOCO training set, using the image captioning objective to generate captions with both InstructBLIP and LLaVA. Quantitative results in Table 2 show that we outperform our baselines and reduce hallucinations by 25.7% on InstructBLIP and 23.8% on LLaVA compared to beam search.
Researcher Affiliation | Academia | Nick Jiang, Anish Kachinthaya, Suzanne Petryk, Yossi Gandelsman (University of California, Berkeley)
Pseudocode | Yes | Algorithm 1: PROJECTAWAY
Open Source Code | Yes | Code: https://github.com/nickjiang2378/vl-interp
Open Datasets | Yes | reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. We use InstructBLIP and LLaVA to caption 5000 random COCO2014 images in the Karpathy validation split (Lin et al., 2015) and determine c_o for all 80 COCO objects... We evaluate our method on the ImageNet validation set. We filter the VQA dataset for color and object-count inaccuracies and correct answers with low confidence scores (c_o < 0.05) using PROJECTAWAY.
Dataset Splits | Yes | We use InstructBLIP and LLaVA to caption 5000 random COCO2014 images in the Karpathy validation split (Lin et al., 2015)... We apply PROJECTAWAY on 5000 random images from the COCO2014 training set... We evaluate across 500 training samples from COCO2014 that have at least one hallucination. We apply these parameters to 500 COCO images from the Karpathy validation set. We evaluate our method on the ImageNet validation set.
Hardware Specification | No | No specific hardware details (such as GPU or CPU models) are mentioned in the paper.
Software Dependencies | No | No specific software dependencies with version numbers are mentioned in the paper.
Experiment Setup | Yes | For InstructBLIP, we set (l_I, l_T, α) = (1, 2, 1.5). For LLaVA, we set (l_I, l_T, α) = (19, 21, 3.5). We set p = 0.9 for nucleus sampling. We use beam search in our method and unify N_beam = 5 for the baseline. We threshold hallucinations as c_o < 0.2 for InstructBLIP and c_o < 0.1 for LLaVA. Based on prior ablations (Section 4.2), we select (l_I = 1, l_T = 2, α = 1.5) for InstructBLIP and (l_I = 19, l_T = 21, α = 3.5) for LLaVA.
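For context on the quoted setup: c_o is a logit-lens style confidence that an object is encoded in the model's image representations, and PROJECTAWAY erases an object by subtracting the component of each image-token hidden state along that object's text embedding, scaled by the strength α, at layers between l_I and l_T. The following is a minimal numpy sketch of these two operations only; the function names, the single-layer framing, and the array shapes are our own illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def project_away(h, t, alpha=1.5):
    """Subtract the component of hidden states h along text embedding t.

    h: (n_tokens, d) image-token hidden states at one layer (illustrative)
    t: (d,) text embedding of the object to remove
    alpha: edit strength (the report quotes 1.5 for InstructBLIP, 3.5 for LLaVA)
    """
    t_unit = t / np.linalg.norm(t)
    coeffs = h @ t_unit                       # projection coefficients, (n_tokens,)
    return h - alpha * np.outer(coeffs, t_unit)

def internal_confidence(h, W_U, object_token_id):
    """Logit-lens estimate of c_o: max softmax probability assigned to the
    object's token across image-token hidden states.

    W_U: (d, vocab) unembedding matrix (hypothetical name).
    """
    logits = h @ W_U                          # (n_tokens, vocab)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return float(probs[:, object_token_id].max())
```

With α = 1 the edit fully removes the component along t (the edited states become orthogonal to the object embedding); the reported settings α > 1 overshoot past orthogonality, pushing the representation away from the object direction.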