Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
Authors: Nick Jiang, Anish Kachinthaya, Suzanne Petryk, Yossi Gandelsman
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that targeted edits to a model's latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. Our findings demonstrate how a deeper understanding of VLMs' latent representations can enhance reliability and enable novel capabilities, such as zero-shot segmentation. We apply PROJECTAWAY on 5000 random images from the COCO2014 training set on all mentioned COCO objects (i.e. hallucination and CD) individually and measure the removal rate at which objects no longer appear in the caption. We evaluate the strength of the internal confidence c_o as an indicator of object presence. We sample 5000 images from the MSCOCO training set, using the image captioning objective to caption images with both InstructBLIP and LLaVA. Quantitative results in Table 2 show that we outperform our baselines and reduce hallucinations by 25.7% on InstructBLIP and 23.8% on LLaVA compared to beam search. |
| Researcher Affiliation | Academia | Nick Jiang, Anish Kachinthaya, Suzanne Petryk, Yossi Gandelsman, University of California, Berkeley |
| Pseudocode | Yes | Algorithm 1: PROJECTAWAY |
| Open Source Code | Yes | Code: https://github.com/nickjiang2378/vl-interp |
| Open Datasets | Yes | reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. We use InstructBLIP and LLaVA to caption 5000 random COCO2014 images in the Karpathy validation split (Lin et al., 2015) and determine c_o for all 80 COCO objects... We evaluate our method on the ImageNet validation set. We filter the VQA dataset for color and object number inaccuracies and correct answers with low confidence scores (c_o < 0.05) using PROJECTAWAY. |
| Dataset Splits | Yes | We use InstructBLIP and LLaVA to caption 5000 random COCO2014 images in the Karpathy validation split (Lin et al., 2015)... We apply PROJECTAWAY on 5000 random images from the COCO2014 training set... We evaluate across 500 training samples from COCO2014 that have at least one hallucination. We apply these parameters to 500 COCO images from the Karpathy validation set. We evaluate our method on the ImageNet validation set. |
| Hardware Specification | No | No specific hardware details (like GPU or CPU models) are mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers are mentioned in the paper. |
| Experiment Setup | Yes | For InstructBLIP, we set (l_I, l_T, α) = (1, 2, 1.5). For LLaVA, we set (l_I, l_T, α) = (19, 21, 3.5). We set p = 0.9 for nucleus sampling. We use beam search in our method and N_beam = 5 for the baseline. We threshold hallucinations as c_o < 0.2 for InstructBLIP and c_o < 0.1 for LLaVA. Based on prior ablations (Section 4.2), we select (l_I = 1, l_T = 2, α = 1.5) for InstructBLIP and (l_I = 19, l_T = 21, α = 3.5) for LLaVA. |
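The core edit quoted above, PROJECTAWAY with scaling coefficient α, amounts to subtracting a scaled projection of image hidden states onto a text-embedding direction for the hallucinated object. A minimal sketch of that linear-algebra step is below; the function name, array shapes, and use of NumPy are illustrative assumptions, not the authors' released implementation (which operates on transformer hidden states at layers l_I and l_T).

```python
import numpy as np

def project_away(hidden_states: np.ndarray, text_embed: np.ndarray,
                 alpha: float = 1.5) -> np.ndarray:
    """Subtract alpha times the component of each hidden state along
    the (normalized) text-embedding direction.

    hidden_states: (num_tokens, dim) image-token representations
    text_embed:    (dim,) embedding of the object word to remove
    alpha:         edit strength (e.g. 1.5 for InstructBLIP, 3.5 for LLaVA
                   per the setup quoted above)
    """
    u = text_embed / np.linalg.norm(text_embed)          # unit direction
    coeffs = hidden_states @ u                           # per-token projection
    return hidden_states - alpha * coeffs[:, None] * u[None, :]
```

With α = 1, each edited hidden state is exactly orthogonal to the text direction; α > 1 overshoots past orthogonality, which is what the larger coefficients in the table correspond to.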