Reducing Hallucinations in Large Vision-Language Models via Latent Space Steering

Authors: Sheng Liu, Haotian Ye, James Y. Zou

ICLR 2025

Reproducibility Assessment. Each entry below lists the variable, the assessed result, and the supporting LLM response.
Research Type: Experimental. Response: "Extensive experiments demonstrate that it can effectively reduce hallucinations and outperform baseline methods across multiple metrics, highlighting the critical role of vision feature stability in LVLMs."
Researcher Affiliation: Academia. Response: Sheng Liu, Haotian Ye, James Zou (Stanford University).
Pseudocode: No. Response: "Figure 3: Overview of the proposed algorithm, visual and textual test-time intervention (VTI). Given an example set {(v_i, x_i, x̃_i)}_{i=1}^N, where v_i is the vision input and (x_i, x̃_i) are paired captions with and without hallucination, VTI first runs the model on each query (v_i, x_i, x̃_i) and records all hidden states. It then computes the shifting vectors d^vision_{l,t} and d^text_{l,t} for every layer l and token t according to Section 4. During inference, these vectors are added to every layer of the vision encoder and text decoder, respectively, when processing a new query. Note that the vectors are task- and dataset-agnostic, i.e., they are pre-computed from a few samples of one specific task and dataset and held fixed throughout all experiments in the paper."
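The intervention described above can be sketched in a few lines. This is an illustrative simplification: the paper derives the shift directions per Section 4 (the visual direction uses randomly masked inputs, per the experiment setup), whereas here a plain mean difference of recorded hidden states stands in; all function names, variable names, and tensor shapes are hypothetical.

```python
import numpy as np

def compute_shift_vectors(h_clean, h_halluc):
    """Mean-difference stand-in for the shifting vectors d_{l,t}.

    h_clean / h_halluc: hidden states recorded for the paired captions
    without / with hallucination, shape (N, L, T, D): N example pairs,
    L layers, T tokens, D hidden size (shapes are illustrative).
    Returns one shift vector per (layer, token): shape (L, T, D).
    """
    return (h_clean - h_halluc).mean(axis=0)

def apply_intervention(hidden, shift, strength):
    """At inference, add the precomputed, fixed shift (scaled by the
    strength hyperparameter α or β) to a new query's hidden states."""
    return hidden + strength * shift
```

The key design point the figure caption emphasizes is that `shift` is computed once from a small example set and then reused unchanged for every new query.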
Open Source Code: Yes. Response: "Code is available at https://github.com/shengliu66/VTI."
Open Datasets: Yes. Response: "We evaluate our model on both discriminative and generative datasets, as listed below. More details about the datasets are provided in the appendix. (a) POPE: The Polling-based Object Probing Evaluation (Li et al., 2023c) contains 27,000 Yes/No questions about object existence in MSCOCO (Lin et al., 2014), where the task is to judge whether the given object is in the given image (examples are provided in Figure 7). Following existing works, we compute accuracy, precision, recall, and F1 score for each method. (b) CHAIR: Caption Hallucination Assessment with Image Relevance (Rohrbach et al., 2018) quantifies object hallucinations in image captions by comparing generated objects to ground-truth objects. Following previous works (Huang et al., 2023; Yue et al., 2024), we randomly select 500 images from the MSCOCO dataset (Lin et al., 2014) and use CHAIR_I, CHAIR_S, and Recall as evaluation metrics. (c) MMHAL-Bench (Sun et al., 2023): This benchmark evaluates LVLMs beyond object hallucination and contains eight question types: object attributes, adversarial objects, comparisons, counting, spatial relations, environment, holistic description, and others."
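For reference, the CHAIR metrics cited above can be computed roughly as follows. This is a hedged sketch of the standard definitions from Rohrbach et al. (2018): CHAIR_I is the fraction of mentioned object instances that are hallucinated, and CHAIR_S is the fraction of captions containing at least one hallucinated object; the function name and input representation are hypothetical.

```python
def chair_metrics(caption_objects, gt_objects):
    """Compute (CHAIR_I, CHAIR_S) over a set of captioned images.

    caption_objects: list of sets of objects mentioned in each caption.
    gt_objects: list of sets of ground-truth objects for each image.
    """
    total_mentions = hallucinated_mentions = bad_captions = 0
    for mentioned, gt in zip(caption_objects, gt_objects):
        hallucinated = mentioned - gt  # objects mentioned but not present
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        bad_captions += bool(hallucinated)
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    chair_s = bad_captions / max(len(caption_objects), 1)
    return chair_i, chair_s
```

Lower is better for both metrics; Recall (fraction of ground-truth objects actually mentioned) complements them, since never mentioning any object would trivially drive CHAIR to zero.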
Dataset Splits: Yes. Response: "We evaluate our model on both discriminative and generative datasets, as listed below. More details about the datasets are provided in the appendix. (a) POPE: The Polling-based Object Probing Evaluation (Li et al., 2023c) contains 27,000 Yes/No questions about object existence in MSCOCO (Lin et al., 2014)... (b) CHAIR: ...we randomly select 500 images from the MSCOCO dataset (Lin et al., 2014)... (c) MMHAL-Bench (Sun et al., 2023): This benchmark evaluates LVLMs beyond object hallucination and contains eight question types... We utilize the official benchmark from Li et al. (2023c), which includes 3,000 question-answer pairs for each of the random, popular, and adversarial settings."
Hardware Specification: No. Response: The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies: No. Response: The paper does not name ancillary software with version numbers (e.g., library or solver names pinned to specific versions).
Experiment Setup: Yes. Response: "We perform a grid search over the strengths of the vision and text vectors, where α, β ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. For baseline methods, we followed the settings in their papers and released code to ensure a fair comparison. More details are provided in Appendix A. In all experimental setups, the mask ratio used to compute the visual direction is set to 0.99, and we average across 50 random masks. For experiments on CHAIR, to maintain similar generation lengths, we set α = 0.4 for visual intervention only and β = 0.4 for textual intervention only. For VTI, α = β = 0.4. For experiments on MMHAL-Bench, we adopted α = 0.9 and β = 0.9."
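The grid search over intervention strengths amounts to a simple exhaustive sweep over the 10 × 10 grid quoted above. A minimal sketch, assuming a caller-supplied `evaluate(alpha, beta)` scoring callback (hypothetical, e.g. a validation score combining hallucination metrics and generation length):

```python
import itertools

# Grid from the paper's setup: α, β ∈ {0.1, 0.2, ..., 1.0}
GRID = [round(0.1 * k, 1) for k in range(1, 11)]

def grid_search(evaluate):
    """Return the (alpha, beta) pair that maximizes evaluate(alpha, beta).

    `evaluate` is a hypothetical callback supplied by the caller; the
    paper reports per-benchmark choices (e.g. α = β = 0.4 on CHAIR).
    """
    return max(itertools.product(GRID, GRID),
               key=lambda ab: evaluate(*ab))
```

Since the shift vectors themselves are fixed, the sweep only rescales them, so all 100 settings can reuse the same precomputed directions.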