Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models
Authors: Fushuo Huo, Wenchao Xu, Zhong Zhang, Haozhao Wang, Zhicheng Chen, Peilin Zhao
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments illustrate that SID generates less-hallucination and higher-quality texts across various metrics, without much additional computation cost. ... 5.2 EVALUATION RESULTS In this section, we follow previous methods (Leng et al., 2024; Wang et al., 2024a; Huang et al., 2024) to evaluate the SID on CHAIR (Rohrbach et al., 2018) and POPE (Li et al., 2023d) metrics. Besides manually designed metrics, we also leverage GPT-4 assisted benchmark (Zhao et al., 2024) to evaluate attribute, location, and relation hallucinations. |
| Researcher Affiliation | Collaboration | Fushuo Huo1, Wenchao Xu2, Zhong Zhang3, Haozhao Wang4, Zhicheng Chen5, Peilin Zhao3 — 1Department of Computing, The Hong Kong Polytechnic University; 2Division of Integrative Systems and Design, Hong Kong University of Science and Technology; 3Tencent AI Lab; 4Huazhong University of Science and Technology; 5Tsinghua University |
| Pseudocode | No | The paper describes its method only in regular paragraph text, without a structured algorithm block or pseudocode. |
| Open Source Code | Yes | Codes are available at https://github.com/huofushuo/SID. |
| Open Datasets | Yes | As for CHAIR, Following (Wang et al., 2024a; Huang et al., 2024; Yue et al., 2024b), we randomly select 500 images from the validation set of the MSCOCO (Lin et al., 2014) dataset... For the POPE metric, which comprises three datasets, we average the results in Table 4. Our method performs best overall in random, popular, and adversarial sampling settings. Specifically, in the sampling decoding setting, SID surpasses the normal sampling decoding by a large margin in a train-free manner. SID also clearly outperforms CD methods (Dola, ICD, and VCD) because the self-introspective decoding strategy amplifies vision-and-text association hallucinations then subtracts them, rather than coarsely disturbing raw inputs. Additionally, owing to the context and text-aware token selection strategy, SID is more computation-efficient than CD methods, as analyzed in Table 6. Note that beam-search based OPERA (Huang et al., 2024) shows almost no gain in the POPE metric, primarily because answering the binary classification only requires a few tokens and selecting the best beam score in a decoded sequence (N=5) brings little improvement. |
| Dataset Splits | Yes | As for CHAIR, Following (Wang et al., 2024a; Huang et al., 2024; Yue et al., 2024b), we randomly select 500 images from the validation set of the MSCOCO (Lin et al., 2014) dataset and query different LVLMs with the prompt: Please describe this image in detail. ... POPE involves 500 images from each dataset with six questions each, ultimately yielding 27,000 query-answer pairs. ... we utilize 200 images from the VG dataset and set max new tokens to 512, with the prompt of Please describe this image in detail. |
| Hardware Specification | Yes | Experiments are performed on NVIDIA V100/A100 GPUs. |
| Software Dependencies | No | The paper only mentions software names (e.g., Vicuna (Chiang & Li, 2023), LLaMA 2 (Touvron et al., 2023b), LLaMA 3 (Meta, 2024), CLIP (Radford et al., 2021)) without specific version numbers for the libraries or frameworks used in their implementation. |
| Experiment Setup | Yes | Implementation Details. As analyzed in Sec. 4.2, we set Layer i=3 and preserve top 10% least important vision tokens for Shikra, LLaVA-1.5, and LLaVA-NeXT and i=5 and top 10% least important vision tokens for Q-former based LVLMs (InstructBLIP) to induce fine-grained hallucinations. Hyperparameters in Eq. 2 and 3 follow VCD and ICD. More details are in Appendix A.2. ... We set the max new tokens to 512 to generate responses for fair comparisons. ... The repetition penalty is set to 1.2, as Dola suggests. OPERA, VCD, and ICD are proposed for LVLMs and we adopt the default settings. |
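The Experiment Setup row describes two mechanical steps: keeping only the least-important vision tokens to induce amplified hallucinations, and a VCD-style contrastive adjustment that subtracts the amplified logits from the original ones. The sketch below is a minimal illustration of those two steps under stated assumptions — the function names, the use of raw attention scores as the importance measure, and the exact `(1 + alpha) * orig - alpha * amp` form of the adjustment are assumptions (the paper says its hyperparameters follow VCD and ICD), not the authors' released implementation.

```python
import numpy as np

def select_least_important_tokens(attn_scores: np.ndarray, keep_ratio: float = 0.10) -> np.ndarray:
    """Return indices of the bottom `keep_ratio` fraction of vision tokens
    ranked by an importance score (assumed here to be attention mass)."""
    n = attn_scores.shape[0]
    k = max(1, int(n * keep_ratio))  # keep at least one token
    # argsort is ascending, so the first k indices are the least important
    return np.argsort(attn_scores)[:k]

def contrastive_logits(original: np.ndarray, amplified: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """VCD-style contrastive adjustment: boost the original logits and
    subtract the amplified-hallucination logits."""
    return (1.0 + alpha) * original - alpha * amplified
```

With `keep_ratio=0.10`, a 576-token vision input would retain 57 tokens for the hallucination-amplified forward pass; the adjusted logits then down-weight tokens the amplified pass favors.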