Intervening Anchor Token: Decoding Strategy in Alleviating Hallucinations for MLLMs

Authors: Feilong Tang, Zile Huang, Chengzhi Liu, Qiang Sun, Harry Yang, Ser-Nam Lim

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments reveal a correlation between the eigenspectrum and hallucinations across various MLLMs and show that TAME reduces the percentage of hallucinated objects. Code released at https://github.com/Everlyn-Labs/ANTRP.
Researcher Affiliation | Collaboration | Feilong Tang (HKUST, Everlyn AI), Zile Huang (HKUST, Everlyn AI), Chengzhi Liu (University of Liverpool), Qiang Sun (Everlyn AI, University of Toronto), Harry Yang (HKUST, Everlyn AI), Ser-Nam Lim (Everlyn AI, UCF)
Pseudocode | Yes | Algorithm 1: Pseudo-code of TAME in a PyTorch-like style.
Open Source Code | Yes | Code released at https://github.com/Everlyn-Labs/ANTRP.
Open Datasets | Yes | We perform the CHAIR evaluation on the MSCOCO dataset (Lin et al., 2014)...HalluBench (Zhao et al., 2023) represents a more advanced benchmark, utilizing detailed object-level descriptions from the VG dataset (Krishna et al., 2017)...SEED-Bench (Li et al., 2023a)...GQA (Hudson & Manning, 2019)...VizWiz (Gurari et al., 2018)...MME (Fu et al., 2023)...MMBench (Liu et al., 2025b)...POPE (Li et al., 2023c)...WikiText-103 (Merity et al., 2016) and MiniPile (Kaddour, 2023) datasets
Dataset Splits | Yes | Following the Baseline method, we randomly select 500 images from the validation set of COCO 2014 and prompt various MLLMs...The evaluation is conducted across three distinct splits: the random split, where objects are randomly selected from the entire dataset; the popular split, which evaluates the recognition of frequently occurring objects; and the adversarial split, which assesses the model's ability to detect objects closely related to those present in the image.
Hardware Specification | Yes | Experiments are performed on NVIDIA H20/H100 GPUs.
Software Dependencies | No | Algorithm 1: Pseudo-code of TAME in a PyTorch-like style. The paper mentions PyTorch but does not specify a version number for it or any other key software components.
Experiment Setup | Yes | Basically, the hyperparameter gamma of TAME is set to the default value of 1. Other parameters use the default settings, same as the Baseline...To ensure a fair evaluation, we impose two different maximum token limits, as the length of generated sequences can significantly affect CHAIR scores (CS and CI).
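The pseudocode entry above cites Algorithm 1, a PyTorch-like description of TAME. As a rough, hypothetical illustration only (not the paper's actual procedure), the general idea of intervening on anchor tokens can be sketched as dampening attention weights that sit far above the row mean and renormalizing; the function name, the single-row interface, and the use of `gamma` as a shrinkage strength are all assumptions made for this sketch:

```python
def dampen_anchor_tokens(attn_row, gamma=1.0):
    """Hypothetical sketch of an anchor-token intervention.

    Weights above the row mean (anchor candidates) are pulled toward
    the mean by a gamma-controlled amount; the row is then renormalized
    to sum to 1. gamma=0 leaves the distribution unchanged.
    This is an illustration, not the paper's TAME algorithm.
    """
    mean = sum(attn_row) / len(attn_row)
    # shrink only the disproportionately large weights toward the mean
    adjusted = [w - gamma * (w - mean) if w > mean else w
                for w in attn_row]
    # renormalize so the row is still a probability distribution
    total = sum(adjusted)
    return [w / total for w in adjusted]
```

For example, a row `[0.7, 0.1, 0.1, 0.1]` with `gamma=0.5` has its dominant weight reduced while the row still sums to 1, mirroring the qualitative effect the paper attributes to reducing eigenspectrum concentration.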