Intervening Anchor Token: Decoding Strategy in Alleviating Hallucinations for MLLMs

Authors: Feilong Tang, Zile Huang, Chengzhi Liu, Qiang Sun, Harry Yang, Ser-Nam Lim

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments reveal a correlation between the eigenspectrum and hallucinations across various MLLMs and show that TAME reduces the percentage of hallucinated objects. Code released at https://github.com/Everlyn-Labs/ANTRP.
Researcher Affiliation | Collaboration | Feilong Tang (HKUST, Everlyn AI), Zile Huang (HKUST, Everlyn AI), Chengzhi Liu (University of Liverpool), Qiang Sun (Everlyn AI, University of Toronto), Harry Yang (HKUST, Everlyn AI), Ser-Nam Lim (Everlyn AI, UCF)
Pseudocode | Yes | Algorithm 1: Pseudo-code of TAME in a PyTorch-like style.
Open Source Code | Yes | Code released at https://github.com/Everlyn-Labs/ANTRP.
Open Datasets | Yes | We perform the CHAIR evaluation on the MSCOCO dataset (Lin et al., 2014)...HalluBench (Zhao et al., 2023) represents a more advanced benchmark, utilizing detailed object-level descriptions from the VG dataset (Krishna et al., 2017)...SEED-Bench (Li et al., 2023a)...GQA (Hudson & Manning, 2019)...VizWiz (Gurari et al., 2018)...MME (Fu et al., 2023)...MMBench (Liu et al., 2025b)...POPE (Li et al., 2023c)...WikiText-103 (Merity et al., 2016) and MiniPile (Kaddour, 2023) datasets
Dataset Splits | Yes | Following the Baseline method, we randomly select 500 images from the validation set of COCO 2014 and prompt various MLLMs...The evaluation is conducted across three distinct splits: the random split, where objects are randomly selected from the entire dataset; the popular split, which evaluates the recognition of frequently occurring objects; and the adversarial split, which assesses the model's ability to detect objects closely related to those present in the image.
Hardware Specification | Yes | Experiments are performed on NVIDIA H20/H100 GPUs.
Software Dependencies | No | Algorithm 1: Pseudo-code of TAME in a PyTorch-like style. The paper mentions PyTorch but does not specify a version number for it or any other key software components.
Experiment Setup | Yes | Basically, the hyperparameter gamma of TAME is set to the default value of 1. Other parameters use the default settings, same as the Baseline...To ensure a fair evaluation, we impose two different maximum token limits, as the length of generated sequences can significantly affect CHAIR scores (CS and CI).
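The pseudocode entry above cites Algorithm 1, a PyTorch-like description of TAME. As a rough, hypothetical illustration only (not the paper's actual procedure), the general idea of intervening on anchor tokens can be sketched as dampening attention weights that sit far above the row mean and renormalizing; the function name, the single-row interface, and the use of `gamma` as a shrinkage strength are all assumptions made for this sketch:

```python
def dampen_anchor_tokens(attn_row, gamma=1.0):
    """Hypothetical sketch of an anchor-token intervention.

    Weights above the row mean (anchor candidates) are pulled toward
    the mean by a gamma-controlled amount; the row is then renormalized
    to sum to 1. gamma=0 leaves the distribution unchanged.
    This is an illustration, not the paper's TAME algorithm.
    """
    mean = sum(attn_row) / len(attn_row)
    # shrink only the disproportionately large weights toward the mean
    adjusted = [w - gamma * (w - mean) if w > mean else w
                for w in attn_row]
    # renormalize so the row is still a probability distribution
    total = sum(adjusted)
    return [w / total for w in adjusted]
```

For example, a row `[0.7, 0.1, 0.1, 0.1]` with `gamma=0.5` has its dominant weight reduced while the row still sums to 1, mirroring the qualitative effect the paper attributes to reducing eigenspectrum concentration.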