HalluEntity: Benchmarking and Understanding Entity-Level Hallucination Detection

Authors: Min-Hsuan Yeh, Max Kamachee, Seongheon Park, Yixuan Li

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. LLM Response: "Based on the dataset, we comprehensively evaluate uncertainty-based hallucination detection approaches across 17 modern LLMs. Our experimental results show that uncertainty estimation approaches focusing on individual token probabilities tend to over-predict hallucinations, while context-aware methods show better but still suboptimal performance."
Researcher Affiliation: Academia. LLM Response: Min-Hsuan Yeh (EMAIL), University of Wisconsin-Madison; Max Kamachee (EMAIL), University of Wisconsin-Madison; Seongheon Park (EMAIL), University of Wisconsin-Madison; Yixuan Li (EMAIL), University of Wisconsin-Madison.
Pseudocode: No. LLM Response: The paper describes five uncertainty calculation methods (Likelihood, Entropy, Claim-Conditioned Probability (CCP), Shifting Attention to Relevance (SAR), Focus) using mathematical formulas and textual descriptions, but it does not include any explicit pseudocode blocks or algorithms.
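Since the paper provides no pseudocode, a minimal sketch of the two simplest baselines (Likelihood and Entropy) may help orient readers. This assumes access to per-token probabilities from the generating model; the function names and the token-averaging choice are illustrative, not the paper's exact formulation:

```python
import math

def likelihood_uncertainty(token_probs):
    """Mean negative log-probability of an entity's tokens.

    Higher values mean the model was less confident when generating
    the entity, which the Likelihood baseline treats as a
    hallucination signal.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def entropy_uncertainty(token_dists):
    """Mean Shannon entropy of each token's next-token distribution."""
    def entropy(dist):
        return -sum(p * math.log(p) for p in dist if p > 0)
    return sum(entropy(d) for d in token_dists) / len(token_dists)

# An entity generated with high token probabilities scores lower
# (more confident) than one generated with low probabilities, so a
# simple threshold on the score yields a hallucination prediction.
confident = likelihood_uncertainty([0.9, 0.95])
uncertain = likelihood_uncertainty([0.3, 0.4])
```

CCP, SAR, and Focus build on these token-level quantities with NLI-based claim conditioning, relevance re-weighting, and keyword focusing respectively, as described in the paper.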
Open Source Code: No. LLM Response: The paper provides a link to the HalluEntity dataset: 'HalluEntity: https://huggingface.co/datasets/samuelyeh/HalluEntity', but it does not provide concrete access to the source code for the methodology described in the paper.
Open Datasets: Yes. LLM Response: "To address this limitation, we explore entity-level hallucination detection. We propose a new dataset, HalluEntity, which annotates hallucination at the entity level. ... HalluEntity: https://huggingface.co/datasets/samuelyeh/HalluEntity ... HalluEntity is publicly released under the MIT license."
Dataset Splits: No. LLM Response: "HalluEntity comprises 157 instances containing a total of 18,785 entities, with 5,452 unique entities. ... We categorize HalluEntity into three groups based on the hallucination rate, i.e., the proportion of hallucinated entities in each generation." The paper provides statistics and categorizations of the dataset for analysis, but it does not specify explicit training, validation, or test splits for reproducing model training.
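The grouping criterion quoted above can be sketched in a few lines. Note that the group cutoffs below are hypothetical placeholders, since the excerpt does not state the paper's actual boundaries:

```python
def hallucination_rate(entity_labels):
    """Proportion of entities annotated as hallucinated in one generation.

    entity_labels: iterable of booleans, True = entity is hallucinated.
    """
    labels = list(entity_labels)
    return sum(labels) / len(labels)

def rate_group(rate, low=1 / 3, high=2 / 3):
    """Assign a generation to a low/medium/high hallucination-rate group.

    The cutoffs (1/3, 2/3) are illustrative assumptions, not the
    paper's reported thresholds.
    """
    if rate < low:
        return "low"
    if rate < high:
        return "medium"
    return "high"

# e.g. a generation with 1 hallucinated entity out of 4:
rate = hallucination_rate([True, False, False, False])  # 0.25
group = rate_group(rate)  # "low" under the illustrative cutoffs
```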
Hardware Specification: Yes. LLM Response: "We conducted all experiments on a server equipped with eight Nvidia A100 GPUs."
Software Dependencies: No. LLM Response: "In our experiment, we use the top 10 alternatives and use DeBERTa-base (He et al., 2021) as the NLI model. ... Following Duan et al. (2024), we use Sentence-BERT (Reimers & Gurevych, 2019) with RoBERTa-large (Liu et al., 2019) for embedding extraction. ... K is the keyword set identified by spaCy (Honnibal & Montani, 2017)." The paper mentions specific software components like DeBERTa-base, Sentence-BERT, RoBERTa-large, and spaCy, but it does not provide specific version numbers for these software dependencies.
Experiment Setup: Yes. LLM Response: "In our experiment, we use the top 10 alternatives and use DeBERTa-base (He et al., 2021) as the NLI model. ... Following Zhang et al. (2023b), the token IDF is calculated based on 1M documents sampled from the RedPajama dataset (Weber et al., 2024), and the hyperparameter γ for p_i is set to be 0.9."
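The token-IDF statistic referenced above (used by the Focus method to down-weight common tokens) can be sketched as follows. The 1M-document RedPajama sample is replaced here by a toy corpus, and the add-one smoothing is an assumption rather than the paper's exact formula:

```python
import math
from collections import Counter

def token_idf(documents):
    """Inverse document frequency per token over a corpus sample.

    documents: iterable of tokenized documents (lists of tokens).
    Tokens appearing in many documents get low IDF; rare tokens get
    high IDF. Add-one smoothing is an illustrative assumption.
    """
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each token once per document
    n = len(documents)
    return {tok: math.log((n + 1) / (df[tok] + 1)) for tok in df}

docs = [["the", "capital", "of", "France"], ["the", "Eiffel", "Tower"]]
idf = token_idf(docs)
# "the" appears in every document, so its IDF is lower than "France"'s.
```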