HalluEntity: Benchmarking and Understanding Entity-Level Hallucination Detection

Authors: Min-Hsuan Yeh, Max Kamachee, Seongheon Park, Yixuan Li

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. LLM Response: "Based on the dataset, we comprehensively evaluate uncertainty-based hallucination detection approaches across 17 modern LLMs. Our experimental results show that uncertainty estimation approaches focusing on individual token probabilities tend to over-predict hallucinations, while context-aware methods show better but still suboptimal performance."
Researcher Affiliation: Academia. LLM Response: Min-Hsuan Yeh (EMAIL), University of Wisconsin-Madison; Max Kamachee (EMAIL), University of Wisconsin-Madison; Seongheon Park (EMAIL), University of Wisconsin-Madison; Yixuan Li (EMAIL), University of Wisconsin-Madison.
Pseudocode: No. LLM Response: The paper describes five uncertainty calculation methods (Likelihood, Entropy, Claim-Conditioned Probability (CCP), Shifting Attention to Relevance (SAR), Focus) using mathematical formulas and textual descriptions, but it does not include any explicit pseudocode blocks or algorithms.
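Since the paper provides no pseudocode, a minimal sketch of the two simplest baselines (Likelihood and Entropy) may help orient readers. This assumes access to per-token probabilities from the generating model; the function names and the token-averaging choice are illustrative, not the paper's exact formulation:

```python
import math

def likelihood_uncertainty(token_probs):
    """Mean negative log-probability of an entity's tokens.

    Higher values mean the model was less confident when generating
    the entity, which the Likelihood baseline treats as a
    hallucination signal.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def entropy_uncertainty(token_dists):
    """Mean Shannon entropy of each token's next-token distribution."""
    def entropy(dist):
        return -sum(p * math.log(p) for p in dist if p > 0)
    return sum(entropy(d) for d in token_dists) / len(token_dists)

# An entity generated with high token probabilities scores lower
# (more confident) than one generated with low probabilities, so a
# simple threshold on the score yields a hallucination prediction.
confident = likelihood_uncertainty([0.9, 0.95])
uncertain = likelihood_uncertainty([0.3, 0.4])
```

CCP, SAR, and Focus build on these token-level quantities with NLI-based claim conditioning, relevance re-weighting, and keyword focusing respectively, as described in the paper.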
Open Source Code: No. LLM Response: The paper provides a link to the HalluEntity dataset: 'HalluEntity: https://huggingface.co/datasets/samuelyeh/HalluEntity', but it does not provide concrete access to the source code for the methodology described in the paper.
Open Datasets: Yes. LLM Response: "To address this limitation, we explore entity-level hallucination detection. We propose a new dataset, HalluEntity, which annotates hallucination at the entity level. ... HalluEntity: https://huggingface.co/datasets/samuelyeh/HalluEntity ... HalluEntity is publicly released under the MIT license."
Dataset Splits: No. LLM Response: "HalluEntity comprises 157 instances containing a total of 18,785 entities, with 5,452 unique entities. ... We categorize HalluEntity into three groups based on the hallucination rate, i.e., the proportion of hallucinated entities in each generation." The paper provides statistics and categorizations of the dataset for analysis, but it does not specify explicit training, validation, or test splits for reproducing model training.
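The grouping criterion quoted above can be sketched in a few lines. Note that the group cutoffs below are hypothetical placeholders, since the excerpt does not state the paper's actual boundaries:

```python
def hallucination_rate(entity_labels):
    """Proportion of entities annotated as hallucinated in one generation.

    entity_labels: iterable of booleans, True = entity is hallucinated.
    """
    labels = list(entity_labels)
    return sum(labels) / len(labels)

def rate_group(rate, low=1 / 3, high=2 / 3):
    """Assign a generation to a low/medium/high hallucination-rate group.

    The cutoffs (1/3, 2/3) are illustrative assumptions, not the
    paper's reported thresholds.
    """
    if rate < low:
        return "low"
    if rate < high:
        return "medium"
    return "high"

# e.g. a generation with 1 hallucinated entity out of 4:
rate = hallucination_rate([True, False, False, False])  # 0.25
group = rate_group(rate)  # "low" under the illustrative cutoffs
```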
Hardware Specification: Yes. LLM Response: "We conducted all experiments on a server equipped with eight Nvidia A100 GPUs."
Software Dependencies: No. LLM Response: "In our experiment, we use the top 10 alternatives and use DeBERTa-base (He et al., 2021) as the NLI model. ... Following Duan et al. (2024), we use Sentence-BERT (Reimers & Gurevych, 2019) with RoBERTa-large (Liu et al., 2019) for embedding extraction. ... K is the keyword set identified by spaCy (Honnibal & Montani, 2017)." The paper mentions specific software components like DeBERTa-base, Sentence-BERT, RoBERTa-large, and spaCy, but it does not provide specific version numbers for these software dependencies.
Experiment Setup: Yes. LLM Response: "In our experiment, we use the top 10 alternatives and use DeBERTa-base (He et al., 2021) as the NLI model. ... Following Zhang et al. (2023b), the token IDF is calculated based on 1M documents sampled from the RedPajama dataset (Weber et al., 2024), and the hyperparameter γ for p_i is set to be 0.9."
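The token-IDF statistic referenced above (used by the Focus method to down-weight common tokens) can be sketched as follows. The 1M-document RedPajama sample is replaced here by a toy corpus, and the add-one smoothing is an assumption rather than the paper's exact formula:

```python
import math
from collections import Counter

def token_idf(documents):
    """Inverse document frequency per token over a corpus sample.

    documents: iterable of tokenized documents (lists of tokens).
    Tokens appearing in many documents get low IDF; rare tokens get
    high IDF. Add-one smoothing is an illustrative assumption.
    """
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each token once per document
    n = len(documents)
    return {tok: math.log((n + 1) / (df[tok] + 1)) for tok in df}

docs = [["the", "capital", "of", "France"], ["the", "Eiffel", "Tower"]]
idf = token_idf(docs)
# "the" appears in every document, so its IDF is lower than "France"'s.
```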