Enhancing Uncertainty Modeling with Semantic Graph for Hallucination Detection

Authors: Kedi Chen, Qin Chen, Jie Zhou, Xinqi Tao, Bowen Ding, Jingwen Xie, Mingchen Xie, Peilong Li, Feng Zheng

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on two datasets show the great advantages of our proposed approach. In particular, we obtain substantial improvements of 19.78% in passage-level hallucination detection. We perform experiments on two datasets, namely the well-known WikiBio (Manakul, Liusie, and Gales 2023) and our constructed NoteSum. The results show the great superiority of our approach in both sentence-level and passage-level hallucination detection. We conduct elaborate analyses of the experimental results on two benchmark datasets, and provide a better understanding of the effectiveness of our approach. Ablation studies: we conduct ablation studies on WikiBio with LLaMA-30B from three dimensions: token, sentence, and passage. Experimental results are shown in Table 3.
Researcher Affiliation | Collaboration | Kedi Chen1*, Qin Chen1*, Jie Zhou1, Xinqi Tao2, Bowen Ding2, Jingwen Xie2, Mingchen Xie2, Peilong Li2, Feng Zheng2. 1East China Normal University; 2Xiaohongshu Inc.
Pseudocode | No | The paper describes the methodology in narrative text and uses formulas (e.g., U(t_i^j), I_o, U_E(i), U_G(i), U_s(i), U_p) and an overall framework diagram (Figure 2). It does not contain a formally structured pseudocode or algorithm block.
Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository.
Open Datasets | Yes | We conduct extensive experiments on two datasets for hallucination detection. One is currently the latest and most widely used dataset, WikiBio. WikiBio (Manakul, Liusie, and Gales 2023) is a dataset derived from Wikipedia biographies.
Dataset Splits | No | The paper describes the annotation of sentences within the datasets (Factual, NonFact*, NonFact) and provides statistics (Table 1) but does not specify how the datasets were split into training, validation, or test sets for the experiments.
Hardware Specification | No | The paper mentions using specific models like LLaMA-13B and LLaMA-30B and a DeBERTa-v3-Large NLI model, but it does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions several software tools and models, including a 'transition-based AMR parser (Xu, Lee, and Huang 2023)', 'spaCy' for coreference resolution and entity linking, a 'DeBERTa-v3-Large (He, Gao, and Chen 2023) NLI model', and 'LLaMA-13B and LLaMA-30B models'. However, it does not specify version numbers for general software dependencies such as Python, specific library versions, or the exact version of spaCy used.
Experiment Setup | Yes | The hyper-parameters α, β, λ, and k are set to 0.8, 0.65, 0.7, and 3, respectively.
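Since the paper provides no pseudocode (see the Pseudocode row), the following Python sketch is purely illustrative of how its notation might fit together: a per-token uncertainty U(t_i^j), a sentence-level combination of U_E(i) and U_G(i), and a passage-level score U_p. The negative-log-probability form, the convex combination, and the mean aggregation are all assumptions made here for illustration; only the constants α=0.8, β=0.65, λ=0.7, k=3 come from the paper's experiment setup.

```python
import math

# Constants reported in the paper's experiment setup; everything else
# in this sketch is an assumed form, not the authors' actual method.
CONFIG = {"alpha": 0.8, "beta": 0.65, "lambda": 0.7, "k": 3}


def token_uncertainty(prob: float) -> float:
    # Assumed token-level uncertainty U(t_i^j): negative log-probability.
    return -math.log(prob)


def sentence_uncertainty(u_entropy: float, u_graph: float,
                         lam: float = CONFIG["lambda"]) -> float:
    # Assumed combination U_s(i) = lam * U_E(i) + (1 - lam) * U_G(i),
    # blending an uncertainty score with a semantic-graph score.
    return lam * u_entropy + (1.0 - lam) * u_graph


def passage_uncertainty(sentence_scores: list[float]) -> float:
    # Assumed passage-level score U_p: mean over sentence scores.
    return sum(sentence_scores) / len(sentence_scores)
```

For example, `sentence_uncertainty(0.5, 0.3)` yields 0.7 * 0.5 + 0.3 * 0.3 = 0.44 under the assumed λ = 0.7.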