Enhancing Uncertainty Modeling with Semantic Graph for Hallucination Detection
Authors: Kedi Chen, Qin Chen, Jie Zhou, Xinqi Tao, Bowen Ding, Jingwen Xie, Mingchen Xie, Peilong Li, Feng Zheng
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two datasets show the great advantages of our proposed approach. In particular, we obtain substantial improvements with 19.78% in passage-level hallucination detection. We perform experiments on two datasets, namely the well-known WikiBio (Manakul, Liusie, and Gales 2023) and our constructed NoteSum. The results show the great superiority of our approach in both sentence-level and passage-level hallucination detection. We conduct elaborate analyses of the experimental results on two benchmark datasets, and provide a better understanding of the effectiveness of our approach. Ablation Studies: We conduct ablation studies on WikiBio with LLaMA-30B from three dimensions: token, sentence, and passage. Experimental results are shown in Table 3. |
| Researcher Affiliation | Collaboration | Kedi Chen (1*), Qin Chen (1*), Jie Zhou (1), Xinqi Tao (2), Bowen Ding (2), Jingwen Xie (2), Mingchen Xie (2), Peilong Li (2), Feng Zheng (2); 1: East China Normal University; 2: Xiaohongshu Inc. |
| Pseudocode | No | The paper describes the methodology in narrative text and uses formulas (e.g., U(t_i^j), I_o, U_E(i), U_G(i), U_s(i), U_p) and an overall framework diagram (Figure 2). It does not contain a formally structured pseudocode or algorithm block. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository. |
| Open Datasets | Yes | We conduct extensive experiments on two datasets for hallucination detection. One is currently the latest and most widely used dataset, WikiBio. WikiBio (Manakul, Liusie, and Gales 2023) is a dataset derived from Wikipedia biographies. |
| Dataset Splits | No | The paper describes the annotation of sentences within the datasets (Factual, Non Fact*, Non Fact) and provides statistics (Table 1) but does not specify how the datasets were split into training, validation, or test sets for the experiments. |
| Hardware Specification | No | The paper mentions using specific models like LLaMA-13B and LLaMA-30B and a DeBERTa-v3-Large NLI model, but it does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions several software tools and models, including a 'transition-based AMR parser (Xu, Lee, and Huang 2023)', 'spaCy' for coreference resolution and entity linking, 'DeBERTa-v3-Large (He, Gao, and Chen 2023) NLI model', and 'LLaMA-13B and LLaMA-30B models'. However, it does not specify version numbers for general software dependencies like Python, specific library versions, or the exact version of spaCy used. |
| Experiment Setup | Yes | The hyper-parameters α, β, λ, and k are set to 0.8, 0.65, 0.7, and 3 respectively. |
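The reported hyper-parameter values can be wired into a reproduction script. The sketch below is purely illustrative: the linear-interpolation form of the combination function is an assumption for demonstration, not the authors' actual formulation, which is defined by the paper's formulas for U_E(i), U_G(i), U_s(i), and U_p.

```python
# Hypothetical sketch of wiring the reported hyper-parameters into a
# reproduction config. Only the values (0.8, 0.65, 0.7, 3) come from the
# paper; the combination function below is an illustrative assumption.
ALPHA, BETA, LAMBDA, K = 0.8, 0.65, 0.7, 3


def combine_sentence_uncertainty(u_entropy: float, u_graph: float,
                                 lam: float = LAMBDA) -> float:
    """Linearly interpolate two sentence-level uncertainty scores.

    Illustrative only: the paper's U_s(i) may combine U_E(i) and U_G(i)
    differently; consult its formulas before reusing this.
    """
    return lam * u_entropy + (1 - lam) * u_graph


# Example: with lam = 0.7, scores 0.9 and 0.4 blend to roughly 0.75.
score = combine_sentence_uncertainty(0.9, 0.4)
print(round(score, 4))
```

Pinning the hyper-parameters as named constants at the top of a script makes it easy to verify a reproduction run against the values the paper reports.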