G2LDetect: A Global-to-Local Approach for Hallucination Detection

Authors: Xiaoxia Cheng, Zeqi Tan, Zhe Zheng, Weiming Lu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our global-to-local method outperforms existing methods, especially for longer texts. Experimental results on hallucination benchmarks, including HaluEval (Li et al. 2023), TRUE (Honovich et al. 2022), and datasets across different domains show that our approach outperforms previous methods, especially for longer texts.
Researcher Affiliation | Academia | College of Computer Science and Technology, Zhejiang University
Pseudocode | No | The paper describes algorithms such as the "path-wise identification algorithm" in prose, but it does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Code: https://github.com/hustcxx/G2LDetect
Open Datasets | Yes | We conduct experiments on four hallucination detection datasets: HaluEval-Summary in the HaluEval benchmark (Li et al. 2023), QAGS-CNNDM and FEVER in the TRUE benchmark (Honovich et al. 2022), and SCIFACT (Wadden et al. 2022).
Dataset Splits | No | In our experiments, due to the resource limitations associated with using LLMs, we sample a portion from each dataset, following previous methods (Wei et al. 2022). For the HaluEval-Summary dataset, we extract a sample at a ratio of one-tenth. In the QAGS-CNNDM dataset, we only remove samples that contain sensitive vocabulary. For the FEVER dataset, we select only those samples where the text to be detected exceeds 20 tokens. For SCIFACT, we exclude samples from the original dataset where the reference documents are absent. The paper describes sampling and filtering criteria for the datasets, but it does not provide specific train/validation/test splits (e.g., percentages, sample counts, or references to predefined splits).
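The sampling and filtering criteria quoted above can be sketched as a short preprocessing script. This is a minimal illustration, not the authors' code: the field names (`text`, `references`), the random seed, and the helper names are assumptions.

```python
import random


def subsample(samples, ratio=0.1, seed=0):
    """Draw a fixed fraction of a dataset (e.g., one-tenth of HaluEval-Summary)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    k = max(1, int(len(samples) * ratio))
    return rng.sample(samples, k)


def filter_long_texts(samples, min_tokens=20):
    """Keep only samples whose detected text exceeds 20 tokens (FEVER criterion)."""
    return [s for s in samples if len(s["text"].split()) > min_tokens]


def drop_missing_references(samples):
    """Exclude samples whose reference documents are absent (SCIFACT criterion)."""
    return [s for s in samples if s.get("references")]
```

Note that even with these criteria made explicit, the absence of published train/validation/test splits means the exact evaluation sets cannot be recovered without the released code.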
Hardware Specification | No | The paper mentions using specific LLMs such as LLAMA3-8B-Instruct, ChatGPT (GPT-3.5-Turbo-0613), and GPT-4 (GPT-4-0613) for experiments, but it does not provide any specific hardware details such as GPU/CPU models, processor types, or memory used.
Software Dependencies | No | In our paper, the LLMs used in global representation and detection include LLAMA3-8B-Instruct (Meta 2024), ChatGPT (GPT-3.5-Turbo-0613) (OpenAI 2022), and GPT-4 (GPT-4-0613) (OpenAI 2024). To ensure reproducibility, for the parts involving the large model, we configure all models with the top-p parameter as 1.0 and temperature as 0.0. The paper mentions the specific LLMs used and their configuration parameters, but it does not list other ancillary software dependencies such as programming languages, libraries, or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | To ensure reproducibility, for the parts involving the large model, we configure all models with the top-p parameter as 1.0 and temperature as 0.0. The baseline methods and our G2LDetect both adopt a zero-shot setting to counteract the potential randomness associated with demonstrations in a few-shot setting.
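The reported setup (temperature 0.0, top-p 1.0, zero-shot prompting) could be reproduced with something like the sketch below. The prompt wording and function name are hypothetical placeholders, and the commented-out OpenAI-style call is illustrative rather than the authors' actual pipeline.

```python
# Deterministic decoding configuration matching the paper's reported settings:
# greedy sampling via temperature 0.0, with top_p left at 1.0.
DECODING_CONFIG = {"temperature": 0.0, "top_p": 1.0}


def build_zero_shot_prompt(document: str, claim: str) -> str:
    """Zero-shot prompt: task instruction only, no demonstrations.

    The exact wording here is an assumption for illustration.
    """
    return (
        "Given the reference document below, answer Yes or No: "
        "does the claim contain hallucinated content?\n\n"
        f"Document: {document}\n"
        f"Claim: {claim}\n"
        "Answer:"
    )


# A call against an OpenAI-compatible API would then look like:
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo-0613",
#     messages=[{"role": "user", "content": build_zero_shot_prompt(doc, claim)}],
#     **DECODING_CONFIG,
# )
```

Even with temperature 0.0, hosted-model outputs are not strictly guaranteed to be bit-identical across runs, which is one reason pinned model versions (e.g., GPT-3.5-Turbo-0613) matter for reproducibility.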