G2LDetect: A Global-to-Local Approach for Hallucination Detection
Authors: Xiaoxia Cheng, Zeqi Tan, Zhe Zheng, Weiming Lu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our global-to-local method outperforms existing methods, especially for longer texts. Experimental results on hallucination benchmarks, including HaluEval (Li et al. 2023), TRUE (Honovich et al. 2022), and datasets across different domains show that our approach outperforms previous methods, especially for longer texts. |
| Researcher Affiliation | Academia | College of Computer Science and Technology, Zhejiang University |
| Pseudocode | No | The paper describes algorithms like the "path-wise identification algorithm" in text, but it does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Code: https://github.com/hustcxx/G2LDetect |
| Open Datasets | Yes | We conduct experiments on four hallucination detection datasets: HaluEval-Summary in the HaluEval benchmark (Li et al. 2023), QAGS-CNNDM and FEVER in the TRUE benchmark (Honovich et al. 2022), and SCIFACT (Wadden et al. 2022). |
| Dataset Splits | No | In our experiments, due to the resource limitations associated with using LLMs, we sample a portion from each dataset, following previous methods (Wei et al. 2022). For the HaluEval-Summary dataset, we extract a sample at a ratio of one-tenth. In the QAGS-CNNDM dataset, we only remove samples that contain sensitive vocabulary. For the FEVER dataset, we select only those samples where the text to be detected exceeds 20 tokens. For SCIFACT, we exclude samples from the original dataset where the reference documents are absent. The paper describes sampling and filtering criteria for the datasets but does not provide specific train/validation/test splits (e.g., percentages, sample counts, or references to predefined splits). |
| Hardware Specification | No | The paper mentions using specific LLMs such as LLAMA3-8B-Instruct, ChatGPT (GPT-3.5-Turbo-0613), and GPT-4 (GPT-4-0613) for experiments, but it does not provide any specific hardware details such as GPU/CPU models, processor types, or memory used. |
| Software Dependencies | No | In our paper, the LLMs used in global representation and detection include LLAMA3-8B-Instruct (Meta 2024), ChatGPT (GPT-3.5-Turbo-0613) (OpenAI 2022) and GPT-4 (GPT-4-0613) (OpenAI 2024). To ensure reproducibility, for the parts involving the large model, we configure all models with the top-p parameter as 1.0 and temperature as 0.0. The paper mentions the specific Large Language Models used and their configuration parameters, but it does not list other ancillary software dependencies like programming languages, libraries, or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | To ensure reproducibility, for the parts involving the large model, we configure all models with the top-p parameter as 1.0 and temperature as 0.0. The baseline methods and our G2LDetect both adopt a zero-shot setting to counteract the potential randomness associated with demonstrations in a few-shot setting. |
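The decoding setup quoted in the last two rows (top-p = 1.0, temperature = 0.0, zero-shot prompting with no demonstrations) can be sketched as a request builder for a chat-completion API. Only the sampling parameters and the model name come from the paper; the `build_detection_request` helper and the prompt text are illustrative assumptions, not the authors' code.

```python
def build_detection_request(model: str, prompt: str) -> dict:
    """Assemble chat-completion parameters using the paper's reported
    decoding settings: top_p = 1.0 and temperature = 0.0 (greedy, for
    reproducibility). Zero-shot: the prompt carries no demonstrations."""
    return {
        "model": model,
        # A single user turn; the detection prompt itself is hypothetical.
        "messages": [{"role": "user", "content": prompt}],
        "top_p": 1.0,       # no nucleus truncation
        "temperature": 0.0,  # deterministic decoding
    }

# Example with one of the model identifiers listed in the report.
req = build_detection_request(
    "gpt-3.5-turbo-0613",
    "Does the following summary contain hallucinations? ...",
)
```

The same parameter dict would be passed unchanged to whichever client wraps LLAMA3-8B-Instruct, GPT-3.5-Turbo-0613, or GPT-4-0613, so the decoding configuration stays identical across all three models.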