G2LDetect: A Global-to-Local Approach for Hallucination Detection

Authors: Xiaoxia Cheng, Zeqi Tan, Zhe Zheng, Weiming Lu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our global-to-local method outperforms existing methods, especially for longer texts. Experimental results on hallucination benchmarks, including HaluEval (Li et al. 2023), TRUE (Honovich et al. 2022), and datasets across different domains show that our approach outperforms previous methods, especially for longer texts.
Researcher Affiliation | Academia | College of Computer Science and Technology, Zhejiang University
Pseudocode | No | The paper describes algorithms such as the "path-wise identification algorithm" in prose, but it does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Code: https://github.com/hustcxx/G2LDetect
Open Datasets | Yes | We conduct experiments on four hallucination detection datasets: HaluEval-Summary in the HaluEval benchmark (Li et al. 2023), QAGS-CNNDM and FEVER in the TRUE benchmark (Honovich et al. 2022), and SCIFACT (Wadden et al. 2022).
Dataset Splits | No | In our experiments, due to the resource limitations associated with using LLMs, we sample a portion from each dataset, following previous methods (Wei et al. 2022). For the HaluEval-Summary dataset, we extract a sample at a ratio of one-tenth. In the QAGS-CNNDM dataset, we only remove samples that contain sensitive vocabulary. For the FEVER dataset, we select only those samples where the text to be detected exceeds 20 tokens. For SCIFACT, we exclude samples from the original dataset where the reference documents are absent. The paper describes sampling and filtering criteria for the datasets, but it does not provide specific train/validation/test splits (e.g., percentages, sample counts, or references to predefined splits).
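The sampling and filtering criteria quoted above can be sketched as a short preprocessing script. This is a minimal illustration, not the authors' code: the field names (`text`, `references`), the random seed, and the helper names are assumptions.

```python
import random


def subsample(samples, ratio=0.1, seed=0):
    """Draw a fixed fraction of a dataset (e.g., one-tenth of HaluEval-Summary)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    k = max(1, int(len(samples) * ratio))
    return rng.sample(samples, k)


def filter_long_texts(samples, min_tokens=20):
    """Keep only samples whose detected text exceeds 20 tokens (FEVER criterion)."""
    return [s for s in samples if len(s["text"].split()) > min_tokens]


def drop_missing_references(samples):
    """Exclude samples whose reference documents are absent (SCIFACT criterion)."""
    return [s for s in samples if s.get("references")]
```

Note that even with these criteria made explicit, the absence of published train/validation/test splits means the exact evaluation sets cannot be recovered without the released code.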
Hardware Specification | No | The paper mentions using specific LLMs such as LLAMA3-8B-Instruct, ChatGPT (GPT-3.5-Turbo-0613), and GPT-4 (GPT-4-0613) for experiments, but it does not provide any specific hardware details such as GPU/CPU models, processor types, or memory used.
Software Dependencies | No | In our paper, the LLMs used in global representation and detection include LLAMA3-8B-Instruct (Meta 2024), ChatGPT (GPT-3.5-Turbo-0613) (OpenAI 2022), and GPT-4 (GPT-4-0613) (OpenAI 2024). To ensure reproducibility, for the parts involving the large model, we configure all models with the top-p parameter as 1.0 and temperature as 0.0. The paper mentions the specific LLMs used and their configuration parameters, but it does not list other ancillary software dependencies such as programming languages, libraries, or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | To ensure reproducibility, for the parts involving the large model, we configure all models with the top-p parameter as 1.0 and temperature as 0.0. The baseline methods and our G2LDetect both adopt a zero-shot setting to counteract the potential randomness associated with demonstrations in a few-shot setting.
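The reported setup (temperature 0.0, top-p 1.0, zero-shot prompting) could be reproduced with something like the sketch below. The prompt wording and function name are hypothetical placeholders, and the commented-out OpenAI-style call is illustrative rather than the authors' actual pipeline.

```python
# Deterministic decoding configuration matching the paper's reported settings:
# greedy sampling via temperature 0.0, with top_p left at 1.0.
DECODING_CONFIG = {"temperature": 0.0, "top_p": 1.0}


def build_zero_shot_prompt(document: str, claim: str) -> str:
    """Zero-shot prompt: task instruction only, no demonstrations.

    The exact wording here is an assumption for illustration.
    """
    return (
        "Given the reference document below, answer Yes or No: "
        "does the claim contain hallucinated content?\n\n"
        f"Document: {document}\n"
        f"Claim: {claim}\n"
        "Answer:"
    )


# A call against an OpenAI-compatible API would then look like:
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo-0613",
#     messages=[{"role": "user", "content": build_zero_shot_prompt(doc, claim)}],
#     **DECODING_CONFIG,
# )
```

Even with temperature 0.0, hosted-model outputs are not strictly guaranteed to be bit-identical across runs, which is one reason pinned model versions (e.g., GPT-3.5-Turbo-0613) matter for reproducibility.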