HaDeMiF: Hallucination Detection and Mitigation in Large Language Models
Authors: Xiaoling Zhou, Mingjie Zhang, Zhemg Lee, Wei Ye, Shikun Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conclusively demonstrate the effectiveness of our framework in hallucination detection and model calibration across text generation tasks with responses of varying lengths. |
| Researcher Affiliation | Academia | Xiaoling Zhou1, Mingjie Zhang1, Zhemg Lee2, Wei Ye1, & Shikun Zhang1; 1Peking University, 2Tianjin University. Corresponding to EMAIL; EMAIL. |
| Pseudocode | No | The paper describes the optimization procedure using mathematical formulas (Equations (4), (5), and (6)) but does not present it in a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper mentions utilizing LoRA (Hu et al., 2022) with a GitHub link (https://github.com/microsoft/LoRA), but this refers to a third-party tool used, not the source code for the HADEMIF methodology described in this paper. |
| Open Datasets | Yes | Specifically, we utilize the CAT benchmark (Liu et al., 2024), which encompasses tasks with responses at the phrase, sentence, and paragraph levels. The phrase-level generation datasets include Natural Questions (NQ), SciQ, and TriviaQA, each of which features short responses, such as named entities. For sentence-level responses, we consider TruthfulQA and WikiQA, where the model outputs full sentences. For paragraph-level tasks, we incorporate BioGen and WikiGen (Liu et al., 2024). Dataset links: https://github.com/google-research-datasets/natural-questions ; https://huggingface.co/datasets/allenai/sciq ; https://nlp.cs.washington.edu/triviaqa/ ; https://github.com/sylinrl/TruthfulQA ; https://huggingface.co/datasets/microsoft/wiki_qa ; https://github.com/shmsw25/FActScore ; https://github.com/awslabs/fever |
| Dataset Splits | Yes | For the three phrase-level tasks, 1K samples are used for testing and 2K samples for training. For TruthfulQA, which lacks an official training set, 397 instances are randomly sampled from the original test set for training and the remaining instances are utilized for testing. For the WikiQA dataset, the training set consists of 1,040 instances, while the test set contains 293 instances. For BioGen, a total of 683 names are compiled from (Min et al., 2023), of which 183 names are designated for evaluation and the remaining 500 are utilized for training. Similarly, for the WikiGen task, 600 entities are randomly selected from the FEVER dataset, each linked to a specific Wikipedia passage. Of these, 100 entities are set aside for evaluation, while the remaining 500 are utilized for training. |
| Hardware Specification | No | To facilitate efficient fine-tuning of the LLMs, we utilize LoRA (Hu et al., 2022), which enables the fine-tuning process to be conducted on a single GPU. The paper does not specify the model or type of GPU, CPU, or any other hardware component. |
| Software Dependencies | No | The paper mentions using LoRA, but does not provide a specific version number for LoRA or any other key software libraries, frameworks, or programming languages with their versions. |
| Experiment Setup | Yes | The training process begins with an initial learning rate of 1×10⁻³ for both the MLP and D3T networks, which is reduced by a factor of 0.1 at the 20th and 40th epochs. Training is conducted for 50 epochs with early stopping. For fine-tuning the LLMs, the two hallucination detection networks are first trained for 40 epochs, after which an alternating optimization process is applied between the LLMs and the two detection networks. The LLMs are fine-tuned for 5 epochs using LoRA with a rank of 8 and a learning rate of 3×10⁻⁴. The MLP network is initialized using He initialization (He et al., 2015)... For the D3T model... all parameters are initialized using Xavier initialization (Glorot & Bengio, 2010) with a uniform distribution. |
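The reported hyperparameters in the experiment-setup row can be sketched as follows. This is a minimal pure-Python illustration of the stated schedule (initial learning rate 1×10⁻³, cut by 0.1 at epochs 20 and 40 over 50 epochs) and of the stated initialization schemes (He normal for the MLP, Xavier uniform for D3T); the layer sizes are arbitrary placeholders, and this is not the authors' implementation.

```python
import math
import random

def lr_at_epoch(epoch, base_lr=1e-3, milestones=(20, 40), gamma=0.1):
    """Step schedule from the paper: the LR is multiplied by `gamma`
    at each milestone epoch (the 20th and 40th of 50 total)."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)

def he_normal(fan_in, fan_out, seed=0):
    """He initialization (reported for the MLP network):
    weights ~ N(0, sqrt(2 / fan_in))."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def xavier_uniform(fan_in, fan_out, seed=0):
    """Xavier uniform initialization (reported for the D3T network):
    weights ~ U(-b, b) with b = sqrt(6 / (fan_in + fan_out))."""
    rng = random.Random(seed)
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)] for _ in range(fan_in)]

# LR trajectory over the 50 detection-network epochs: 1e-3, then 1e-4
# after epoch 20, then 1e-5 after epoch 40.
schedule = [lr_at_epoch(e) for e in range(50)]
print(schedule[0], schedule[20], schedule[40])
```

The LLM fine-tuning stage described in the same row would use its own constant rate of 3×10⁻⁴ with LoRA rank 8; those values are configured in the LoRA setup rather than in a schedule like the one above.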