HaDeMiF: Hallucination Detection and Mitigation in Large Language Models
Authors: Xiaoling Zhou, Mingjie Zhang, Zhemg Lee, Wei Ye, Shikun Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conclusively demonstrate the effectiveness of our framework in hallucination detection and model calibration across text generation tasks with responses of varying lengths. |
| Researcher Affiliation | Academia | Xiaoling Zhou1, Mingjie Zhang1, Zhemg Lee2, Wei Ye1, & Shikun Zhang1; 1Peking University, 2Tianjin University. Corresponding to EMAIL; EMAIL. |
| Pseudocode | No | The paper describes the optimization procedure using mathematical formulas (Equations (4), (5), and (6)) but does not present it in a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper mentions utilizing LoRA (Hu et al., 2022) with a GitHub link (https://github.com/microsoft/LoRA), but this refers to a third-party tool used, not the source code for the HADEMIF methodology described in this paper. |
| Open Datasets | Yes | Specifically, we utilize the CAT benchmark (Liu et al., 2024), which encompasses tasks with responses at the phrase, sentence, and paragraph levels. The phrase-level generation datasets include Natural Questions (NQ), SciQ, and TriviaQA, each of which features short responses, such as named entities. For sentence-level responses, we consider TruthfulQA and WikiQA, where the model outputs full sentences. For paragraph-level tasks, we incorporate BioGen and WikiGen (Liu et al., 2024). Dataset links: https://github.com/google-research-datasets/natural-questions ; https://huggingface.co/datasets/allenai/sciq ; https://nlp.cs.washington.edu/triviaqa/ ; https://github.com/sylinrl/TruthfulQA ; https://huggingface.co/datasets/microsoft/wiki_qa ; https://github.com/shmsw25/FActScore ; https://github.com/awslabs/fever |
| Dataset Splits | Yes | For the three phrase-level tasks, 1K samples are used for testing and 2K samples for training. For TruthfulQA, which lacks an official training set, 397 instances are randomly sampled from the original test set for training and the remaining instances are utilized for testing. For the WikiQA dataset, the training set consists of 1,040 instances, while the test set contains 293 instances. For BioGen, a total of 683 names are compiled from (Min et al., 2023), of which 183 names are designated for evaluation and the remaining 500 are utilized for training. Similarly, for the WikiGen task, 600 entities are randomly selected from the FEVER dataset, each linked to a specific Wikipedia passage. Of these, 100 entities are set aside for evaluation, while the remaining 500 are utilized for training. |
| Hardware Specification | No | To facilitate efficient fine-tuning of the LLMs, we utilize LoRA (Hu et al., 2022), which enables the fine-tuning process to be conducted on a single GPU. The paper does not specify the model or type of GPU, CPU, or any other hardware component. |
| Software Dependencies | No | The paper mentions using LoRA, but does not provide a specific version number for LoRA or any other key software libraries, frameworks, or programming languages with their versions. |
| Experiment Setup | Yes | The training process begins with an initial learning rate of 1×10⁻³ for both the MLP and D3T networks, which is reduced by a factor of 0.1 at the 20th and 40th epochs. Training is conducted for 50 epochs with early stopping. For fine-tuning the LLMs, the two hallucination detection networks are first trained for 40 epochs, after which an alternating optimization process is applied between the LLMs and the two detection networks. The LLMs are fine-tuned for 5 epochs using LoRA with a rank of 8 and a learning rate of 3×10⁻⁴. The MLP network is initialized using He initialization (He et al., 2015)... For the D3T model... all parameters are initialized using Xavier initialization (Glorot & Bengio, 2010) with a uniform distribution. |
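The reported hyperparameters in the experiment-setup row can be sketched as follows. This is a minimal pure-Python illustration of the stated schedule (initial learning rate 1×10⁻³, cut by 0.1 at epochs 20 and 40 over 50 epochs) and of the stated initialization schemes (He normal for the MLP, Xavier uniform for D3T); the layer sizes are arbitrary placeholders, and this is not the authors' implementation.

```python
import math
import random

def lr_at_epoch(epoch, base_lr=1e-3, milestones=(20, 40), gamma=0.1):
    """Step schedule from the paper: the LR is multiplied by `gamma`
    at each milestone epoch (the 20th and 40th of 50 total)."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)

def he_normal(fan_in, fan_out, seed=0):
    """He initialization (reported for the MLP network):
    weights ~ N(0, sqrt(2 / fan_in))."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def xavier_uniform(fan_in, fan_out, seed=0):
    """Xavier uniform initialization (reported for the D3T network):
    weights ~ U(-b, b) with b = sqrt(6 / (fan_in + fan_out))."""
    rng = random.Random(seed)
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)] for _ in range(fan_in)]

# LR trajectory over the 50 detection-network epochs: 1e-3, then 1e-4
# after epoch 20, then 1e-5 after epoch 40.
schedule = [lr_at_epoch(e) for e in range(50)]
print(schedule[0], schedule[20], schedule[40])
```

The LLM fine-tuning stage described in the same row would use its own constant rate of 3×10⁻⁴ with LoRA rank 8; those values are configured in the LoRA setup rather than in a schedule like the one above.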