Hallucination Detection on a Budget: Efficient Bayesian Estimation of Semantic Entropy

Authors: Kamil Ciosek, Nicolò Felicioni, Sina Ghiassian

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate empirically that our approach systematically beats the baselines, requiring only 53% of samples used by Farquhar et al. (2024) to achieve the same quality of hallucination detection as measured by AUROC."
Researcher Affiliation | Industry | "Kamil Ciosek EMAIL Spotify; Nicolò Felicioni EMAIL Spotify; Sina Ghiassian EMAIL Spotify"
Pseudocode | Yes | "We summarize the ideas introduced in Sections 3.1, 3.2 and 3.3 in Algorithm 1. Algorithm 1 Estimate of Semantic Entropy for a prompt x."
Open Source Code | No | "We will release the source code for both stages upon acceptance."
Open Datasets | Yes | "We used the Trivia QA (Joshi et al., 2017), SQUAD (Rajpurkar et al., 2016), SVAMP (Patel et al., 2021) and NQ (Lee et al., 2019) datasets."
Dataset Splits | Yes | "We use the first 200 prompts from each derivative dataset as the training set and the remaining 800 as the test set."
Hardware Specification | Yes | "The computation stage that does inference in the LLM (which takes over a week on a single A100 80GB) is separated from the stage that estimates semantic entropy (which only uses the CPU, taking on the order of 12 minutes)."
Software Dependencies | No | The paper mentions "quantization settings" for the LLMs (8-bit for Llama-3.3-70B, 16-bit for Mistral, 32-bit for Llama-3.2 and Llama-2) but does not list specific software dependencies such as programming languages, libraries, or frameworks with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | "Following the methodology of Farquhar et al. (2024), the N LLM responses are generated with temperature 1.0. On the other hand, the LLM response about which we seek to determine if it is a hallucination is generated with temperature 0.1."
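The "Pseudocode" row quotes Algorithm 1, which estimates semantic entropy for a prompt from N sampled responses. As a rough illustration of the underlying quantity only (not the paper's Bayesian estimator), the sketch below clusters responses into semantic equivalence classes and computes the entropy of the empirical cluster distribution. The `equivalent` predicate is an assumption: Farquhar et al. (2024) use a bidirectional-entailment check, for which we substitute a deliberately naive normalized exact-match comparison.

```python
import math

def semantic_entropy(responses, equivalent=None):
    """Monte-Carlo estimate of semantic entropy over N sampled responses.

    `equivalent` decides whether two answers mean the same thing; the
    exact-match default is an illustrative stand-in for the bidirectional
    entailment check used by Farquhar et al. (2024).
    """
    if equivalent is None:
        equivalent = lambda a, b: a.strip().lower() == b.strip().lower()

    # Greedily group responses into semantic equivalence classes.
    clusters = []  # list of lists of mutually equivalent responses
    for r in responses:
        for c in clusters:
            if equivalent(r, c[0]):
                c.append(r)
                break
        else:
            clusters.append([r])

    # Empirical cluster probabilities -> discrete entropy (in nats).
    n = len(responses)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# Under exact match, "Paris." lands in its own cluster: {Paris, paris},
# {Paris.}, {Lyon}, giving cluster probabilities 0.5, 0.25, 0.25.
print(semantic_entropy(["Paris", "paris", "Paris.", "Lyon"]))
```

A concentrated answer distribution (all responses in one cluster) yields entropy 0, while spread across many clusters — the hallucination signal — yields high entropy.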
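The "Dataset Splits" and "Research Type" rows describe a 200-train / 800-test split per derivative dataset and evaluation by AUROC. The sketch below is a self-contained illustration under stated assumptions: the `auroc` helper, the prompt ids, and the scores/labels are all hypothetical, since the paper's evaluation pipeline is not yet released.

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive (hallucination, label 1) receives a higher
    score than a randomly chosen negative (label 0); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# The split described in the table: the first 200 prompts of each
# derivative dataset for training, the remaining 800 for testing.
prompts = [f"prompt-{i}" for i in range(1000)]  # made-up prompt ids
train, test = prompts[:200], prompts[200:]
assert (len(train), len(test)) == (200, 800)

# Hypothetical entropy scores and hallucination labels for four prompts;
# higher estimated entropy should rank hallucinations above faithful answers.
print(auroc([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0]))
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of hallucinations from faithful responses.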