Steer LLM Latents for Hallucination Detection

Authors: Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications. ... Extensive experiments demonstrate the strong performance of our method across diverse datasets. ... In Table 1, we compare TSV with competitive hallucination detection methods from the literature. ... 5.3. Ablation studies
Researcher Affiliation Academia 1Department of Computer Sciences, University of Wisconsin-Madison 2School of Software Technology, Zhejiang University. Correspondence to: Yixuan Li <EMAIL>.
Pseudocode Yes A Algorithms A.1. Overall training framework Algorithm 1 Overall training framework A.2. Sinkhorn algorithm Algorithm 2 Sinkhorn algorithm for entropic-regularized optimal transport
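The paper's Algorithm 2 is the Sinkhorn algorithm for entropic-regularized optimal transport (following Caron et al., 2020). As a point of reference for the pseudocode, here is a minimal pure-Python sketch of that standard procedure; the function name and interface are illustrative, not taken from the released code.

```python
import math

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Entropic-regularized optimal transport via Sinkhorn iterations.

    scores: B x C list of similarity logits (B samples, C class prototypes).
    Alternately normalizes columns (each prototype receives equal total mass)
    and rows (each sample distributes unit mass), then rescales rows so each
    is a probability distribution usable as a soft assignment.
    """
    B, C = len(scores), len(scores[0])
    # Initialize the transport plan from exponentiated, temperature-scaled scores.
    Q = [[math.exp(s / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column normalization: each of the C columns sums to 1 / C.
        col_sums = [sum(Q[b][c] for b in range(B)) for c in range(C)]
        Q = [[Q[b][c] / (C * col_sums[c]) for c in range(C)] for b in range(B)]
        # Row normalization: each of the B rows sums to 1 / B.
        for b in range(B):
            row_sum = sum(Q[b])
            Q[b] = [q / (B * row_sum) for q in Q[b]]
    # Rescale so each row sums to 1 (a valid soft label per sample).
    return [[q * B for q in row] for row in Q]
```

The defaults mirror the paper's reported settings (3 iterations, regularization 0.05).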
Open Source Code Yes Code is available at: https://github.com/deeplearning-wisc/tsv.
Open Datasets Yes We evaluate our method on four generative question-answering (QA) tasks: three open-domain QA datasets TruthfulQA (Lin et al., 2022a), TriviaQA (Joshi et al., 2017), and NQ Open (Kwiatkowski et al., 2019); and a domain-specific QA dataset SciQ (Welbl et al., 2017). ... TruthfulQA (Lin et al., 2022a), TriviaQA (Joshi et al., 2017), SciQ (Welbl et al., 2017), and NQ Open (Kwiatkowski et al., 2019).
Dataset Splits Yes For evaluation, 25% of the QA pairs from each dataset are reserved for testing. Consistent with Du et al. (2024), 100 QA pairs are used for validation, while the remaining samples simulate the unlabeled training dataset.
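The described protocol (25% of QA pairs held out for testing, 100 pairs for validation, the remainder as the unlabeled training pool) can be sketched as follows; the function name, seeding, and shuffling details are assumptions for illustration and may differ from the released code.

```python
import random

def split_qa_pairs(pairs, seed=0):
    """Illustrative split per the paper's protocol (cf. Du et al., 2024):
    25% test, 100 validation, remainder simulates the unlabeled train set."""
    rng = random.Random(seed)
    pairs = pairs[:]          # copy so the caller's list is untouched
    rng.shuffle(pairs)
    n_test = len(pairs) // 4  # 25% reserved for testing
    test, rest = pairs[:n_test], pairs[n_test:]
    val, unlabeled = rest[:100], rest[100:]
    return unlabeled, val, test
```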
Hardware Specification Yes We conducted all experiments using Python 3.8.15 and PyTorch 2.3.1 (Paszke et al., 2019) on NVIDIA A100 GPUs.
Software Dependencies Yes We conducted all experiments using Python 3.8.15 and PyTorch 2.3.1 (Paszke et al., 2019) on NVIDIA A100 GPUs.
Experiment Setup Yes Class prototypes µc and TSV v are randomly initialized and trained in two stages: 20 epochs using only the exemplar set, followed by an additional 20 epochs after augmentation. Training is performed using the AdamW optimizer (Loshchilov, 2019), with a learning rate of 5e-03 and a batch size of 128. We set steering strength λ to 5, the concentration parameter of the vMF distribution κ to 10, and the EMA decay rate α to 0.99. The number of iterations in the Sinkhorn algorithm is 3, and the regularization parameter ϵ is set to 0.05, following Caron et al. (2020).
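For quick reference, the reported hyperparameters can be collected into a single configuration; the key names below are illustrative and not taken from the released code.

```python
# Hyperparameters as reported in the paper's experiment setup.
config = {
    "epochs_stage1": 20,       # training on the exemplar set only
    "epochs_stage2": 20,       # continued training after augmentation
    "optimizer": "AdamW",      # Loshchilov (2019)
    "learning_rate": 5e-3,
    "batch_size": 128,
    "steering_strength_lambda": 5,
    "vmf_concentration_kappa": 10,
    "ema_decay_alpha": 0.99,
    "sinkhorn_iterations": 3,
    "sinkhorn_epsilon": 0.05,  # entropic regularization, following Caron et al. (2020)
}
```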