TruthFlow: Truthful LLM Generation via Representation Flow Correction

Authors: Hanyu Wang, Bochuan Cao, Yuanpu Cao, Jinghui Chen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that TruthFlow significantly improves performance on open-ended generation tasks across various advanced LLMs evaluated on TruthfulQA. Moreover, the trained TruthFlow model exhibits strong transferability, performing effectively on other unseen hallucination benchmarks.
Researcher Affiliation | Academia | College of Information Sciences and Technology, The Pennsylvania State University, State College, PA, USA. Correspondence to: Hanyu Wang <EMAIL>, Jinghui Chen <EMAIL>.
Pseudocode | Yes | Algorithm 1: Training; Algorithm 2: Obtain Query-specific Directions; Algorithm 3: Midpoint Method for Flow ODE
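Algorithm 3 names the classical midpoint (second-order Runge-Kutta) solver. A minimal sketch of that integrator, assuming a generic velocity field `v(x, t)` and a fixed step count; the paper's actual state shapes and step settings are not reproduced here:

```python
def midpoint_integrate(v, x0, t0=0.0, t1=1.0, steps=10):
    """Integrate dx/dt = v(x, t) from t0 to t1 with the midpoint method.

    v: callable velocity field v(x, t) (hypothetical signature).
    x0: initial state; returned value is the state at t1.
    """
    dt = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        # Half Euler step to estimate the midpoint state,
        # then a full step using the slope evaluated at that midpoint.
        x_mid = x + 0.5 * dt * v(x, t)
        x = x + dt * v(x_mid, t + 0.5 * dt)
        t += dt
    return x
```

For scalar dx/dt = x on [0, 1] this returns approximately e, illustrating the method's second-order accuracy.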
Open Source Code | No | The paper does not contain an explicit statement or link for the release of its own source code. It references a publicly released model for a baseline (GRATH) in footnote 2, but not for the main methodology of this paper.
Open Datasets | Yes | In order to measure the truthfulness of LLMs, we mainly consider TruthfulQA (Lin et al., 2021). To assess the generalizability of our method, we apply TruthFlow, trained on the entire TruthfulQA dataset, to HaluEval (Li et al., 2023), Natural Questions (Kwiatkowski et al., 2019) (NQ), and TriviaQA (Joshi et al., 2017). First, we train and test TruthFlow on the MedHallu (Pandit et al., 2025) dataset to evaluate how truthful our method can be in the medical domain.
Dataset Splits | Yes | Following the experiment settings of previous work (Zhang et al., 2024; Li et al., 2024), we divide the whole TruthfulQA dataset in half: 408 data points as the training set and the 409 remaining data points as the test set. Similar to the experimental setting of Section 4, we randomly select 408 data points from the pqa_labeled subset as training data and 409 from the rest as the test set.
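The 408/409 split above can be sketched as follows; the seed value and the use of a shuffled index split are assumptions for illustration, not details taken from the paper:

```python
import random

def split_dataset(data, n_train=408, seed=0):
    """Shuffle indices with a fixed seed and split into train/test.

    With the 817 TruthfulQA examples this yields 408 train / 409 test.
    The seed and shuffling scheme are hypothetical.
    """
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)  # deterministic shuffle
    train = [data[i] for i in idx[:n_train]]
    test = [data[i] for i in idx[n_train:]]
    return train, test
```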
Hardware Specification | Yes | All the experiments are done on a single NVIDIA RTX A6000 48GB GPU.
Software Dependencies | Yes | We use the gpt-4-0613 API.
Experiment Setup | Yes | We use the AdamW optimizer with learning rate 1e-4 and a 100-step cosine-schedule warmup. The training batch size is set to 136 and the number of epochs is 25 by default.

Table 7: Hyperparameters for TruthFlow across all LLMs used in our experiments.

Model       | Num Epochs | Layer | α   | k
Llama2-7B   | 25         | 12    | 3.0 | 20
Llama2-13B  | 45         | 13    | 1.8 | 20
Llama3      | 25         | 12    | 4.3 | 10
Mistral2    | 25         | 13    | 2.5 | 20
Mistral3    | 25         | 13    | 4.0 | 12
Gemma2      | 40         | 20    | 1.5 | 20
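The learning-rate schedule above (1e-4 with a 100-step cosine-schedule warmup) can be sketched as a step-to-rate function. This assumes linear warmup followed by cosine decay to zero; the paper's exact schedule shape and the `total_steps` value are hypothetical placeholders:

```python
import math

def lr_at_step(step, base_lr=1e-4, warmup_steps=100, total_steps=2500):
    """Learning rate at a given optimizer step (0-indexed).

    Linear warmup over warmup_steps, then cosine decay to 0 by
    total_steps. Shape and total_steps are assumptions.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear ramp-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```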