TruthFlow: Truthful LLM Generation via Representation Flow Correction

Authors: Hanyu Wang, Bochuan Cao, Yuanpu Cao, Jinghui Chen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that TruthFlow significantly improves performance on open-ended generation tasks across various advanced LLMs evaluated on TruthfulQA. Moreover, the trained TruthFlow model exhibits strong transferability, performing effectively on other unseen hallucination benchmarks.
Researcher Affiliation | Academia | College of Information Sciences and Technology, The Pennsylvania State University, State College, PA, USA. Correspondence to: Hanyu Wang <EMAIL>, Jinghui Chen <EMAIL>.
Pseudocode | Yes | Algorithm 1: Training; Algorithm 2: Obtain Query-specific Directions; Algorithm 3: Midpoint Method for Flow ODE
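Algorithm 3 names the classical midpoint (second-order Runge-Kutta) solver. A minimal sketch of that integrator, assuming a generic velocity field `v(x, t)` and a fixed step count; the paper's actual state shapes and step settings are not reproduced here:

```python
def midpoint_integrate(v, x0, t0=0.0, t1=1.0, steps=10):
    """Integrate dx/dt = v(x, t) from t0 to t1 with the midpoint method.

    v: callable velocity field v(x, t) (hypothetical signature).
    x0: initial state; returned value is the state at t1.
    """
    dt = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        # Half Euler step to estimate the midpoint state,
        # then a full step using the slope evaluated at that midpoint.
        x_mid = x + 0.5 * dt * v(x, t)
        x = x + dt * v(x_mid, t + 0.5 * dt)
        t += dt
    return x
```

For scalar dx/dt = x on [0, 1] this returns approximately e, illustrating the method's second-order accuracy.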
Open Source Code | No | The paper does not contain an explicit statement or link for the release of its own source code. It references a publicly released model for a baseline (GRATH) in footnote 2, but not for the main methodology of this paper.
Open Datasets | Yes | In order to measure the truthfulness of LLMs, we mainly consider TruthfulQA (Lin et al., 2021). To assess the generalizability of our method, we apply TruthFlow, trained on the entire TruthfulQA dataset, to HaluEval (Li et al., 2023), Natural Questions (Kwiatkowski et al., 2019) (NQ), and TriviaQA (Joshi et al., 2017). First, we train and test TruthFlow on the MedHallu (Pandit et al., 2025) dataset to evaluate how truthful our method can be in the medical domain.
Dataset Splits | Yes | Following the experiment settings of previous work (Zhang et al., 2024; Li et al., 2024), we divide the whole TruthfulQA dataset in half: 408 data points as the training set and the 409 remaining data points as the test set. Similar to the experimental setting of Section 4, we randomly select 408 data points from the pqa_labeled subset as training data and 409 from the rest as the test set.
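The 408/409 split above can be sketched as follows; the seed value and the use of a shuffled index split are assumptions for illustration, not details taken from the paper:

```python
import random

def split_dataset(data, n_train=408, seed=0):
    """Shuffle indices with a fixed seed and split into train/test.

    With the 817 TruthfulQA examples this yields 408 train / 409 test.
    The seed and shuffling scheme are hypothetical.
    """
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)  # deterministic shuffle
    train = [data[i] for i in idx[:n_train]]
    test = [data[i] for i in idx[n_train:]]
    return train, test
```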
Hardware Specification | Yes | All the experiments are done on a single NVIDIA RTX A6000 48GB GPU.
Software Dependencies | Yes | We use the gpt-4-0613 API.
Experiment Setup | Yes | We use the AdamW optimizer with learning rate 1e-4 and a 100-step cosine-schedule warmup. The training batch size is set to 136 and the number of epochs is 25 by default.

Table 7: Hyperparameters for TruthFlow across all LLMs used in our experiments.

Model       | Num Epochs | Layer | α   | k
Llama2-7B   | 25         | 12    | 3.0 | 20
Llama2-13B  | 45         | 13    | 1.8 | 20
Llama3      | 25         | 12    | 4.3 | 10
Mistral2    | 25         | 13    | 2.5 | 20
Mistral3    | 25         | 13    | 4.0 | 12
Gemma2      | 40         | 20    | 1.5 | 20
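The learning-rate schedule above (1e-4 with a 100-step cosine-schedule warmup) can be sketched as a step-to-rate function. This assumes linear warmup followed by cosine decay to zero; the paper's exact schedule shape and the `total_steps` value are hypothetical placeholders:

```python
import math

def lr_at_step(step, base_lr=1e-4, warmup_steps=100, total_steps=2500):
    """Learning rate at a given optimizer step (0-indexed).

    Linear warmup over warmup_steps, then cosine decay to 0 by
    total_steps. Shape and total_steps are assumptions.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear ramp-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```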