Integrative Decoding: Improving Factuality via Implicit Self-consistency

Authors: Yi Cheng, Xiao Liang, Yeyun Gong, Wen Xiao, Song Wang, Yuji Zhang, Wenjun Hou, Kaishuai Xu, Wenge Liu, Wenjie Li, Jian Jiao, Qi Chen, Peng Cheng, Wayne Xiong

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluation shows that ID consistently enhances factuality over a wide range of language models, with substantial improvements on the TruthfulQA (+11.2%), Biographies (+15.4%), and LongFact (+8.5%) benchmarks. We evaluate ID over six series of LLMs with varying scales.
Researcher Affiliation | Collaboration | The Hong Kong Polytechnic University; Tsinghua University; Microsoft Research; Microsoft Azure AI; University of Illinois at Urbana-Champaign
Pseudocode | No | The paper describes the workflow of integrative decoding through a diagram (Figure 1), mathematical equations (e.g., Equation 8), and prose, but does not include a distinct pseudocode or algorithm block.
Open Source Code | Yes | All code and data are available at https://github.com/YiCheng98/IntegrativeDecoding.
Open Datasets | Yes | TruthfulQA (Lin et al., 2022) consists of 817 questions... Biographies (Du et al., 2024) requires generating bullet-point biographies... LongFact-Objects (Wei et al., 2024) requests detailed descriptions...
Dataset Splits | Yes | We split TruthfulQA into 410 samples for testing and 407 samples for validation, and divided Biographies into 128 samples for evaluation and 122 samples for validation.
Hardware Specification | Yes | The experiments involving model scales larger than 13B (Figure 5) were conducted on 4 H100 80GB GPUs. All other experiments were conducted on a single A100 80GB GPU.
Software Dependencies | No | The paper mentions using the Transformers library and models such as GPT-4 and LLaMA3.1-70B-Instruct, but does not provide specific version numbers for general software dependencies such as Python, PyTorch, or the Transformers library itself.
Experiment Setup | Yes | The sampled responses were all obtained via temperature sampling with T = 0.7 when implementing USC, SR, and ID in the main experiments. For USC, SR, and ID, we searched for the optimal number of sampled responses to integrate from k = {1, 4, 8, 12, 16}... In Section 3.5: We configure the number of sampled responses to 4 and the batch size to 64.
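Since the paper provides no pseudocode block, the following is a rough, self-contained sketch of the general idea described above: draw k responses via temperature sampling (T = 0.7 in the experiments), then decode a single final answer by averaging next-token logits across k branches, each conditioned on one sampled response. The toy vocabulary, transition scores, and the last-token context summary are all invented for illustration; this is not the paper's Equation 8 or its actual implementation.

```python
import math
import random

# Toy stand-in for an LLM: next-token logits depend only on the previous
# token. The vocabulary and all scores below are invented for illustration.
VOCAB = ["yes", "no", "<eos>"]
LOGITS = {
    None:  {"yes": 1.0,  "no": 0.8,  "<eos>": -2.0},
    "yes": {"yes": 1.2,  "no": -1.0, "<eos>": 0.5},
    "no":  {"yes": -1.0, "no": 1.2,  "<eos>": 0.5},
}

def sample_token(logits, temperature, rng):
    """Temperature sampling: softmax over logits / T, then draw one token."""
    scaled = {t: l / temperature for t, l in logits.items()}
    z = max(scaled.values())  # subtract max for numerical stability
    weights = {t: math.exp(l - z) for t, l in scaled.items()}
    r = rng.random() * sum(weights.values())
    acc = 0.0
    for tok, w in weights.items():
        acc += w
        if r <= acc:
            return tok
    return tok  # floating-point edge case

def sample_response(temperature, max_len, rng):
    """Draw one full response via temperature sampling."""
    out, last = [], None
    for _ in range(max_len):
        tok = sample_token(LOGITS[last], temperature, rng)
        if tok == "<eos>":
            break
        out.append(tok)
        last = tok
    return out

def averaged_logits(branches):
    """Average next-token logits across k branches (the aggregation step)."""
    return {t: sum(LOGITS[b][t] for b in branches) / len(branches)
            for t in VOCAB}

def integrative_decode(k=4, temperature=0.7, max_len=5, seed=0):
    """ID-style sketch: sample k responses, then greedily decode one final
    answer whose per-step logits are averaged over k branches, each branch
    initially conditioned on a different sampled response."""
    rng = random.Random(seed)
    responses = [sample_response(temperature, max_len, rng) for _ in range(k)]
    # Each branch's context is summarized by its response's last token
    # (a stand-in for concatenating the full response with the prompt).
    branches = [(r[-1] if r else None) for r in responses]
    final = []
    for _ in range(max_len):
        avg = averaged_logits(branches)
        tok = max(avg, key=avg.get)  # greedy over the averaged logits
        if tok == "<eos>":
            break
        final.append(tok)
        branches = [tok] * k  # generated tokens extend every branch
    return responses, final
```

With k = 4 and T = 0.7 (the reported main-experiment configuration), each sampled response may differ, but the averaged-logit decoding step pulls the final answer toward the consensus of the branches, which is the implicit self-consistency effect the report describes.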