Integrative Decoding: Improving Factuality via Implicit Self-consistency
Authors: Yi Cheng, Xiao Liang, Yeyun Gong, Wen Xiao, Song Wang, Yuji Zhang, Wenjun Hou, Kaishuai Xu, Wenge Liu, Wenjie Li, Jian Jiao, Qi Chen, Peng Cheng, Wayne Xiong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluation shows that ID consistently enhances factuality over a wide range of language models, with substantial improvements on the TruthfulQA (+11.2%), Biographies (+15.4%) and LongFact (+8.5%) benchmarks. We evaluate ID over six series of LLMs with varying scales. |
| Researcher Affiliation | Collaboration | 1The Hong Kong Polytechnic University 2Tsinghua University 3Microsoft Research 4Microsoft Azure AI 5University of Illinois at Urbana-Champaign |
| Pseudocode | No | The paper describes the workflow of integrative decoding through a diagram (Figure 1) and mathematical equations (e.g., Equation 8) and prose, but does not include a distinct pseudocode or algorithm block. |
| Open Source Code | Yes | All codes and data are available at https://github.com/YiCheng98/IntegrativeDecoding. |
| Open Datasets | Yes | TruthfulQA (Lin et al., 2022) consists of 817 questions... Biographies (Du et al., 2024) requires generating bullet point biographies... LongFact-Objects (Wei et al., 2024) requests detailed descriptions... |
| Dataset Splits | Yes | We split TruthfulQA into 410 samples for testing and 407 samples for validation, and divided Biographies into 128 samples for evaluation and 122 samples for validation. |
| Hardware Specification | Yes | The experiments involving model scales larger than 13B (Figure 5) were conducted on 4 H100 80GB GPUs. All other experiments were conducted on a single A100 80GB GPU. |
| Software Dependencies | No | The paper mentions using the Transformers library and models such as GPT-4 and LLaMA3.1-70B-Instruct, but does not provide specific version numbers for general software dependencies such as Python, PyTorch, or Transformers itself. |
| Experiment Setup | Yes | The sampled responses were all obtained via temperature sampling with T = 0.7 when implementing USC, SR, and ID in the main experiments. For USC, SR, and ID, we searched for the optimal number of sampled responses to integrate from k = {1, 4, 8, 12, 16}... In Section 3.5: We configure the number of sampled responses to 4 and the batch size to 64. |
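Since the paper presents integrative decoding only through a diagram, prose, and Equation 8 rather than pseudocode, a toy sketch of its core aggregation step may help. The following is an assumption-laden illustration, not the authors' implementation: it assumes ID conditions the model on each of the k sampled responses separately and, at every decoding step, picks the token that maximizes the summed per-context log-probabilities. The function name and the toy logits are hypothetical.

```python
import numpy as np

def integrative_decode_step(next_token_logits_per_context):
    """Aggregate next-token predictions across k contexts.

    Each row holds the logits produced by conditioning the model on one
    sampled response plus the original prompt. This sketch sums the
    per-context log-probabilities and greedily picks the argmax token;
    the paper's actual selection criterion is given by its Equation 8.
    """
    logits = np.asarray(next_token_logits_per_context, dtype=float)
    # convert each row of logits to log-probabilities (log-softmax)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # sum evidence over the k contexts, then pick the best token id
    return int(log_probs.sum(axis=0).argmax())

# toy example: k = 3 contexts over a 4-token vocabulary
toy_logits = [
    [2.0, 0.1, 0.0, -1.0],
    [1.5, 0.2, 0.3, -0.5],
    [1.8, 0.0, 0.1, -0.8],
]
print(integrative_decode_step(toy_logits))  # token 0 wins in every context
```

In a real run, each row would come from a separate forward pass of the same LLM over one "response + prompt" input, and the selected token would be appended to all k inputs before the next step.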