Integrative Decoding: Improving Factuality via Implicit Self-consistency
Authors: Yi Cheng, Xiao Liang, Yeyun Gong, Wen Xiao, Song Wang, Yuji Zhang, Wenjun Hou, Kaishuai Xu, Wenge Liu, Wenjie Li, Jian Jiao, Qi Chen, Peng Cheng, Wayne Xiong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluation shows that ID consistently enhances factuality over a wide range of language models, with substantial improvements on the TruthfulQA (+11.2%), Biographies (+15.4%) and LongFact (+8.5%) benchmarks. We evaluate ID over six series of LLMs with varying scales. |
| Researcher Affiliation | Collaboration | 1The Hong Kong Polytechnic University 2Tsinghua University 3Microsoft Research 4Microsoft Azure AI 5University of Illinois at Urbana-Champaign |
| Pseudocode | No | The paper describes the workflow of integrative decoding through a diagram (Figure 1) and mathematical equations (e.g., Equation 8) and prose, but does not include a distinct pseudocode or algorithm block. |
| Open Source Code | Yes | All codes and data are available at https://github.com/YiCheng98/IntegrativeDecoding. |
| Open Datasets | Yes | TruthfulQA (Lin et al., 2022) consists of 817 questions... Biographies (Du et al., 2024) requires generating bullet point biographies... LongFact-Objects (Wei et al., 2024) requests detailed descriptions... |
| Dataset Splits | Yes | We split TruthfulQA into 410 samples for testing and 407 samples for validation, and divided Biographies into 128 samples for evaluation and 122 samples for validation. |
| Hardware Specification | Yes | The experiments involving model scales larger than 13B (Figure 5) were conducted on 4 H100 80GB GPUs. All other experiments were conducted on a single A100 80GB GPU. |
| Software Dependencies | No | The paper mentions using the Transformers library and models such as GPT-4 and LLaMA3.1-70B-Instruct, but does not provide specific version numbers for general software dependencies such as Python, PyTorch, or Transformers itself. |
| Experiment Setup | Yes | The sampled responses were all obtained via temperature sampling with T = 0.7 when implementing USC, SR, and ID in the main experiments. For USC, SR, and ID, we searched for the optimal number of sampled responses to integrate from k = {1, 4, 8, 12, 16}... In Section 3.5: We configure the number of sampled responses to 4 and the batch size to 64. |
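Since the paper presents integrative decoding only through a diagram, prose, and Equation 8 rather than pseudocode, a toy sketch of its core aggregation step may help. The following is an assumption-laden illustration, not the authors' implementation: it assumes ID conditions the model on each of the k sampled responses separately and, at every decoding step, picks the token that maximizes the summed per-context log-probabilities. The function name and the toy logits are hypothetical.

```python
import numpy as np

def integrative_decode_step(next_token_logits_per_context):
    """Aggregate next-token predictions across k contexts.

    Each row holds the logits produced by conditioning the model on one
    sampled response plus the original prompt. This sketch sums the
    per-context log-probabilities and greedily picks the argmax token;
    the paper's actual selection criterion is given by its Equation 8.
    """
    logits = np.asarray(next_token_logits_per_context, dtype=float)
    # convert each row of logits to log-probabilities (log-softmax)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # sum evidence over the k contexts, then pick the best token id
    return int(log_probs.sum(axis=0).argmax())

# toy example: k = 3 contexts over a 4-token vocabulary
toy_logits = [
    [2.0, 0.1, 0.0, -1.0],
    [1.5, 0.2, 0.3, -0.5],
    [1.8, 0.0, 0.1, -0.8],
]
print(integrative_decode_step(toy_logits))  # token 0 wins in every context
```

In a real run, each row would come from a separate forward pass of the same LLM over one "response + prompt" input, and the selected token would be appended to all k inputs before the next step.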