reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

BiDeV: Bilateral Defusing Verification for Complex Claim Fact-Checking

Authors: Yuxuan Liu, Hongda Sun, Wenya Guo, Xinyan Xiao, Cunli Mao, Zhengtao Yu, Rui Yan

AAAI 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experimental results on two widely used challenging fact-checking benchmarks (Hover and Feverous-s) demonstrate that our BiDeV can achieve the best performance under both gold and open settings.
Researcher Affiliation	Collaboration	1 Gaoling School of Artificial Intelligence, Renmin University of China 2 Nankai University 3 Baidu Inc. 4 Kunming University of Science and Technology EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode	No	The overview of our BiDeV is shown in Figure 2. In the subsequent sections, we will introduce how to integrate LLMs to eliminate the vagueness in the claim and the redundancy in the evidence. (Figure 2 is a diagram, not pseudocode.) Figure 7: Case Study of selected baselines (FOLK and ProgramFC) and our BiDeV. (The pseudocode-like structures in Figure 7 belong to the baselines, not to BiDeV's core algorithm.)
Open Source Code	Yes	Code: https://github.com/EthanLeo-LYX/BiDeV
Open Datasets	Yes	Datasets. There are two widely used and challenging datasets to evaluate the fact-checking performance of baselines and our BiDeV: (i) Hover (Jiang et al. 2020) and (ii) Feverous-s (Pan et al. 2023).
Dataset Splits	Yes	Datasets. There are two widely used and challenging datasets to evaluate the fact-checking performance of baselines and our BiDeV: (i) Hover (Jiang et al. 2020) and (ii) Feverous-s (Pan et al. 2023).
Hardware Specification	No	In our proposed method, we use gpt-3.5-turbo as the base model of Perceptor, Rewriter, Decomposer, and Filter by accessing the OpenAI API with few-shot demonstrations. For a fair comparison, we leverage Flan-T5-XL (3B) as the Querier and Checker without additional fine-tuning. The paper does not provide specific hardware details such as GPU/CPU models.
Software Dependencies	No	In our proposed method, we use gpt-3.5-turbo as the base model of Perceptor, Rewriter, Decomposer, and Filter by accessing the OpenAI API with few-shot demonstrations. For a fair comparison, we leverage Flan-T5-XL (3B) as the Querier and Checker without additional fine-tuning. The paper does not specify software versions for reproducibility.
Experiment Setup	Yes	In the vagueness defusing, we iteratively perceive-then-rewrite for 3 rounds.
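
Illustrative sketch (not taken from the paper or its repository): the Experiment Setup entry describes a perceive-then-rewrite loop run for 3 rounds. The Python below shows one way such a loop could be structured. The function name `defuse_vagueness`, the callable signatures, and the early exit when no vague spans are found are all assumptions made for illustration; in the paper the Perceptor and Rewriter are gpt-3.5-turbo prompts, which are replaced here by toy stand-ins so the example runs offline.

```python
from typing import Callable, List


def defuse_vagueness(
    claim: str,
    perceive: Callable[[str], List[str]],
    rewrite: Callable[[str, List[str]], str],
    rounds: int = 3,
) -> str:
    """Run up to `rounds` perceive-then-rewrite iterations on a claim.

    `perceive` and `rewrite` are hypothetical stand-ins for the paper's
    Perceptor and Rewriter modules; their signatures are assumed here.
    """
    for _ in range(rounds):
        vague_spans = perceive(claim)
        if not vague_spans:
            # Assumption: stop early once nothing vague remains.
            break
        claim = rewrite(claim, vague_spans)
    return claim


# Toy stand-ins so the sketch is runnable without any API access.
def toy_perceive(claim: str) -> List[str]:
    # Flag the pronoun "it" as a vague span if it appears as a word.
    return [word for word in claim.split() if word == "it"]


def toy_rewrite(claim: str, vague_spans: List[str]) -> str:
    # Replace the first flagged span with a concrete (made-up) referent.
    return claim.replace(vague_spans[0], "the film", 1)


if __name__ == "__main__":
    print(defuse_vagueness("it was released in 1999", toy_perceive, toy_rewrite))
```

With real models, `perceive` and `rewrite` would wrap few-shot prompts; the round limit of 3 matches the value quoted in the table, while everything else in the sketch is a placeholder.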