MRR-FV: Unlocking Complex Fact Verification with Multi-Hop Retrieval and Reasoning

Authors: Liwen Zheng, Chaozhuo Li, Litian Zhang, Haoran Jia, Senzhang Wang, Zheng Liu, Xi Zhang

AAAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "Experimental evaluations on the FEVER and HOVER datasets demonstrate the superior performance of our model in both claim verification and evidence retrieval tasks." "Experimental results on two datasets demonstrate the superiority of our approach." "This section describes the dataset, evaluation metrics, and baselines of our experiments." "Table 1 presents the performance results of our proposed model MRR-FV on the FEVER dataset, compared to the baselines for fact verification." "As illustrated in Table 6, we design ablation studies to verify the effectiveness of core modules."
Researcher Affiliation | Collaboration | 1Key Laboratory of Trustworthy Distributed Computing and Service (MoE), Beijing University of Posts and Telecommunications, China; 2Beijing University of Aeronautics and Astronautics, Beijing 100191, China; 3Central South University, China; 4BAAI, China
Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about providing source code, nor does it include links to a code repository or mention code in supplementary materials.
Open Datasets | Yes | "We conduct our evaluations using the large-scale dataset FEVER (Thorne et al. 2018) and HOVER (Jiang et al. 2020), a multi-hop fact-verification dataset."
Dataset Splits | Yes | "Besides, we use the dev set of HOVER for evaluation since the test sets are not publicly released." All claims in both FEVER and HOVER are classified by annotators as Supports, Refutes, or Not Enough Info. Following previous studies, the paper employs Label Accuracy (LA) and the FEVER score as evaluation metrics for claim verification on the FEVER dataset (Hanselowski et al. 2018; Liu et al. 2020). For HOVER, it uses Macro-F1 for claim verification and F1-score for evidence retrieval. As depicted in Table 2, following ProgramFC, the HOVER validation set is divided into three subsets based on the number of hops required to verify the claim. On the dev set of HOVER, MRR-FV outperforms the SOTA baseline by 1.9%, 2.6%, and 3.9% on the two-hop, three-hop, and four-hop subsets, respectively.
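For reference, the two claim-verification metrics named above are straightforward to reproduce. The sketch below is a minimal pure-Python implementation of Label Accuracy and Macro-F1 over the three claim labels; the exact label strings and the toy gold/predicted lists are illustrative assumptions, not taken from the paper.

```python
LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]  # assumed label strings

def label_accuracy(y_true, y_pred):
    # Fraction of claims whose predicted label matches the gold label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels=LABELS):
    # Unweighted mean of per-label F1 scores, as used for HOVER verification.
    f1s = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
    return sum(f1s) / len(f1s)

# Toy example (hypothetical predictions):
gold = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO", "SUPPORTS"]
pred = ["SUPPORTS", "SUPPORTS", "NOT ENOUGH INFO", "SUPPORTS"]
print(label_accuracy(gold, pred))          # 0.75
print(round(macro_f1(gold, pred), 3))      # 0.6
```

Note that the FEVER score is stricter than Label Accuracy: it additionally requires that a complete gold evidence set be retrieved, so it cannot be computed from labels alone.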
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions the use of "the autoregressive generative model BART (Lewis et al. 2020)" but does not provide specific version numbers for any software or libraries.
Experiment Setup | Yes | Hyperparameter Sensitivity Analysis: "As depicted in Figure 3, Length and Hops dictate the maximum length of the compressed query and the pre-set number of hops for multi-hop retrieval, respectively." The experimental results in Figure 3(a) suggest that both excessively long and overly short compressed text can lead to a decline in performance. As depicted in Figure 3(b), the evaluation metric on the HOVER dataset is the average Macro-F1 across the three subsets. Similarly, an excessively high or low number of hops can also lead to a decline in performance.