MRR-FV: Unlocking Complex Fact Verification with Multi-Hop Retrieval and Reasoning

Authors: Liwen Zheng, Chaozhuo Li, Litian Zhang, Haoran Jia, Senzhang Wang, Zheng Liu, Xi Zhang

AAAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "Experimental evaluations on the FEVER and HOVER datasets demonstrate the superior performance of our model in both claim verification and evidence retrieval tasks." "Experimental results on two datasets demonstrate the superiority of our approach." "This section describes the dataset, evaluation metrics, and baselines of our experiments." "Table 1 presents the performance results of our proposed model MRR-FV on the FEVER dataset, compared to the baselines for fact verification." "As illustrated in Table 6, we design ablation studies to verify the effectiveness of core modules."
Researcher Affiliation | Collaboration | 1Key Laboratory of Trustworthy Distributed Computing and Service (MoE), Beijing University of Posts and Telecommunications, China; 2Beijing University of Aeronautics and Astronautics, Beijing 100191, China; 3Central South University, China; 4BAAI, China
Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about providing source code, nor does it include links to a code repository or mention code in supplementary materials.
Open Datasets | Yes | "We conduct our evaluations using the large-scale dataset FEVER (Thorne et al. 2018) and HOVER (Jiang et al. 2020), a multi-hop fact-verification dataset."
Dataset Splits | Yes | "Besides, we use the dev set of HOVER for evaluation since the test sets are not publicly released." All claims in both FEVER and HOVER are classified by annotators as Supports, Refutes, or Not Enough Info. Following previous studies, the paper employs Label Accuracy (LA) and the FEVER score as evaluation metrics for claim verification on the FEVER dataset (Hanselowski et al. 2018; Liu et al. 2020). For HOVER, it uses Macro-F1 for claim verification and F1-score for evidence retrieval. As depicted in Table 2, following ProgramFC, the HOVER validation set is divided into three subsets based on the number of hops required to verify the claim. On the dev set of HOVER, MRR-FV outperforms the SOTA baseline by 1.9%, 2.6%, and 3.9% on the two-hop, three-hop, and four-hop subsets, respectively.
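For reference, the two claim-verification metrics named above are straightforward to reproduce. The sketch below is a minimal pure-Python implementation of Label Accuracy and Macro-F1 over the three claim labels; the exact label strings and the toy gold/predicted lists are illustrative assumptions, not taken from the paper.

```python
LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]  # assumed label strings

def label_accuracy(y_true, y_pred):
    # Fraction of claims whose predicted label matches the gold label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels=LABELS):
    # Unweighted mean of per-label F1 scores, as used for HOVER verification.
    f1s = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
    return sum(f1s) / len(f1s)

# Toy example (hypothetical predictions):
gold = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO", "SUPPORTS"]
pred = ["SUPPORTS", "SUPPORTS", "NOT ENOUGH INFO", "SUPPORTS"]
print(label_accuracy(gold, pred))          # 0.75
print(round(macro_f1(gold, pred), 3))      # 0.6
```

Note that the FEVER score is stricter than Label Accuracy: it additionally requires that a complete gold evidence set be retrieved, so it cannot be computed from labels alone.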
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions the use of "the autoregressive generative model BART (Lewis et al. 2020)" but does not provide specific version numbers for any software or libraries.
Experiment Setup | Yes | Hyperparameter Sensitivity Analysis: "As depicted in Figure 3, Length and Hops dictate the maximum length of the compressed query and the pre-set number of hops for multi-hop retrieval, respectively." The experimental results in Figure 3(a) suggest that both excessively long and overly short compressed text can lead to a decline in performance. As depicted in Figure 3(b), the evaluation metric on the HOVER dataset is the average Macro-F1 across the three subsets. Similarly, an excessively high or low number of hops can also lead to a decline in performance.