Enhancing Multi-Hop Fact Verification with Structured Knowledge-Augmented Large Language Models

Authors: Han Cao, Lingwei Wei, Wei Zhou, Songlin Hu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on several commonly used multi-hop fact verification datasets, FEVER (Thorne et al. 2018b) and HOVER (Jiang et al. 2020), to assess the effectiveness of LLM-SKAN. The experimental results on four commonly used datasets demonstrate the effectiveness and superiority of our model.
Researcher Affiliation | Academia | Han Cao1,2, Lingwei Wei1*, Wei Zhou1, Songlin Hu1,2 — 1Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; 2School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes methods using natural language descriptions and mathematical equations, such as for the LLM-driven Knowledge Extractor prompt, graph neural network updates, and classification, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code will be released at https://github.com/HanCao12/LLM-SKAN
Open Datasets | Yes | To evaluate the effectiveness of LLM-SKAN for both single-hop and multi-hop fact verification tasks, we choose 4 public benchmarks, FEVER (Thorne et al. 2018b) and 2-, 3-, and 4-hop HOVER (Jiang et al. 2020), to conduct experiments.
Dataset Splits | Yes | The statistics are shown in Table 2.
Dataset | Train | Dev | Test
FEVER | 145,449 | 19,998 | 19,998
2-hop HOVER | 9,052 | 1,126 | 1,333
3-hop HOVER | 6,084 | 1,835 | 1,333
4-hop HOVER | 33,035 | 1,039 | 1,333
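For quick sanity checks when reproducing the paper, the split statistics above can be captured in a short script. This is a sketch: the numbers are copied verbatim from the table, while the structure and helper name are our own.

```python
# Dataset split sizes (train, dev, test) as reported in Table 2 of the paper.
SPLITS = {
    "FEVER":       (145_449, 19_998, 19_998),
    "2-hop HOVER": (9_052, 1_126, 1_333),
    "3-hop HOVER": (6_084, 1_835, 1_333),
    "4-hop HOVER": (33_035, 1_039, 1_333),
}

def total_examples(name: str) -> int:
    """Sum of train/dev/test sizes for one benchmark."""
    return sum(SPLITS[name])

for name in SPLITS:
    print(f"{name}: {total_examples(name)} total examples")
```

Comparing these totals against the counts of a downloaded copy of each dataset is a cheap way to confirm you are using the same splits as the paper.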
Hardware Specification | Yes | We use a Tesla V100-PCIE GPU with 32GB memory for all experiments and implement our model via the PyTorch framework.
Software Dependencies | No | The paper mentions implementing the model via the PyTorch framework and fine-tuning Llama2-7b but does not specify version numbers for these or any other software components.
Experiment Setup | Yes | The number of attention heads is set to 8. The batch size is 24. We set the learning rate as 2e-4. To keep consistency, we set the number of nodes of each relation graph to the maximum 20.
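The reported hyperparameters can be collected into a single configuration object for a reproduction attempt. This is a minimal sketch: only the values come from the paper; the field names are illustrative and not taken from the released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SKANConfig:
    """Hyperparameters reported in the paper's experiment setup."""
    num_attention_heads: int = 8        # multi-head attention heads
    batch_size: int = 24
    learning_rate: float = 2e-4
    max_nodes_per_relation_graph: int = 20  # node cap per relation graph

cfg = SKANConfig()
print(cfg)
```

A frozen dataclass keeps the reproduction settings immutable and printable, so the exact configuration can be logged alongside each run.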