Knowledge Editing with Dynamic Knowledge Graphs for Multi-Hop Question Answering

Authors: Yifan Lu, Yigeng Zhou, Jing Li, Yequan Wang, Xuebo Liu, Daojing He, Fangming Liu, Min Zhang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on benchmarks show that KEDKG surpasses previous state-of-the-art models, delivering more accurate and reliable answers in environments with dynamic information. We conduct extensive experiments across various LLMs and datasets to validate the effectiveness and usability of KEDKG. Our empirical results and analysis demonstrate that KEDKG significantly outperforms the advanced existing baselines, achieving superior performance.
Researcher Affiliation | Academia | Harbin Institute of Technology, Shenzhen, China; Beijing Academy of Artificial Intelligence, Beijing, China; Pengcheng Laboratory, Shenzhen, China
Pseudocode | No | The paper describes the methodology using text and a system diagram (Figure 2), but does not contain a dedicated pseudocode or algorithm block.
Open Source Code | No | The paper does not contain an unambiguous statement or a direct link indicating the release of source code for the methodology described.
Open Datasets | Yes | We evaluate KEDKG using the MQuAKE dataset. MQuAKE is a knowledge editing benchmark for multi-hop QA, comprising MQuAKE-CF based on counterfactual editing and MQuAKE-T based on temporal knowledge updates. We use MQuAKE-CF as the training set, which contains 9,218 data points, and MQuAKE-CF-3k as the test set, which includes 3,000 data points.
Dataset Splits | Yes | We use MQuAKE-CF as the training set, which contains 9,218 data points, and MQuAKE-CF-3k as the test set, which includes 3,000 data points.
Hardware Specification | Yes | All our experiments are carried out on a machine with 8 NVIDIA A800-SXM4-80G GPUs.
Software Dependencies | No | We train an entity detector and a relation detector based on the DistilBERT (Sanh 2019) model and fine-tune the Llama 2-7B model for the question decomposition task. In addition, we use REBEL (Cabot and Navigli 2021) as our relation extraction model and the spaCy entity linker as the entity linking model. While specific models are named, explicit version numbers for DistilBERT, REBEL, and the spaCy entity linker are not provided.
Experiment Setup | Yes | If the highest probability p exceeds a threshold α, which is set to 0.5 in our experiments, we can retrieve the corresponding fact triple (s, r, o) and use o as the retrieval answer. We train an entity detector and a relation detector based on the DistilBERT (Sanh 2019) model and fine-tune the Llama 2-7B model for the question decomposition task.
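The thresholded retrieval rule quoted above can be sketched as a small, self-contained example. Note this is a minimal illustration under assumed inputs: the toy knowledge graph, entity/relation names, and candidate probabilities below are invented for demonstration, and KEDKG's actual detectors are DistilBERT-based classifiers rather than a precomputed score dictionary.

```python
# Sketch of threshold-based fact retrieval over an edited knowledge
# graph, following the rule: if the top relation probability p exceeds
# alpha (0.5), retrieve the triple (s, r, o) and return o's triple.
# All graph contents and scores here are illustrative assumptions.

ALPHA = 0.5  # retrieval threshold from the paper's setup

# Toy dynamic knowledge graph: (subject, relation) -> object.
knowledge_graph = {
    ("United Kingdom", "head of government"): "Rishi Sunak",  # edited fact
    ("Rishi Sunak", "country of citizenship"): "United Kingdom",
}

def retrieve(subject, relation_scores):
    """Pick the highest-probability relation for the subject; return the
    fact triple (s, r, o) only if that probability exceeds ALPHA."""
    relation, p = max(relation_scores.items(), key=lambda kv: kv[1])
    if p > ALPHA and (subject, relation) in knowledge_graph:
        s, r = subject, relation
        o = knowledge_graph[(s, r)]
        return (s, r, o)
    return None  # no confident edited fact; fall back to the LLM

# Hypothetical relation-detector output for one decomposed sub-question:
confident = {"head of government": 0.91, "capital": 0.06}
print(retrieve("United Kingdom", confident))
# -> ('United Kingdom', 'head of government', 'Rishi Sunak')

uncertain = {"head of government": 0.30, "capital": 0.40}
print(retrieve("United Kingdom", uncertain))
# -> None (best score 0.40 is below alpha = 0.5)
```

The threshold acts as an abstention mechanism: when no relation is scored confidently, the sketch returns None instead of a low-confidence triple, mirroring the paper's fallback to the model's own knowledge.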