Knowledge Editing for Multi-Hop Question Answering Using Semantic Analysis

Authors: Dominic Simon, Rickard Ewetz

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate the effectiveness of CHECK against five state-of-the-art frameworks on four datasets and achieve an average 22.8% improved MQA accuracy."
Researcher Affiliation | Academia | Dominic Simon, Rickard Ewetz, University of Florida, EMAIL
Pseudocode | No | The paper describes the methodology of the CHECK framework in Section 4, detailing steps such as type extraction, question decomposition, and subquestion resolution, and includes flow diagrams (Figures 1, 2, and 3). However, it does not present any explicit pseudocode blocks or algorithm listings.
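Since the paper conveys these steps only in prose and flow diagrams, the control flow they describe can be sketched as a toy Python example. All function names and the fact lookup here are hypothetical illustrations of the described steps (type extraction, question decomposition, subquestion resolution), not the authors' implementation, which in reality drives an LLM rather than dictionaries.

```python
def extract_type(question):
    # Toy stand-in for type extraction: tag a question by its leading wh-word.
    return question.split()[0].lower()

def decompose(question, hops):
    # Toy stand-in for question decomposition: in the real framework an LLM
    # produces the subquestions; here they are supplied as a template list.
    return list(hops)

def resolve(subquestion, edit_memory, base_facts):
    # Toy subquestion resolution: prefer edited facts over original knowledge.
    return edit_memory.get(subquestion, base_facts.get(subquestion))

def answer_multihop(question, hops, edit_memory, base_facts):
    # Chain the subquestions, substituting each answer into the next hop.
    entity = None
    for template in decompose(question, hops):
        sub = template.replace("{e}", entity or "")
        entity = resolve(sub, edit_memory, base_facts)
    return entity
```

For example, with base facts {"Who is the CEO of X?": "Alice", "Where was Alice born?": "Paris"} and an edit mapping "Where was Alice born?" to "London", answering the two-hop question via the templates ["Who is the CEO of X?", "Where was {e} born?"] returns the edited answer "London".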
Open Source Code | Yes | The code for CHECK is available at https://github.com/dominic-simon/CHECK.
Open Datasets | Yes | "We use the MQuAKE [Zhong et al., 2023] dataset to evaluate the editors."
Dataset Splits | No | The paper describes the composition of the MQuAKE dataset and its subsets (e.g., "The counterfactual subset contains 3000 edit cases... The temporal subset is composed of 1868 edit cases...") but does not specify how these datasets were divided into training, validation, or test sets for the experiments.
Hardware Specification | Yes | "All experiments were conducted on 1 NVIDIA A100 GPU and 8 CPU cores."
Software Dependencies | No | The paper mentions several models and frameworks (e.g., the ReFinED entity linking model, the Contriever dense retrieval model, GPT-J, Vicuna-7B, Falcon-7B) but does not provide version numbers for any software libraries, programming languages, or environments used to implement the methodology.
Experiment Setup | Yes | "CHECK used a cosine similarity threshold of 0.8 and was limited to a maximum of 50 new tokens per model call." The temperature is swept from 0.0 to 1.0 in increments of 0.1.
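The reported setup reduces to three concrete knobs: a similarity threshold, a generation-length cap, and a temperature sweep. A minimal sketch of how these settings could be expressed, assuming a plain cosine-similarity check (the constant names and helper functions are illustrative, not from the paper's code):

```python
import math

# Settings reported in the paper's experiment setup.
SIM_THRESHOLD = 0.8    # cosine similarity threshold used by CHECK
MAX_NEW_TOKENS = 50    # cap on generated tokens per model call
TEMPERATURES = [round(t * 0.1, 1) for t in range(11)]  # 0.0 to 1.0, step 0.1

def cosine_similarity(a, b):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def passes_threshold(a, b):
    # Accept a candidate match only above the reported 0.8 threshold.
    return cosine_similarity(a, b) >= SIM_THRESHOLD
```

At temperature 0.0 decoding is effectively greedy, so the sweep spans fully deterministic through increasingly stochastic generation.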