MQuAKE-Remastered: Multi-Hop Knowledge Editing Can Only Be Advanced with Reliable Evaluations

Authors: Shaochen Zhong, Yifan (Louie) Lu, Lize Shao, Bhargav Bhushanam, Xiaocong Du, Yixin Wan, Yucheng Shi, Daochen Zha, Yiwei Wang, Ninghao Liu, Kaixiong Zhou, Shuai Xu, Kai-Wei Chang, Louis Feng, Vipin Chaudhary, Xia Ben Hu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Additionally, we benchmarked almost all proposed MQUAKE-evaluated editing methods on our post-fix dataset, MQUAKE-REMASTERED. We observe that many methods try to overfit the original MQUAKE by exploiting some dataset idiosyncrasies of MQUAKE. We provide a guideline on how to approach such datasets faithfully and show that a simple, minimally invasive approach GWalk can offer beyond SOTA editing performance without such exploitation. The MQUAKE-REMASTERED datasets and utilities are available at huggingface.co/datasets/henryzhongsc/MQuAKE-Remastered and github.com/henryzhongsc/MQuAKE-Remastered, respectively.
Researcher Affiliation | Collaboration | Department of Computer Science, Rice University; School of Computing, University of Georgia; Department of Electrical and Computer Engineering, North Carolina State University; Department of Computer and Data Sciences, Case Western Reserve University; Department of Computer Science, University of California, Los Angeles; Meta Platforms, Inc.
Pseudocode | Yes | We share the detailed pseudocode of GWalk in Algorithm 1 and demonstrate some further studies in Appendix H.
Open Source Code | Yes | The MQUAKE-REMASTERED datasets and utilities are available at huggingface.co/datasets/henryzhongsc/MQuAKE-Remastered and github.com/henryzhongsc/MQuAKE-Remastered, respectively.
Open Datasets | Yes | The MQUAKE-REMASTERED datasets and utilities are available at huggingface.co/datasets/henryzhongsc/MQuAKE-Remastered and github.com/henryzhongsc/MQuAKE-Remastered, respectively.
Dataset Splits | Yes | Datasets like MQUAKE-CF and MQUAKE-CF-3K are often tested under varying editing intensities based on the number of cases considered edited. This simulates different levels of deviation between the model's learned knowledge and the newly edited information. This approach is effective because strong knowledge editing methods should handle both large-scale updates and smaller, more localized edits, ensuring that the changes do not interfere with unrelated knowledge.
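The "editing intensity" setup described above can be illustrated with a short sketch. This is not code from the paper: `make_edit_batches` is a hypothetical helper that merely shows the idea of treating k of the N multi-hop cases as edited while holding out the rest as unedited controls.

```python
# Hedged sketch (not from the paper): simulating "k-edited" evaluation
# settings, where k cases are treated as edited and the remaining
# cases must stay unaffected by the edits.
import random

def make_edit_batches(cases, k, seed=0):
    """Split cases into an edited batch of size k and an unedited rest.

    `cases` is any list of case records; `k` is the editing intensity
    (e.g. 1, 100, or all cases in a 3K-case setting).
    """
    rng = random.Random(seed)
    shuffled = cases[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[:k], shuffled[k:]

edited, retained = make_edit_batches(list(range(3000)), k=100)
```

Sweeping k from a single edit up to the full dataset is one way to probe both localized and large-scale updates, matching the motivation quoted above.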
Hardware Specification | No | This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University (CWRU).
Software Dependencies | Yes | We opt to use lmsys/vicuna-7b-v1.5 (Zheng et al., 2023b), mistralai/Mistral-7B-Instruct-v0.2 (Jiang et al., 2023), and meta-llama/Meta-Llama-3-8B-Instruct (AI@Meta, 2024) as the choice of question-answering models, both for alignment with existing works (Zhong et al., 2023; Shi et al., 2024; Gu et al., 2024) as well as providing coverage of the most recent language models. For methods that require a text-embedding model as a retriever, we use facebook/contriever-msmarco (Izacard et al., 2022) for alignment with MeLLo (Zhong et al., 2023).
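Contriever-style retrievers such as facebook/contriever-msmarco typically produce a sentence embedding by mean-pooling token embeddings under the attention mask. The NumPy sketch below shows only that pooling step on dummy vectors; it is an illustrative assumption, not the actual model pipeline from the paper.

```python
# Hedged sketch: attention-masked mean pooling, the aggregation step
# commonly used by contriever-style retrievers (dummy data, no model).
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Masked mean over the sequence dimension.

    token_embeddings: (seq_len, dim) array of per-token vectors.
    attention_mask:   (seq_len,) array of 0/1 flags (1 = real token).
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

emb = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])
mask = np.array([1, 1, 0])  # third token is padding and is ignored
pooled = mean_pool(emb, mask)  # → [2.0, 3.0]
```

Masking before averaging keeps padding tokens from diluting the embedding, which matters when retrieval batches mix short and long questions.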
Experiment Setup | No | The paper does not explicitly detail hyperparameters or system-level training settings within the main text. It mentions different 'k-edited' settings for evaluation, but not training configuration or hyperparameter values.