MQuAKE-Remastered: Multi-Hop Knowledge Editing Can Only Be Advanced with Reliable Evaluations

Authors: Shaochen Zhong, Yifan (Louie) Lu, Lize Shao, Bhargav Bhushanam, Xiaocong Du, Yixin Wan, Yucheng Shi, Daochen Zha, Yiwei Wang, Ninghao Liu, Kaixiong Zhou, Shuai Xu, Kai-Wei Chang, Louis Feng, Vipin Chaudhary, Xia Ben Hu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Additionally, we benchmarked almost all proposed MQUAKE-evaluated editing methods on our post-fix dataset, MQUAKE-REMASTERED. We observe that many methods try to overfit the original MQUAKE by exploiting some dataset idiosyncrasies of MQUAKE. We provide a guideline on how to approach such datasets faithfully and show that a simple, minimally invasive approach GWalk can offer beyond SOTA editing performance without such exploitation. The MQUAKE-REMASTERED datasets and utilities are available at huggingface.co/datasets/henryzhongsc/MQuAKE-Remastered and github.com/henryzhongsc/MQuAKE-Remastered, respectively.
Researcher Affiliation | Collaboration | Department of Computer Science, Rice University; School of Computing, University of Georgia; Department of Electrical and Computer Engineering, North Carolina State University; Department of Computer and Data Sciences, Case Western Reserve University; Department of Computer Science, University of California, Los Angeles; Meta Platforms, Inc.
Pseudocode | Yes | We share the detailed pseudocode of GWalk in Algorithm 1 and demonstrate some further studies in Appendix H.
Open Source Code | Yes | The MQUAKE-REMASTERED datasets and utilities are available at huggingface.co/datasets/henryzhongsc/MQuAKE-Remastered and github.com/henryzhongsc/MQuAKE-Remastered, respectively.
Open Datasets | Yes | The MQUAKE-REMASTERED datasets and utilities are available at huggingface.co/datasets/henryzhongsc/MQuAKE-Remastered and github.com/henryzhongsc/MQuAKE-Remastered, respectively.
Dataset Splits | Yes | Datasets like MQUAKE-CF and MQUAKE-CF-3K are often tested under varying editing intensities based on the number of cases considered edited. This simulates different levels of deviation between the model's learned knowledge and the newly edited information. This approach is effective because strong knowledge editing methods should handle both large-scale updates and smaller, more localized edits, ensuring that the changes do not interfere with unrelated knowledge.
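The "editing intensity" setup described above can be illustrated with a short sketch. This is not code from the paper: `make_edit_batches` is a hypothetical helper that merely shows the idea of treating k of the N multi-hop cases as edited while holding out the rest as unedited controls.

```python
# Hedged sketch (not from the paper): simulating "k-edited" evaluation
# settings, where k cases are treated as edited and the remaining
# cases must stay unaffected by the edits.
import random

def make_edit_batches(cases, k, seed=0):
    """Split cases into an edited batch of size k and an unedited rest.

    `cases` is any list of case records; `k` is the editing intensity
    (e.g. 1, 100, or all cases in a 3K-case setting).
    """
    rng = random.Random(seed)
    shuffled = cases[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[:k], shuffled[k:]

edited, retained = make_edit_batches(list(range(3000)), k=100)
```

Sweeping k from a single edit up to the full dataset is one way to probe both localized and large-scale updates, matching the motivation quoted above.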
Hardware Specification | No | This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University (CWRU).
Software Dependencies | Yes | We opt to use lmsys/vicuna-7b-v1.5 (Zheng et al., 2023b), mistralai/Mistral-7B-Instruct-v0.2 (Jiang et al., 2023), and meta-llama/Meta-Llama-3-8B-Instruct (AI@Meta, 2024) as the choice of question-answering models, both for alignment with existing works (Zhong et al., 2023; Shi et al., 2024; Gu et al., 2024) as well as providing coverage of the most recent language models. For methods that require a text-embedding model as a retriever, we use facebook/contriever-msmarco (Izacard et al., 2022) for alignment with MeLLo (Zhong et al., 2023).
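Contriever-style retrievers such as facebook/contriever-msmarco typically produce a sentence embedding by mean-pooling token embeddings under the attention mask. The NumPy sketch below shows only that pooling step on dummy vectors; it is an illustrative assumption, not the actual model pipeline from the paper.

```python
# Hedged sketch: attention-masked mean pooling, the aggregation step
# commonly used by contriever-style retrievers (dummy data, no model).
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Masked mean over the sequence dimension.

    token_embeddings: (seq_len, dim) array of per-token vectors.
    attention_mask:   (seq_len,) array of 0/1 flags (1 = real token).
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

emb = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])
mask = np.array([1, 1, 0])  # third token is padding and is ignored
pooled = mean_pool(emb, mask)  # → [2.0, 3.0]
```

Masking before averaging keeps padding tokens from diluting the embedding, which matters when retrieval batches mix short and long questions.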
Experiment Setup | No | The paper does not explicitly detail hyperparameters or system-level training settings within the main text. It mentions different 'k-edited' settings for evaluation, but not training configuration or hyperparameter values.