MQuAKE-Remastered: Multi-Hop Knowledge Editing Can Only Be Advanced with Reliable Evaluations
Authors: Shaochen Zhong, Yifan (Louie) Lu, Lize Shao, Bhargav Bhushanam, Xiaocong Du, Yixin Wan, Yucheng Shi, Daochen Zha, Yiwei Wang, Ninghao Liu, Kaixiong Zhou, Shuai Xu, Kai-Wei Chang, Louis Feng, Vipin Chaudhary, Xia Ben Hu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Additionally, we benchmarked almost all proposed MQUAKE-evaluated editing methods on our post-fix dataset, MQUAKE-REMASTERED. We observe that many methods try to overfit the original MQUAKE by exploiting some dataset idiosyncrasies of MQUAKE. We provide a guideline on how to approach such datasets faithfully and show that a simple, minimally invasive approach GWalk can offer beyond SOTA editing performance without such exploitation. The MQUAKE-REMASTERED datasets and utilities are available at huggingface.co/datasets/henryzhongsc/MQuAKE-Remastered and github.com/henryzhongsc/MQuAKE-Remastered, respectively. |
| Researcher Affiliation | Collaboration | Department of Computer Science, Rice University; School of Computing, University of Georgia; Department of Electrical and Computer Engineering, North Carolina State University; Department of Computer and Data Sciences, Case Western Reserve University; Department of Computer Science, University of California, Los Angeles; Meta Platforms, Inc. |
| Pseudocode | Yes | We share the detailed pseudocode of GWalk in Algorithm 1 and demonstrate some further studies in Appendix H. |
| Open Source Code | Yes | The MQUAKE-REMASTERED datasets and utilities are available at huggingface.co/datasets/henryzhongsc/MQuAKE-Remastered and github.com/henryzhongsc/MQuAKE-Remastered, respectively. |
| Open Datasets | Yes | The MQUAKE-REMASTERED datasets and utilities are available at huggingface.co/datasets/henryzhongsc/MQuAKE-Remastered and github.com/henryzhongsc/MQuAKE-Remastered, respectively. |
| Dataset Splits | Yes | Datasets like MQUAKE-CF and MQUAKE-CF-3K are often tested under varying editing intensities based on the number of cases considered edited. This simulates different levels of deviation between the model's learned knowledge and the newly edited information. This approach is effective because strong knowledge editing methods should handle both large-scale updates and smaller, more localized edits, ensuring that the changes do not interfere with unrelated knowledge. |
| Hardware Specification | No | This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University (CWRU). |
| Software Dependencies | Yes | We opt to use lmsys/vicuna-7b-v1.5 (Zheng et al., 2023b), mistralai/Mistral-7B-Instruct-v0.2 (Jiang et al., 2023), and meta-llama/Meta-Llama-3-8B-Instruct (AI@Meta, 2024) as the choice of question-answering models, both for alignment with existing works (Zhong et al., 2023; Shi et al., 2024; Gu et al., 2024) as well as for providing coverage of the most recent language models. For methods that require a text-embedding model as a retriever, we use facebook/contriever-msmarco (Izacard et al., 2022) for alignment with MeLLo (Zhong et al., 2023). |
| Experiment Setup | No | The paper does not explicitly detail hyperparameters or system-level training settings within the main text. It mentions different 'k-edited' settings for evaluation, but not training configuration or hyperparameter values. |
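The "k-edited" evaluation settings mentioned in the Dataset Splits row can be illustrated with a minimal sketch. The helper below is hypothetical (the function name `make_edit_batches` and its exact grouping scheme are not from the paper's codebase); it only shows the common MQuAKE-style protocol of partitioning a case pool into batches of k edited cases to simulate different editing intensities.

```python
import random

def make_edit_batches(cases, k_values, seed=0):
    """Group evaluation cases into batches of k edits each, one grouping
    per editing intensity k. Illustrative helper, not the paper's code."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)  # fix a random order so every k uses the same pool
    batches = {}
    for k in k_values:
        # Consecutive slices of size k: each slice is one "k-edited" run.
        batches[k] = [shuffled[i:i + k] for i in range(0, len(shuffled), k)]
    return batches

# Example: a 3000-case pool evaluated at intensities of 1, 100, and 3000 edits.
cases = list(range(3000))
batches = make_edit_batches(cases, k_values=[1, 100, 3000])
print(len(batches[1]), len(batches[100]), len(batches[3000]))  # → 3000 30 1
```

Under this protocol, k=1 probes localized edits while k equal to the full pool probes large-scale updates, matching the row's point that strong editors should handle both extremes.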