Locate-then-edit for Multi-hop Factual Recall under Knowledge Editing
Authors: Zhuoran Zhang, Yongxiang Li, Zijian Kan, Keyuan Cheng, Lijie Hu, Di Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that IFMET significantly improves performance on multi-hop factual recall tasks, overcoming the limitations of previous locate-then-edit methods. |
| Researcher Affiliation | Academia | 1Peking University 2Provable Responsible AI and Data Analytics (PRADA) Lab 3South China University of Technology 4King Abdullah University of Science and Technology. Correspondence to: Lijie Hu <EMAIL>, Di Wang <EMAIL>. |
| Pseudocode | Yes | Due to space limitations, the flowchart of the algorithm and related implementation details are provided in Algorithm 1 and Appendix D.2. |
| Open Source Code | No | The paper does not provide an explicit statement about open-sourcing code or a link to a code repository. |
| Open Datasets | Yes | MQuAKE-3K (Zhong et al., 2023), a challenging and widely used dataset designed to evaluate models' ability to perform multi-hop fact recall with newly edited knowledge. |
| Dataset Splits | No | The paper describes how data is used in evaluation scenarios, in a few-shot setting, and with chain-of-thought prompting, but it does not specify explicit train/validation/test splits (percentages, counts, or references to predefined splits) needed to reproduce training or evaluation. It mentions sampling subsets for analysis, but not a general dataset split. |
| Hardware Specification | No | The paper mentions models like GPT-J-6B and LLaMA-2-7B and provides timing results for them, but it does not specify the underlying hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies, such as library names with version numbers (e.g., Python, PyTorch, CUDA versions), that would be needed to replicate the experiment. |
| Experiment Setup | Yes | In both the first and furtherance edits, our configuration for PMET adheres to the settings specified in (Li et al., 2024c). Initially, we set φ = 1 and 0 ≤ µ ≤ 1 to manage the retention of the model's original knowledge... After maximizing the probability of the target knowledge, we reduce φ to 0.1... Optimization is halted when D_KL < 0.01... we set λ = 6000. When optimizing, we limit the total optimization steps to 30 with a learning rate of 0.2... we adhered to the few-shot templates in Table 12, the Chain-of-Thought (CoT) templates in Table 10, and the procedures outlined in (Zhong et al., 2023). |
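The stopping rule quoted above (halt when D_KL < 0.01, cap at 30 steps, learning rate 0.2) can be sketched as a small optimization loop. This is a hypothetical stand-in, not the paper's implementation: it descends on a single Bernoulli parameter via a numeric gradient, where the paper optimizes hidden-state deltas in a transformer; only the halting threshold, step cap, and learning rate are taken from the paper.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence D_KL(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def optimize_edit(target_p=0.9, init_q=0.5, lr=0.2, max_steps=30, kl_tol=0.01):
    """Toy stand-in for the edit optimization: take gradient steps on a
    single parameter q, halting when the KL term drops below kl_tol
    (the paper stops at D_KL < 0.01) or after max_steps (the paper
    caps at 30 steps with learning rate 0.2)."""
    q = init_q
    for step in range(max_steps):
        kl = bernoulli_kl(target_p, q)
        if kl < kl_tol:          # early-stopping criterion from the paper
            return q, step, kl
        # Numeric gradient of the KL term w.r.t. q (central difference).
        eps = 1e-6
        grad = (bernoulli_kl(target_p, q + eps)
                - bernoulli_kl(target_p, q - eps)) / (2 * eps)
        q -= lr * grad
    return q, max_steps, bernoulli_kl(target_p, q)
```

With the defaults, the loop converges within a few steps: q moves from 0.5 toward the target 0.9 until the KL falls under the 0.01 threshold. The early-stop check runs before each update, mirroring the "halt when D_KL < 0.01" condition rather than checking only after the final step.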