Mitigating Heterogeneous Token Overfitting in LLM Knowledge Editing

Authors: Tianci Liu, Ruirui Li, Zihan Dong, Hui Liu, Xianfeng Tang, Qingyu Yin, Linjun Zhang, Haoyu Wang, Jing Gao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across four editing methods, two LLMs, and diverse scenarios demonstrate the effectiveness and versatility of our method.
Researcher Affiliation | Collaboration | (1) Purdue University, (2) Amazon, (3) Rutgers University, (4) University at Albany.
Pseudocode | Yes | Algorithm 1: OVERTONE Training Paradigm.
Open Source Code | No | The paper states that "All of our experiments are run on EasyEdit (Wang et al., 2024e)", a third-party framework used by the authors. It does not provide a link to, or an explicit statement about releasing, the source code for the proposed method, OVERTONE.
Open Datasets | Yes | Following Wang et al. (2023b) and Zhang et al. (2024c), we edit different kinds of knowledge: WikiData_recent, WikiData_counterfact (Cohen et al., 2024), WikiBio (Hartvigsen et al., 2024), and ZsRE (Yao et al., 2023). Besides these four popular benchmarks, we also explore the more complex MQuAKE (Zhong et al., 2023; Wang et al., 2024f).
Dataset Splits | No | The paper uses well-known benchmarks (ZsRE, WikiData_recent, WikiData_counterfact, WikiBio, and MQuAKE). While these benchmarks typically have predefined splits, the paper does not explicitly state the training/validation/test splits used, their percentages, or a specific split methodology in the main text or appendices.
Hardware Specification | No | The paper does not describe the hardware used to run its experiments, such as specific GPU or CPU models.
Software Dependencies | No | The paper mentions EasyEdit as its framework but does not list specific software dependencies (e.g., programming languages, libraries, or other tools) with version numbers.
Experiment Setup | Yes | FT-M on MQuAKE: layers to tune (20, 21, 22, 23, 24), learning rate 1e-3, others unchanged. LoRA on MQuAKE: rank 12, 50 iterations, others unchanged. MELO: initial radius for each code in the codebook set to 60 for LLaMA 2 and 30 for LLaMA 3. Generation: temperature 0.1; maximum length 30 for single-hop questions and 200 for multi-hop questions.
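The reported hyperparameters can be collected into configuration objects for a re-run. A minimal sketch follows; the dictionary keys and structure are illustrative assumptions, not EasyEdit's actual configuration schema, and only the numeric values come from the paper.

```python
# Hypothetical config layout; key names are assumptions, values are from the paper.
# FT-M on MQuAKE: tune layers 20-24 with learning rate 1e-3.
FT_M_CONFIG = {
    "layers_to_tune": [20, 21, 22, 23, 24],
    "learning_rate": 1e-3,
}

# LoRA on MQuAKE: rank 12, 50 iterations.
LORA_CONFIG = {
    "rank": 12,
    "num_iterations": 50,
}

# MELO: codebook radius depends on the base model; generation settings
# depend on the question type (single-hop vs. multi-hop).
MELO_CONFIG = {
    "initial_radius": {"llama-2": 60, "llama-3": 30},
    "generation": {
        "temperature": 0.1,
        "max_length": {"single_hop": 30, "multi_hop": 200},
    },
}
```

Keeping per-method settings in separate objects like this makes it easy to verify each value against the paper's appendix before launching a reproduction run.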