Uncovering Overfitting in Large Language Model Editing

Authors: Mengqi Zhang, Xiaotian Ye, Qiang Liu, Shu Wu, Pengjie Ren, Zhumin Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive experiments and analysis, we demonstrate that Editing Overfit is prevalent in current editing methods and that common overfitting mitigation strategies are ineffective in knowledge editing. To overcome this, inspired by LLMs' knowledge recall mechanisms, we propose a new plug-and-play strategy called Learn the Inference (LTI), which introduces a Multi-stage Inference Constraint module to guide edited models in recalling new knowledge similarly to how unedited LLMs leverage knowledge through in-context learning. Extensive experimental results across a wide range of tasks validate the effectiveness of LTI in mitigating Editing Overfit.
Researcher Affiliation | Academia | 1 Shandong University; 2 School of Computer Science, Beijing University of Posts and Telecommunications; 3 New Laboratory of Pattern Recognition (NLPR), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes methods using prose and mathematical formulas, for example L = λ·L_SRC + β·L_ODC + α·L_N, but it does not contain any explicitly labeled pseudocode or algorithm blocks. The computational steps are described within the regular text flow.
Open Source Code | No | The paper states: 'Our experiments build on the codebase implemented by Meng et al. (2022a;b).' This refers to a third-party codebase or prior work used as a foundation, not an explicit statement that the authors of this paper are releasing their own implementation code or providing a link to it.
Open Datasets | Yes | To further explore this issue, we introduce a new benchmark, EVOKE (EValuation of Editing Overfit in Knowledge Editing), along with fine-grained evaluation metrics. Through comprehensive experiments and analysis, we demonstrate that Editing Overfit is prevalent in current editing methods and that common overfitting mitigation strategies are ineffective in knowledge editing. ... Details of EVOKE construction can be found in Appendix C. ... The edit requests and recall task data in EVOKE are sourced from established benchmarks, including RIPPLEEDITS-POPULAR and COUNTERFACT, specifically leveraging a subset of their test splits. For the Multi-hop Reasoning task, we augment the existing dataset by incorporating newly constructed data. Specifically, building upon the COUNTERFACTPLUS dataset, which lacks original answers for multi-hop questions, we generate the missing answers using GPT-4o, following the methodology used in (Yao et al., 2023). ... Leveraging the locality data from COUNTERFACT, the Relation Specificity task probes the consistency of edited models' responses across prompts that share the same relation but differ in subject entities.
Dataset Splits | No | The paper mentions that 'EVOKE' is constructed using 'test splits' from existing datasets like RIPPLEEDITS-POPULAR and COUNTERFACT. It also describes augmenting data for the Multi-hop Reasoning task. However, it does not provide specific percentages, absolute counts, or detailed methodologies for how the full EVOKE benchmark or its individual tasks are split into training, validation, and test sets to ensure reproducibility for the experimental results.
Hardware Specification | No | The paper mentions the use of models like GPT-J (Wang & Komatsuzaki, 2021), GPT-2 XL (Radford et al., 2019), and Llama-2-7B (Touvron et al., 2023) but does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments or train these models.
Software Dependencies | No | Our experiments build on the codebase implemented by Meng et al. (2022a;b). ... All other baseline implementations, including hyperparameters, remain consistent with the setup of Meng et al. (2022a;b), and hyperparameters on Llama-2-7B remain consistent with Yao et al. (2023). While these statements imply the use of specific software frameworks, no version numbers for programming languages, libraries (e.g., PyTorch, TensorFlow), or other software components are provided.
Experiment Setup | Yes | Other Hyperparameter Settings: LTI's hyperparameters include coefficients for the three constraint losses. In practice, the coefficient λ for the Subject Representation Constraint is set to 0.0625, while the Output Distribution Constraint coefficient β is 0.0325. For the New Knowledge Constraint coefficient α, ROME-LTI uses values of 0.0625 for GPT-J, 0.15 for GPT-2 XL, and 0.35 for Llama-2-7B, whereas MEMIT-LTI employs 0.25 for GPT-J and 0.125 for GPT-2 XL and Llama-2-7B. ... In our experiments, we maintain a constant product of the hyperparameter clamp norm factor and the number of edited layers. When adjusting the number of layers, we keep the highest edited layer fixed. For example, when editing three layers, we select l = {6, 7, 8}.
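The loss formula quoted in the Pseudocode row and the coefficient values quoted in the Experiment Setup row can be restated together in a short sketch. This is not the paper's implementation: the constant and function names below are hypothetical, and the loss terms are placeholders; only the coefficient values and the form L = λ·L_SRC + β·L_ODC + α·L_N come from the text above.

```python
# Hedged sketch of LTI's objective L = λ·L_SRC + β·L_ODC + α·L_N, restating
# the coefficients reported in the paper. All names here are hypothetical.

# Fixed coefficients: Subject Representation Constraint (λ) and
# Output Distribution Constraint (β), shared across models.
LAMBDA_SRC = 0.0625
BETA_ODC = 0.0325

# New Knowledge Constraint coefficient (α), per editing method and base model.
ALPHA_NKC = {
    "ROME-LTI":  {"GPT-J": 0.0625, "GPT-2 XL": 0.15,  "Llama-2-7B": 0.35},
    "MEMIT-LTI": {"GPT-J": 0.25,   "GPT-2 XL": 0.125, "Llama-2-7B": 0.125},
}

def lti_total_loss(l_src: float, l_odc: float, l_n: float,
                   method: str, model: str) -> float:
    """Weighted sum of the three constraint losses for a given method/model."""
    alpha = ALPHA_NKC[method][model]
    return LAMBDA_SRC * l_src + BETA_ODC * l_odc + alpha * l_n
```

The lookup table makes the one asymmetry in the reported settings easy to see: λ and β are fixed, while α varies by both editing method (ROME-LTI vs. MEMIT-LTI) and base model.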