Editing Memories Through Few Targeted Neurons

Authors: Wei Zhou, Wei Wei, Guibang Cao, Fei Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments have demonstrated the superior editing performance achieved by our proposed method. Our experiments are conducted on GPT2-XL (1.5B). Our baseline methods mainly adopt several types of editing parameters directly, including improved Constrained Fine-Tuning (FT+W) (Zhu et al. 2020), the meta-learning method MEND (Mitchell et al. 2021), and the locate-and-optimize methods ROME (Meng et al. 2022a), MEMIT (Meng et al. 2022b), and PMET (Li et al. 2024). For datasets, we performed counterfactual edit experiments on the COUNTERFACT dataset (Meng et al. 2022a).
Researcher Affiliation | Collaboration | Wei Zhou (1,2), Wei Wei (1,2)*, Guibang Cao (3), Fei Wang (4). (1) Cognitive Computing and Intelligent Information Processing (CCIIP) Laboratory, School of Computer Science and Technology, Huazhong University of Science and Technology, China; (2) Joint Laboratory of HUST and Ping An Property & Casualty Research (HPL), China; (3) Ping An Property & Casualty Insurance Company of China, Ltd.; (4) Institute of Computing Technology, Chinese Academy of Sciences
Pseudocode | No | The paper describes its methodology using mathematical equations and textual descriptions (e.g., Equations 5-13) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/lifeforzw/TNF-DA
Open Datasets | Yes | For datasets, we performed counterfactual edit experiments on the COUNTERFACT dataset (Meng et al. 2022a). We conduct experiments on GPT2-small and GPT2-medium (Radford et al. 2019) (without fine-tuning) using 1000 knowledge descriptions they know for sure.
Dataset Splits | No | For datasets, we performed counterfactual edit experiments on the COUNTERFACT dataset (Meng et al. 2022a). More details about datasets can be found in Appendix B. ... And we finally get 2K pieces of counterfactual edits for GPT2-XL. More experiment details are shown in Appendix C. The paper describes using a dataset for counterfactual edits but does not specify explicit training/validation/test splits as percentages or exact sample counts in the main body; it refers to appendices for more details, which are not provided in the main text.
Hardware Specification | No | Our experiments are conducted on GPT2-XL (1.5B). The paper names the model used (GPT2-XL, 1.5B parameters) but does not provide specific hardware details such as GPU models, CPU specifications, or cloud computing instances used for the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., library names with versions, programming-language versions, or solver versions).
Experiment Setup | No | The cost of fine-tuning the entire model is huge, so only the parameters of the knowledge neurons θ_kn are selected for modification, with the rest θ \ θ_kn frozen. However, as the model scale increases, the time for searching targeted neurons increases. To improve efficiency, we heuristically narrow the search space by analyzing the degree of noise influence (DNI): H_s = { h_i^l ∈ H | DNI_i^l ∈ P_γ(DNI) }, where P_γ(DNI) denotes the set of DNI values in the top γ%. ... So finally the training loss is obtained by L = L_gen + α · L_loc, where α is a hyperparameter adjusting the ratio of L_loc in the final loss L. ... We fine-tune the parameters of the targeted neurons with only the given edit (p(s, r), o) and (x_j p(s, r), o), and set a sufficient number of iterations to ensure efficiency. While the paper describes the loss function and mentions that α is a hyperparameter and that a "sufficient number of iterations" is used, it lacks concrete values for hyperparameters such as the learning rate, batch size, or epoch counts for its main experiments.
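The selection-and-loss scheme quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function and variable names (`select_targeted_neurons`, `dni_scores`, `gamma`, `alpha`, `l_gen`, `l_loc`) are assumptions, and the DNI scores themselves would come from the paper's noise-influence analysis, which is not reproduced here.

```python
def select_targeted_neurons(dni_scores, gamma):
    """Return indices of neurons whose DNI score falls in the top gamma percent,
    i.e. the set H_s = { h_i | DNI_i in P_gamma(DNI) }."""
    ranked = sorted(range(len(dni_scores)),
                    key=lambda i: dni_scores[i], reverse=True)
    k = max(1, int(len(dni_scores) * gamma / 100))  # size of the top-gamma% set
    return set(ranked[:k])

def combined_loss(l_gen, l_loc, alpha):
    """L = L_gen + alpha * L_loc; alpha weights the locality term
    against the edit-success term."""
    return l_gen + alpha * l_loc

# Example: with 4 neurons and gamma = 50%, the two highest-DNI neurons are kept.
targeted = select_targeted_neurons([0.9, 0.1, 0.5, 0.7], gamma=50)
loss = combined_loss(l_gen=1.0, l_loc=2.0, alpha=0.5)
```

During fine-tuning, only parameters indexed by `targeted` would receive gradient updates (e.g., by setting `requires_grad = False` on all others in a PyTorch model), matching the paper's description of freezing θ \ θ_kn.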