AnyEdit: Edit Any Knowledge Encoded in Language Models

Authors: Houcheng Jiang, Junfeng Fang, Ningyu Zhang, Mingyang Wan, Guojun Ma, Xiang Wang, Xiangnan He, Tat-Seng Chua

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To validate AnyEdit, we conducted a comprehensive evaluation comparing it with leading model editing methods (e.g., MEMIT (Meng et al., 2023), AlphaEdit (Fang et al., 2025), and UnKE (Deng et al., 2025)) on prevailing LLMs such as Llama3-8B-Instruct and Qwen2.5-7B-Chat (Yang et al., 2024). Beyond standard benchmark datasets (e.g., CounterFact (Meng et al., 2022) and ZsRE (Meng et al., 2023)) that represent knowledge as triples, we curate EditEverything, a new benchmark for long-form, diverse-formatted knowledge. As shown in Figure 1 (f), this dataset includes entries up to 458 tokens, over twice the length of the longest sequences in existing benchmarks (e.g., 156 tokens in AKEW (Wu et al., 2024)), and spans multiple domains, including mathematics, news, code, and biochemistry. Results on EditEverything and standard benchmarks demonstrate that AnyEdit surpasses all baselines, achieving a 21.5% average improvement in editing accuracy with comparable computational overhead.
Researcher Affiliation Collaboration Houcheng Jiang 1 Junfeng Fang 2 * Ningyu Zhang 3 Mingyang Wan 4 Guojun Ma 4 Xiang Wang 1 Xiangnan He 1 * Tat-Seng Chua 2 1MoE Key Lab of BIPC, University of Science and Technology of China 2National University of Singapore 3Zhejiang University 4Douyin Co., Ltd.
Pseudocode No The paper describes the 'Implementation Details' in Section 4.2 as a four-step process. However, it does not present these steps in a structured pseudocode or algorithm block format.
Open Source Code Yes Our code is available at: https://github.com/jianghoucheng/AnyEdit.
Open Datasets Yes To evaluate the performance of unstructured long-form knowledge editing, we employed existing benchmarks, including UnKEBench (Deng et al., 2025) and AKEW (Wu et al., 2024).
Dataset Splits No The paper mentions using specific datasets like UnKEBench and AKEW, and constructing EditEverything, but does not provide explicit details about training, validation, or test splits (e.g., percentages, sample counts, or specific methodologies) for any of these datasets in the main text or appendix.
Hardware Specification Yes All experiments were conducted on a single A100 GPU (80GB).
Software Dependencies No The paper does not provide specific version numbers for key software components such as Python, PyTorch, or CUDA. It mentions using 'the all-MiniLM-L6-v2 model', but this refers to a model, not a software library with a version.
Experiment Setup Yes We select layers 4 to 8 for editing and apply a clamp norm factor of 4. The fact token is defined as the last token. The optimization process involves 25 gradient steps for updating the key-value representations, with a learning rate of 0.5. The loss is applied at layer 31, and we use a weight decay of 0.001. To maintain distributional consistency, we introduce a Kullback-Leibler (KL) regularization term with a factor of 0.0625. Furthermore, we set the update-weight hyperparameter λ to 15,000, estimated using 100,000 samples from the Wikipedia dataset with a data type of float32. The module configurations follow MEMIT, where edits are applied to the MLP down-projection layers of the selected transformer blocks. Additionally, for chunked editing, we set a chunk size of 40 tokens with no overlap.
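The hyperparameters above can be collected into a MEMIT-style configuration, which also makes the chunked-editing step concrete. This is a minimal sketch, not the authors' released code: the names `EDIT_CONFIG`, `chunk_tokens`, and the `rewrite_module_tmp` path template are illustrative assumptions, though the values are taken directly from the setup described in the paper.

```python
# Illustrative config reflecting the reported AnyEdit hyperparameters.
# Key names mimic common MEMIT-style configs; they are assumptions, not the official API.
EDIT_CONFIG = {
    "edit_layers": list(range(4, 9)),      # layers 4-8
    "clamp_norm_factor": 4,
    "fact_token": "last",                  # last token as the fact token
    "v_num_grad_steps": 25,                # gradient steps for key-value updates
    "v_lr": 0.5,
    "loss_layer": 31,
    "v_weight_decay": 0.001,
    "kl_factor": 0.0625,                   # KL regularization weight
    "mom2_update_weight": 15000,           # hyperparameter λ
    "mom2_n_samples": 100000,              # Wikipedia samples for covariance stats
    "mom2_dtype": "float32",
    # MEMIT-style target: MLP down-projection of each edited block (path is illustrative)
    "rewrite_module_tmp": "model.layers.{}.mlp.down_proj",
    "chunk_size": 40,                      # tokens per chunk for chunked editing
    "chunk_overlap": 0,
}

def chunk_tokens(token_ids, size=40, overlap=0):
    """Split a token sequence into fixed-size chunks (no overlap by default)."""
    step = size - overlap
    return [token_ids[i:i + size] for i in range(0, len(token_ids), step)]
```

With a chunk size of 40 and no overlap, a 100-token target sequence would be edited as three chunks of 40, 40, and 20 tokens.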