WikiBigEdit: Understanding the Limits of Lifelong Knowledge Editing in LLMs
Authors: Lukas Thede, Karsten Roth, Matthias Bethge, Zeynep Akata, Thomas Hartvigsen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using WikiBigEdit, we thoroughly analyze the capability of existing lifelong knowledge editing methods to conduct lifelong edits at scale; contrasted against retrieval augmentation and continual finetuning to understand limits in relation to other established approaches. |
| Researcher Affiliation | Academia | ¹Tübingen AI Center, University of Tübingen, ²Helmholtz Munich, ³Munich Center for Machine Learning (MCML), ⁴Technical University of Munich, ⁵University of Virginia. |
| Pseudocode | No | The paper describes methods and pipelines but does not contain any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/ExplainableML/WikiBigEdit. |
| Open Datasets | Yes | We first introduce WikiBigEdit; a large-scale benchmark of real-world Wikidata edits, built to automatically extend lifelong for future-proof benchmarking. In its first instance, it includes over 500K question-answer pairs for knowledge editing alongside a comprehensive evaluation pipeline. |
| Dataset Splits | Yes | These updates are grouped into B sequential batches [U_1, U_2, ..., U_B], where each batch can encompass anything from a single edit to multiple. For each batch update b (also denoted as timestep in this work), the model f^{b-1}, trained on updates from prior batches U_{<b}, is further updated with the current batch U_b to produce f^b. ... SQA_locality and SQA_mhop are used for evaluation, while SQA_changed constitutes the respective fact-based training data. |
| Hardware Specification | Yes | All experiments are performed on a compute cluster equipped with Nvidia A100 and H100 GPUs, leveraging PyTorch (Paszke et al., 2019) and building on the EasyEdit codebase (Zhang et al., 2024). |
| Software Dependencies | No | The paper mentions PyTorch and the Annoy library but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Adapters are trained for 10 epochs on each timestep. ... Cosine learning rate scheduling with warmup is employed during training, with a fixed number of 10 epochs per batch. ... For the main experiments, k = 2 was chosen to enable effective multi-hop reasoning while keeping the context length manageable. ... After training of each timestep, current adapter weights are simply merged into preceding adapter weights using an interpolation weight of 0.25. |
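The Dataset Splits row quotes the paper's sequential update protocol: starting from model f^0, each timestep b applies batch U_b to the model produced by all prior batches. A minimal sketch of that loop, with the model reduced to a hypothetical question-to-answer dict and `apply_edit` standing in for whichever editing method is under evaluation:

```python
# Sketch of the lifelong editing protocol: f^b = Edit(f^{b-1}, U_b).
# The "model" here is a plain dict for illustration only; in the paper,
# f is an LLM and the edit operator is a knowledge-editing method.

def apply_edit(model: dict, edit: tuple) -> dict:
    """Apply one (question, answer) edit; stand-in for a real editor."""
    question, answer = edit
    updated = dict(model)  # keep f^{b-1} intact, return the edited copy
    updated[question] = answer
    return updated

def lifelong_edit(model: dict, batches: list) -> dict:
    """Fold B sequential update batches [U_1, ..., U_B] into the model."""
    for batch in batches:      # batch b = U_b, applied at timestep b
        for edit in batch:     # a batch may hold one edit or many
            model = apply_edit(model, edit)
    return model

# Toy usage: two batches of real-world fact updates, where the second
# batch revises a fact edited in the first.
f0 = {"Capital of Australia?": "Canberra"}
batches = [
    [("PM of the UK?", "Rishi Sunak")],
    [("PM of the UK?", "Keir Starmer"),
     ("Host of the 2028 Olympics?", "Los Angeles")],
]
fB = lifelong_edit(f0, batches)
```

Later edits overwrite earlier ones for the same question, which mirrors why the benchmark evaluates retention after every timestep rather than only at the end.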
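The Experiment Setup row states that, after each timestep, current adapter weights are merged into the preceding adapter weights with an interpolation weight of 0.25. A hedged sketch of that step, assuming the 0.25 weight applies to the *current* adapter (the paper's phrasing leaves the direction ambiguous) and using flat per-parameter lists in place of tensors:

```python
# Sketch of per-timestep adapter merging by linear interpolation.
# ASSUMPTION: alpha = 0.25 weights the current adapter, so
#   merged = 0.75 * previous + 0.25 * current
# (the paper does not state which side receives the 0.25 weight).

def merge_adapters(previous: dict, current: dict, alpha: float = 0.25) -> dict:
    """Elementwise merge: (1 - alpha) * previous + alpha * current."""
    assert previous.keys() == current.keys(), "adapters must share parameter names"
    return {
        name: [(1 - alpha) * p + alpha * c
               for p, c in zip(previous[name], current[name])]
        for name in previous
    }

# Toy usage with two hypothetical adapter parameters.
prev = {"lora_A": [1.0, 2.0], "lora_B": [0.0, 0.0]}
curr = {"lora_A": [3.0, 6.0], "lora_B": [4.0, 8.0]}
merged = merge_adapters(prev, curr)
# merged["lora_A"] -> [1.5, 3.0], merged["lora_B"] -> [1.0, 2.0]
```

Keeping most of the mass (0.75) on the accumulated adapter is what lets earlier edits survive many timesteps while each new batch still shifts the weights.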