ELDER: Enhancing Lifelong Model Editing with Mixture-of-LoRA

Authors: Jiaang Li, Quan Wang, Zhongnan Wang, Yongdong Zhang, Zhendong Mao

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on GPT-2 XL and LLaMA2-7B demonstrate that ELDER effectively edits models in the lifelong setting, outperforming eight baselines while exhibiting strong scalability and preserving LLMs' general abilities on downstream tasks.
Researcher Affiliation Academia 1) University of Science and Technology of China; 2) Beijing University of Posts and Telecommunications
Pseudocode Yes Algorithm 1: Inference with deferral mechanism.
Open Source Code No The paper does not contain an explicit statement about releasing code or a link to a code repository.
Open Datasets Yes Our experiments are conducted on two popular LLMs, i.e., GPT2-XL (Radford et al. 2019) and LLaMA2-7B (Touvron et al. 2023), with two widely used model editing datasets, ZsRE (Levy et al. 2017) and COUNTERFACT (Meng et al. 2022). ... Specifically, we employ a benchmark from (Gu et al. 2024), including eight diverse tasks: Reasoning on GSM8K (Cobbe et al. 2021), Natural Language Inference on RTE (Dagan, Glickman, and Magnini 2005), Open-domain QA on Natural Questions (Kwiatkowski et al. 2019), Closed-domain QA on BoolQ (Clark et al. 2019), Dialogue on MuTual (Cui et al. 2020), Summarization on SAMSum (Cui et al. 2020), Named Entity Recognition on CoNLL03 (Sang and De Meulder 2003), and Sentiment Analysis on SST2 (Socher et al. 2013).
Dataset Splits No The paper states: 'We adopt both datasets to the lifelong model editing setting by extracting a sequence of 1000 editing samples with their rephrasings for our main experiments, following the methodologies outlined in (Hartvigsen et al. 2024) and (Yu et al. 2024)'. This indicates existing methodologies were followed for data preparation but does not provide specific split percentages or counts for training, validation, and test sets within the main text.
Hardware Specification No The paper does not specify the exact hardware (e.g., GPU/CPU models, memory) used for running the experiments. It only mentions the base LLMs (GPT2-XL, LLaMA2-7B).
Software Dependencies No The paper mentions several techniques and models (e.g., LoRA, GPT2-XL, LLaMA2) and implies common deep-learning frameworks (e.g., PyTorch), but it does not specify version numbers for any software libraries or frameworks used in the implementation.
Experiment Setup Yes For our proposed ELDER across all settings, the rank of LoRAs is set to 8, and the number of layers that apply mixture-of-LoRA is set to 6. The number of LoRAs per layer is set to 4, k is set to 2, and ϵ is set to 12. λ is set to 1e-2. More details of training and hyperparameter tuning are available in the technical appendix.
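The reported hyperparameters (4 LoRA experts per layer, rank 8, top-k routing with k=2) can be illustrated with a minimal mixture-of-LoRA forward pass. This is a sketch under stated assumptions, not the authors' implementation: the router design, weight shapes, and initialization below are illustrative, and the deferral-mechanism details (ϵ, λ) from the paper are omitted.

```python
import numpy as np

# Minimal mixture-of-LoRA sketch using the paper's reported settings:
# 4 LoRA experts per layer, rank r = 8, top-k routing with k = 2.
# The router, dimensions, and initialization here are assumptions.
rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 8       # r = LoRA rank (paper: 8)
n_experts, k = 4, 2              # paper: 4 LoRAs per layer, k = 2

W = rng.normal(size=(d_out, d_in)) * 0.02           # frozen base weight
A = rng.normal(size=(n_experts, r, d_in)) * 0.02    # LoRA down-projections
B = np.zeros((n_experts, d_out, r))                 # LoRA up-projections (zero-init)
router = rng.normal(size=(n_experts, d_in)) * 0.02  # routing weights (assumption)

def forward(x):
    """Route x to the top-k LoRA experts and add their gated deltas."""
    logits = router @ x
    topk = np.argsort(logits)[-k:]                  # indices of the top-k experts
    gates = np.exp(logits[topk] - logits[topk].max())
    gates /= gates.sum()                            # softmax over selected experts
    delta = sum(g * (B[i] @ (A[i] @ x)) for g, i in zip(gates, topk))
    return W @ x + delta

x = rng.normal(size=d_in)
y = forward(x)
print(y.shape)  # (64,)
```

With the conventional zero-initialized up-projections, the LoRA delta starts at zero, so the edited layer initially reproduces the frozen base mapping; editing then only trains the selected experts and router.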