Spurious Forgetting in Continual Learning of Language Models

Authors: Junhao Zheng, Xidi Cai, Shengjie Qiu, Qianli Ma

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental This study first explores the concept of spurious forgetting, proposing that such performance drops often reflect a decline in task alignment rather than knowledge loss. Through controlled experiments with a synthesized dataset, we investigate the dynamics of model performance during the initial training phases of new tasks, discovering that early optimization steps can disrupt previously established task alignments. Our theoretical analysis connects these shifts to orthogonal updates in model weights, providing a robust framework for understanding this behavior. Ultimately, we introduce a Freezing strategy that fixes the bottom layers of the model, leading to substantial improvements in four continual learning scenarios.
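The quoted Freezing strategy fixes the bottom layers of the model during continual fine-tuning. A minimal PyTorch sketch of the idea, assuming a generic stack of layers (the layer count and toy `nn.Linear` stack below are illustrative, not from the paper):

```python
# Sketch of the "Freezing" strategy: disable gradient updates for the
# bottom layers before fine-tuning on a new task, so early optimization
# steps cannot disrupt previously established task alignment.
import torch.nn as nn

def freeze_bottom_layers(layers, num_frozen):
    """Disable gradients for the first `num_frozen` layers."""
    for layer in layers[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False

# Toy stand-in for a stack of transformer layers (illustrative only).
model_layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])
freeze_bottom_layers(model_layers, num_frozen=3)

trainable = [any(p.requires_grad for p in l.parameters()) for l in model_layers]
print(trainable)  # bottom 3 frozen, top 3 still trainable
```

With a real transformer, the same loop would run over something like `model.transformer.h[:num_frozen]`; only parameters with `requires_grad=True` are then passed to the optimizer.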
Researcher Affiliation Academia Junhao Zheng, Xidi Cai, Shengjie Qiu, Qianli Ma School of Computer Science and Engineering, South China University of Technology
Pseudocode No The paper describes methods and strategies in paragraph text without structured pseudocode or algorithm blocks.
Open Source Code Yes The source code is publicly available at https://github.com/zzz47zzz/spurious-forgetting and https://github.com/qianlima-lab/spurious-forgetting
Open Datasets Yes To promote reproducibility, all code, scripts, and the synthetic Biography dataset will be made publicly available. Construction of the Biography Dataset. The Biography dataset consists of 200,000 synthetic individuals, each characterized by six attributes: birthday, birth city, university attended, major, company name, and company city. This dataset is divided into two subsets: pretraining data and finetuning data. The pretraining data comprises statements describing each individual's attributes. For instance, "Curtis Chase Emley recognizes his birth anniversary on May 28, 1952." The finetuning data consists of QA pairs designed for knowledge extraction, such as "What is the birth date of Curtis Chase Emley?\nAnswer: May 28, 1952." Unless otherwise stated, we calculate the exact match accuracy for the dataset. Further details and examples are provided in Appendix B.
Dataset Splits Yes The Biography dataset consists of 200,000 synthetic individuals... This dataset is divided into two subsets: pretraining data and finetuning data. The pretraining data comprises statements describing each individual's attributes... The finetuning data consists of QA pairs... Initially, the model is pretrained on 100,000 individuals to establish a robust knowledge foundation. Following this, we fine-tune the model on QA data from the same individuals (denoted as Task 0). We then introduce a new task (denoted as Task 1) that includes an additional 20,000 individuals unfamiliar to the model. Specifically, for any checkpoint during pretraining, Task 0 and Task 1, we fine-tune the model on half of the data from Task 0 for one epoch and evaluate it on the remaining half.
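The probing protocol quoted above (fine-tune on half of the Task 0 QA pairs, evaluate exact-match accuracy on the held-out half) can be sketched as follows; function and variable names are hypothetical, not from the released code:

```python
# Sketch of the probing split and the exact-match metric mentioned in
# the paper: shuffle Task 0 QA pairs, fine-tune on one half, and score
# exact-match accuracy on the other half.
import random

def split_probe_data(qa_pairs, seed=0):
    """Shuffle and split QA pairs 50/50 into probe-train and probe-eval."""
    rng = random.Random(seed)
    pairs = list(qa_pairs)
    rng.shuffle(pairs)
    mid = len(pairs) // 2
    return pairs[:mid], pairs[mid:]

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

qa = [(f"Q{i}", f"A{i}") for i in range(10)]
train_half, eval_half = split_probe_data(qa)
print(len(train_half), len(eval_half))  # 5 5
```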
Hardware Specification Yes The pre-training experiments are executed on an NVIDIA A800 80GB GPU. The fine-tuning experiments are executed on NVIDIA RTX 3090 GPUs. All experiments are conducted on eight A100 GPUs.
Software Dependencies No All experiments are conducted using PyTorch. This mentions a software package, but no specific version number is provided for PyTorch or any other dependency.
Experiment Setup Yes For pre-training, we employed a conventional set of optimization parameters: the AdamW optimizer with a weight decay of 0.1, ϵ = 10⁻⁶, an initial learning rate of 0.001, a 1000-step linear warmup, and cosine learning rate decay (from 0.001 decreasing to 0.0001). There are a total of 80,000 training steps in the pre-training stage and the batch size is set to 96. All parameters of the language model are updated during the fine-tuning stage. We employ the AdamW optimizer with a weight decay of 0.01, ϵ = 10⁻⁶, an initial learning rate of 5×10⁻⁶, and cosine learning rate decay (from 5×10⁻⁶ to 4.5×10⁻⁶). There are 62,500 training steps in the finetuning stage and the batch size is set to 48.
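The pre-training schedule quoted above (1000-step linear warmup to 1e-3, then cosine decay to 1e-4 over 80,000 total steps) can be reconstructed as a pure schedule function. This is an illustrative sketch of one standard warmup-plus-cosine formulation, not the authors' exact implementation:

```python
# Illustrative reconstruction of the pre-training learning-rate schedule:
# linear warmup for 1000 steps up to 1e-3, then cosine decay down to 1e-4
# by step 80,000.
import math

def lr_at_step(step, warmup=1000, total=80_000, lr_max=1e-3, lr_min=1e-4):
    if step < warmup:
        return lr_max * step / warmup  # linear warmup from 0 to lr_max
    # cosine factor goes from 1 (at end of warmup) to 0 (at final step)
    progress = (step - warmup) / (total - warmup)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return lr_min + (lr_max - lr_min) * cosine

print(lr_at_step(500))     # halfway through warmup
print(lr_at_step(1000))    # peak learning rate
print(lr_at_step(80_000))  # final learning rate
```

The fine-tuning stage would use the same shape with `lr_max=5e-6` and `lr_min=4.5e-6` over 62,500 steps.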