Spurious Forgetting in Continual Learning of Language Models

Authors: Junhao Zheng, Xidi Cai, Shengjie Qiu, Qianli Ma

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental This study first explores the concept of spurious forgetting, proposing that such performance drops often reflect a decline in task alignment rather than knowledge loss. Through controlled experiments with a synthesized dataset, we investigate the dynamics of model performance during the initial training phases of new tasks, discovering that early optimization steps can disrupt previously established task alignments. Our theoretical analysis connects these shifts to orthogonal updates in model weights, providing a robust framework for understanding this behavior. Ultimately, we introduce a Freezing strategy that fixes the bottom layers of the model, leading to substantial improvements in four continual learning scenarios.
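The quoted Freezing strategy fixes the bottom layers of the model during continual fine-tuning. A minimal PyTorch sketch of the idea, assuming a generic stack of layers (the layer count and toy `nn.Linear` stack below are illustrative, not from the paper):

```python
# Sketch of the "Freezing" strategy: disable gradient updates for the
# bottom layers before fine-tuning on a new task, so early optimization
# steps cannot disrupt previously established task alignment.
import torch.nn as nn

def freeze_bottom_layers(layers, num_frozen):
    """Disable gradients for the first `num_frozen` layers."""
    for layer in layers[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False

# Toy stand-in for a stack of transformer layers (illustrative only).
model_layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])
freeze_bottom_layers(model_layers, num_frozen=3)

trainable = [any(p.requires_grad for p in l.parameters()) for l in model_layers]
print(trainable)  # bottom 3 frozen, top 3 still trainable
```

With a real transformer, the same loop would run over something like `model.transformer.h[:num_frozen]`; only parameters with `requires_grad=True` are then passed to the optimizer.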
Researcher Affiliation Academia Junhao Zheng, Xidi Cai, Shengjie Qiu, Qianli Ma School of Computer Science and Engineering, South China University of Technology
Pseudocode No The paper describes methods and strategies in paragraph text without structured pseudocode or algorithm blocks.
Open Source Code Yes The source code is publicly available at https://github.com/zzz47zzz/spurious-forgetting and https://github.com/qianlima-lab/spurious-forgetting
Open Datasets Yes To promote reproducibility, all code, scripts, and the synthetic Biography dataset will be made publicly available. Construction of the Biography Dataset. The Biography dataset consists of 200,000 synthetic individuals, each characterized by six attributes: birthday, birth city, university attended, major, company name, and company city. This dataset is divided into two subsets: pretraining data and finetuning data. The pretraining data comprises statements describing each individual's attributes. For instance, "Curtis Chase Emley recognizes his birth anniversary on May 28, 1952." The finetuning data consists of QA pairs designed for knowledge extraction, such as "What is the birth date of Curtis Chase Emley?\nAnswer: May 28, 1952." Unless otherwise stated, we calculate the exact match accuracy for the dataset. Further details and examples are provided in Appendix B.
Dataset Splits Yes The Biography dataset consists of 200,000 synthetic individuals... This dataset is divided into two subsets: pretraining data and finetuning data. The pretraining data comprises statements describing each individual's attributes... The finetuning data consists of QA pairs... Initially, the model is pretrained on 100,000 individuals to establish a robust knowledge foundation. Following this, we fine-tune the model on QA data from the same individuals (denoted as Task 0). We then introduce a new task (denoted as Task 1) that includes an additional 20,000 individuals unfamiliar to the model. Specifically, for any checkpoint during pretraining, Task 0 and Task 1, we fine-tune the model on half of the data from Task 0 for one epoch and evaluate it on the remaining half.
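The probing protocol quoted above (fine-tune on half of the Task 0 QA pairs, evaluate exact-match accuracy on the held-out half) can be sketched as follows; function and variable names are hypothetical, not from the released code:

```python
# Sketch of the probing split and the exact-match metric mentioned in
# the paper: shuffle Task 0 QA pairs, fine-tune on one half, and score
# exact-match accuracy on the other half.
import random

def split_probe_data(qa_pairs, seed=0):
    """Shuffle and split QA pairs 50/50 into probe-train and probe-eval."""
    rng = random.Random(seed)
    pairs = list(qa_pairs)
    rng.shuffle(pairs)
    mid = len(pairs) // 2
    return pairs[:mid], pairs[mid:]

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

qa = [(f"Q{i}", f"A{i}") for i in range(10)]
train_half, eval_half = split_probe_data(qa)
print(len(train_half), len(eval_half))  # 5 5
```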
Hardware Specification Yes The pre-training experiments are executed on an NVIDIA A800 80GB GPU. The fine-tuning experiments are executed on NVIDIA RTX 3090 GPUs. All experiments are conducted on eight A100 GPUs.
Software Dependencies No All experiments are conducted using PyTorch. This mentions a software package, but no specific version number is provided for PyTorch or any other dependency.
Experiment Setup Yes For pre-training, we employed a conventional set of optimization parameters: the AdamW optimizer with a weight decay of 0.1, ϵ = 10⁻⁶, an initial learning rate of 0.001, a 1000-step linear warmup, and cosine learning rate decay (from 0.001 decreasing to 0.0001). There are a total of 80,000 training steps in the pre-training stage and the batch size is set to 96. All parameters of the language model are updated during the fine-tuning stage. We employ the AdamW optimizer with a weight decay of 0.01, ϵ = 10⁻⁶, an initial learning rate of 5×10⁻⁶, and cosine learning rate decay (from 5×10⁻⁶ to 4.5×10⁻⁶). There are 62,500 training steps in the finetuning stage and the batch size is set to 48.
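The pre-training schedule quoted above (1000-step linear warmup to 1e-3, then cosine decay to 1e-4 over 80,000 total steps) can be reconstructed as a pure schedule function. This is an illustrative sketch of one standard warmup-plus-cosine formulation, not the authors' exact implementation:

```python
# Illustrative reconstruction of the pre-training learning-rate schedule:
# linear warmup for 1000 steps up to 1e-3, then cosine decay down to 1e-4
# by step 80,000.
import math

def lr_at_step(step, warmup=1000, total=80_000, lr_max=1e-3, lr_min=1e-4):
    if step < warmup:
        return lr_max * step / warmup  # linear warmup from 0 to lr_max
    # cosine factor goes from 1 (at end of warmup) to 0 (at final step)
    progress = (step - warmup) / (total - warmup)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return lr_min + (lr_max - lr_min) * cosine

print(lr_at_step(500))     # halfway through warmup
print(lr_at_step(1000))    # peak learning rate
print(lr_at_step(80_000))  # final learning rate
```

The fine-tuning stage would use the same shape with `lr_max=5e-6` and `lr_min=4.5e-6` over 62,500 steps.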