An Empirical Investigation of the Role of Pre-training in Lifelong Learning
Authors: Sanket Vaibhav Mehta, Darshan Patil, Sarath Chandar, Emma Strubell
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate existing methods in the context of large, pre-trained models and evaluate their performance on a variety of text and image classification tasks, including a large-scale study using a novel data set of 15 diverse NLP tasks. Across all settings, we observe that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially compared to randomly initialized models. We then further investigate why pre-training alleviates forgetting in this setting. We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima. Based on this insight, we propose jointly optimizing for current task loss and loss basin sharpness to explicitly encourage wider basins during sequential fine-tuning. We show that this optimization approach outperforms several state-of-the-art task-sequential continual learning algorithms across multiple settings, occasionally even without retaining a memory that scales in size with the number of tasks. |
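The abstract's proposal — jointly optimizing current task loss and loss basin sharpness — follows the sharpness-aware minimization (SAM) recipe: ascend within a small ball of radius ρ to a nearby worst-case point, then descend using the gradient computed there. A minimal, dependency-free sketch on a toy quadratic (all names and the analytic-gradient setup are illustrative, not the authors' code):

```python
# SAM-style update sketch: perturb weights toward higher loss within an
# L2 ball of radius rho, then apply the gradient taken at that point.
# Toy anisotropic quadratic with analytic gradients; illustrative only.
import math

def toy_loss(w):
    # Sharp along w[0], flat along w[1].
    return 10.0 * w[0] ** 2 + 0.1 * w[1] ** 2

def toy_grad(w):
    return [20.0 * w[0], 0.2 * w[1]]

def sam_step(w, lr=0.05, rho=0.02):
    g = toy_grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # Ascend to the approximate worst-case point in the rho-ball ...
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # ... then descend using the gradient evaluated there.
    g_adv = toy_grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [1.0, 1.0]
for _ in range(200):
    w = sam_step(w)
```

Because the descent direction is taken at the perturbed point, minima that are sharp in some direction are penalized, biasing the trajectory toward the wider basins the paper associates with reduced forgetting.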
| Researcher Affiliation | Academia | Sanket Vaibhav Mehta EMAIL School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213, USA Darshan Patil EMAIL Mila Quebec AI Institute Université de Montréal Montreal, QC H3T 1J4, Canada Sarath Chandar EMAIL Mila Quebec AI Institute Canada CIFAR AI Chair École Polytechnique de Montréal Montreal, QC H3T 1J4, Canada Emma Strubell EMAIL School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213, USA |
| Pseudocode | No | The paper describes methods and procedures in paragraph form and through mathematical equations, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | 2. Code is available at https://github.com/sanketvmehta/lifelong-learning-pretraining-and-sam |
| Open Datasets | Yes | We perform extensive experiments on widely adopted task-incremental learning benchmarks (Chaudhry et al., 2019; Ebrahimi et al., 2020; Wang et al., 2020) across both CV and NLP domains. 5-dataset-CV consists of five diverse 10-way image classification tasks: CIFAR-10 (Krizhevsky and Hinton, 2009), MNIST (LeCun, 1998), Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), and notMNIST (Bulatov, 2011). Split Yahoo QA consists of five homogeneous 2-way classification tasks and is built from a 10-way topic classification data set (Yahoo QA; Zhang et al., 2015). 15-dataset-NLP is a novel suite of diverse tasks for lifelong learning. It consists of fifteen text classification tasks covering a broad range of domains and data sources. We design our benchmark from existing tasks... |
| Dataset Splits | Yes | Table 1: 5-dataset-CV statistics. |Train|, |Dev|, |Test| denotes the number of examples in the train, dev, and test splits respectively. Split CIFAR-50 [...] Each task contains 5,000/1,000 (train/test) examples. Split CIFAR-100 splits the CIFAR-100 data set into 20 disjoint 5-way classification tasks, with each task containing 2,500/500 (train/test) examples. Split Yahoo QA [...] Each task includes around 279k/12k (train/test) examples. 5-dataset-NLP [...] we have 115k/7.6k (train/test) examples per task. Table 2 details the evaluation metrics and train/dev/test split sizes for each task. |
| Hardware Specification | No | We would like to acknowledge CMU Workhorse, TIR group, and Compute Canada for providing compute resources for this work. |
| Software Dependencies | No | The paper mentions 'Hugging Face' for the default implementation, 'scipy' for the L-BFGS-B algorithm, 'Adam' as an optimizer, and 'pytorch-hessian-eigenthings', but does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | Appendix A. Implementation Details A.1 CV Experiments: For all vision experiments, we use the full ResNet-18 (He et al., 2016) architecture, with the final linear layer replaced... We used an SGD optimizer with the learning rate set to 0.01... The batch size was set to 10 for the Split CIFAR-50 and Split CIFAR-100 experiments and 64 for the 5-dataset-CV experiments. The memory per class for ER was set to 1, and the λ parameter for EWC was also set to 1. For Stable SGD, we performed a hyperparameter sweep over the parameters specified in the original paper... For Mode Connectivity SGD... we used an initial learning rate of 0.1, a learning rate decay of 0.8, a momentum of 0.8, a dropout of 0.1, a batch size of 10... A.2 NLP Experiments: We use Adam as our optimizer, set dropout to 0.1, the base learning rate to 2e-5, batch size to 32, and the maximum total input sequence length after tokenization to 128. For EWC, we set the regularization strength λ to 100... for ER... the memory per class per task is set to 1. For SAM, we set ρ = 0.02 for all models... For Split Yahoo QA we set ρ = 0.001. |
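For quick reference, the hyperparameters reported in Appendix A can be collected into config dictionaries. The values below are transcribed from the quoted setup; the key names and dict layout are our own convenience, not the authors' code.

```python
# Hyperparameters as reported in Appendix A of the paper (values only;
# key names are illustrative, not taken from the released codebase).
CV_CONFIG = {
    "architecture": "ResNet-18",
    "optimizer": "SGD",
    "learning_rate": 0.01,
    "batch_size_split_cifar": 10,   # Split CIFAR-50 / Split CIFAR-100
    "batch_size_5_dataset_cv": 64,  # 5-dataset-CV
    "er_memory_per_class": 1,
    "ewc_lambda": 1,
}

NLP_CONFIG = {
    "optimizer": "Adam",
    "dropout": 0.1,
    "learning_rate": 2e-5,
    "batch_size": 32,
    "max_seq_length": 128,
    "ewc_lambda": 100,
    "er_memory_per_class_per_task": 1,
    "sam_rho_default": 0.02,
    "sam_rho_split_yahoo_qa": 0.001,
}
```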