An Empirical Investigation of the Role of Pre-training in Lifelong Learning
Authors: Sanket Vaibhav Mehta, Darshan Patil, Sarath Chandar, Emma Strubell
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate existing methods in the context of large, pre-trained models and evaluate their performance on a variety of text and image classification tasks, including a large-scale study using a novel data set of 15 diverse NLP tasks. Across all settings, we observe that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially compared to randomly initialized models. We then further investigate why pre-training alleviates forgetting in this setting. We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima. Based on this insight, we propose jointly optimizing for current task loss and loss basin sharpness to explicitly encourage wider basins during sequential fine-tuning. We show that this optimization approach outperforms several state-of-the-art task-sequential continual learning algorithms across multiple settings, occasionally even without retaining a memory that scales in size with the number of tasks. |
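The abstract's proposal — jointly optimizing current task loss and loss basin sharpness — follows the sharpness-aware minimization (SAM) recipe: ascend within a small ball of radius ρ to a nearby worst-case point, then descend using the gradient computed there. A minimal, dependency-free sketch on a toy quadratic (all names and the analytic-gradient setup are illustrative, not the authors' code):

```python
# SAM-style update sketch: perturb weights toward higher loss within an
# L2 ball of radius rho, then apply the gradient taken at that point.
# Toy anisotropic quadratic with analytic gradients; illustrative only.
import math

def toy_loss(w):
    # Sharp along w[0], flat along w[1].
    return 10.0 * w[0] ** 2 + 0.1 * w[1] ** 2

def toy_grad(w):
    return [20.0 * w[0], 0.2 * w[1]]

def sam_step(w, lr=0.05, rho=0.02):
    g = toy_grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # Ascend to the approximate worst-case point in the rho-ball ...
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # ... then descend using the gradient evaluated there.
    g_adv = toy_grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [1.0, 1.0]
for _ in range(200):
    w = sam_step(w)
```

Because the descent direction is taken at the perturbed point, minima that are sharp in some direction are penalized, biasing the trajectory toward the wider basins the paper associates with reduced forgetting.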
| Researcher Affiliation | Academia | Sanket Vaibhav Mehta EMAIL School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213, USA Darshan Patil EMAIL Mila Quebec AI Institute Université de Montréal Montreal, QC H3T 1J4, Canada Sarath Chandar EMAIL Mila Quebec AI Institute Canada CIFAR AI Chair École Polytechnique de Montréal Montreal, QC H3T 1J4, Canada Emma Strubell EMAIL School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213, USA |
| Pseudocode | No | The paper describes methods and procedures in paragraph form and through mathematical equations, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | 2. Code is available at https://github.com/sanketvmehta/lifelong-learning-pretraining-and-sam |
| Open Datasets | Yes | We perform extensive experiments on widely adopted task-incremental learning benchmarks (Chaudhry et al., 2019; Ebrahimi et al., 2020; Wang et al., 2020) across both CV and NLP domains. 5-dataset-CV consists of five diverse 10-way image classification tasks: CIFAR-10 (Krizhevsky and Hinton, 2009), MNIST (LeCun, 1998), Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), and notMNIST (Bulatov, 2011). Split Yahoo QA consists of five homogeneous 2-way classification tasks and is built from a 10-way topic classification data set (Yahoo QA; Zhang et al., 2015). 15-dataset-NLP is a novel suite of diverse tasks for lifelong learning. It consists of fifteen text classification tasks covering a broad range of domains and data sources. We design our benchmark from existing tasks... |
| Dataset Splits | Yes | Table 1: 5-dataset-CV statistics. |Train|, |Dev|, |Test| denotes the number of examples in the train, dev, and test splits respectively. Split CIFAR-50 [...] Each task contains 5,000/1,000 (train/test) examples. Split CIFAR-100 splits the CIFAR-100 data set into 20 disjoint 5-way classification tasks, with each task containing 2,500/500 (train/test) examples. Split Yahoo QA [...] Each task includes around 279k/12k (train/test) examples. 5-dataset-NLP [...] we have 115k/7.6k (train/test) examples per task. Table 2 details the evaluation metrics and train/dev/test split sizes for each task. |
| Hardware Specification | No | We would like to acknowledge CMU Workhorse, TIR group, and Compute Canada for providing compute resources for this work. |
| Software Dependencies | No | The paper mentions 'Hugging Face' for the default implementation, 'scipy' for the L-BFGS-B algorithm, 'Adam' as an optimizer, and 'pytorch-hessian-eigenthings', but does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | Appendix A. Implementation Details A.1 CV Experiments: For all vision experiments, we use the full ResNet-18 (He et al., 2016) architecture, with the final linear layer replaced... We used an SGD optimizer with the learning rate set to 0.01... The batch size was set to 10 for the Split CIFAR-50 and Split CIFAR-100 experiments and 64 for the 5-dataset-CV experiments. The memory per class for ER was set to 1, and the λ parameter for EWC was also set to 1. For Stable SGD, we performed a hyperparameter sweep over the parameters specified in the original paper... For Mode Connectivity SGD... we used an initial learning rate of 0.1, a learning rate decay of 0.8, a momentum of 0.8, a dropout of 0.1, a batch size of 10... A.2 NLP Experiments: We use Adam as our optimizer, set dropout to 0.1, the base learning rate to 2e-5, batch size to 32, and the maximum total input sequence length after tokenization to 128. For EWC, we set the regularization strength λ to 100... for ER... the memory per class per task is set to 1. For SAM, we set ρ = 0.02 for all models... For Split Yahoo QA we set ρ = 0.001. |
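For quick reference, the hyperparameters reported in Appendix A can be collected into config dictionaries. The values below are transcribed from the quoted setup; the key names and dict layout are our own convenience, not the authors' code.

```python
# Hyperparameters as reported in Appendix A of the paper (values only;
# key names are illustrative, not taken from the released codebase).
CV_CONFIG = {
    "architecture": "ResNet-18",
    "optimizer": "SGD",
    "learning_rate": 0.01,
    "batch_size_split_cifar": 10,   # Split CIFAR-50 / Split CIFAR-100
    "batch_size_5_dataset_cv": 64,  # 5-dataset-CV
    "er_memory_per_class": 1,
    "ewc_lambda": 1,
}

NLP_CONFIG = {
    "optimizer": "Adam",
    "dropout": 0.1,
    "learning_rate": 2e-5,
    "batch_size": 32,
    "max_seq_length": 128,
    "ewc_lambda": 100,
    "er_memory_per_class_per_task": 1,
    "sam_rho_default": 0.02,
    "sam_rho_split_yahoo_qa": 0.001,
}
```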