Metalearning Continual Learning Algorithms

Authors: Kazuki Irie, Róbert Csordás, Jürgen Schmidhuber

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that ACL effectively resolves in-context catastrophic forgetting, a problem from which naive in-context learning algorithms suffer; ACL-learned algorithms outperform both hand-crafted learning algorithms and popular meta-continual learning methods on the Split-MNIST benchmark in the replay-free setting, and enable continual learning of diverse tasks consisting of multiple standard image classification datasets. Our experiments reveal various facets of in-context CL: (1) we show that without ACL, naive in-context learners suffer from in-context catastrophic forgetting (Sec. 4.1), and we illustrate its emergence (Sec. 4.2) using comprehensible two-task settings; (2) we show very promising practical results of ACL by successfully metalearning a CL algorithm that outperforms hand-crafted learning algorithms and prior meta-continual learning methods (Javed and White, 2019; Beaulieu et al., 2020; Banayeeanzade et al., 2021) on the classic Split-MNIST benchmark (Hsu et al., 2018; Van de Ven and Tolias, 2018b; Sec. 4.3); and (3) we highlight the current limitations and the need for further scaling up ACL, through a comparison with the prompt-based CL methods (Wang et al., 2022b;a) that leverage pre-trained models, using Split-CIFAR100 and 5-datasets (Ebrahimi et al., 2020).
Researcher Affiliation | Academia | Kazuki Irie (EMAIL), Harvard University, Cambridge, MA, USA; Róbert Csordás (EMAIL), Stanford University, Stanford, CA, USA; Jürgen Schmidhuber (EMAIL), Center for Generative AI, KAUST, Thuwal, Saudi Arabia, and The Swiss AI Lab, IDSIA, USI & SUPSI, Lugano, Switzerland
Pseudocode | No | The paper describes mathematical equations for the SRWM dynamics (Eqs. 1-5) and textual descriptions of the method, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is public: https://github.com/IDSIA/automated-cl.
Open Datasets | Yes | Our experiments focus on supervised image classification, making use of standard few-shot learning datasets for meta-training, namely, Mini-ImageNet (Vinyals et al., 2016; Ravi and Larochelle, 2017), Omniglot (Lake et al., 2015), and FC100 (Oreshkin et al., 2018), while we also meta-test on other datasets including MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky, 2009).
Dataset Splits | Yes | For Omniglot (Lake et al., 2015), we use Vinyals et al. (2016)'s 1028/172/432 split for the train/validation/test set, as well as their data augmentation method using rotations of 90, 180, and 270 degrees. ... Mini-ImageNet contains color images from 100 classes with 600 examples for each class. We use the standard train/valid/test class splits of 64/16/20 following Ravi and Larochelle (2017). FC100 is based on CIFAR-100 (Krizhevsky, 2009). Its 100 color image classes (600 images per class, each of size 32×32) are split into train/valid/test classes of 60/20/20 (Oreshkin et al., 2018). ... For the standard datasets such as MNIST, we split the dataset into subsets of disjoint classes (Srivastava et al., 2013): for example, MNIST, which is originally a 10-way classification task, is split into two 5-way tasks, one consisting of images of classes 0 to 4 ("MNIST-04"), and another made of class 5 to 9 images ("MNIST-59"). ... Unless stated otherwise, we concatenate 15 examples from each class for each task in the context for both meta-training and meta-testing (resulting in sequences of length 75 for each task).
Hardware Specification | Yes | We conduct our experiments using a single V100-32GB, 2080-12GB, or P100-16GB GPU, and the longest single meta-training run takes about one day.
Software Dependencies | No | The paper mentions software such as torchmeta (Deleu et al., 2019), PyTorch (Paszke et al., 2019), and torchvision, but does not provide version numbers for these dependencies.
Experiment Setup | Yes | All hyperparameters are summarized in Table 5. We use the Adam optimizer with the standard Transformer learning-rate warmup schedule (Vaswani et al., 2017). The vision backend is the classic 4-layer convolutional NN of Vinyals et al. (2016). ... Table 5: number of SRWM layers: 2; total hidden size: 256; feedforward block multiplier: 2; number of heads: 16; batch size: 16 or 32. ... All our models use instance normalization (IN; Ulyanov et al. (2016)) instead of BN. ... Unless stated otherwise, we concatenate 15 examples from each class for each task in the context for both meta-training and meta-testing. ... We initialize the query sub-matrix in the self-referential weight matrix using a normal distribution with a mean of 0 and a standard deviation of 0.01/√d_head, while the other sub-matrices use a std of 1/√d_head (motivated by the fact that a generated query vector is immediately multiplied with the same SRWM to produce a value vector).
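The Split-MNIST context construction quoted in the Dataset Splits row above (disjoint 5-way class splits, 15 examples per class, 75-example contexts per task) can be sketched as follows. This is a minimal illustration on label arrays; the function names and the use of NumPy index arrays are assumptions, not the authors' implementation.

```python
import numpy as np

def split_by_classes(labels, class_groups):
    """Return one index array per task, selecting the examples whose label
    falls in that task's class group, e.g. [[0..4], [5..9]] for Split-MNIST."""
    labels = np.asarray(labels)
    return [np.flatnonzero(np.isin(labels, group)) for group in class_groups]

def build_context(labels, task_indices, shots=15, seed=0):
    """Sample `shots` examples per class within one task and concatenate them,
    giving a context of shots * num_classes examples (75 for a 5-way task)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    task_labels = labels[task_indices]
    picked = [rng.choice(task_indices[task_labels == c], size=shots, replace=False)
              for c in np.unique(task_labels)]
    return np.concatenate(picked)
```

For MNIST this yields two tasks ("MNIST-04" and "MNIST-59"), each contributing a length-75 context segment.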
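The SRWM initialization quoted in the Experiment Setup row above can be sketched as follows. The row-wise block layout (output, key, query, learning rate) and the function name are illustrative assumptions; only the standard deviations (0.01/√d_head for the query sub-matrix, 1/√d_head for the others) come from the paper.

```python
import torch

def init_srwm_weight(d_head: int) -> torch.Tensor:
    """Initialize one self-referential weight matrix (per head).

    Assumed layout: the SRWM maps a d_head-dim input to an output y, a key k,
    a query q, and a scalar learning rate beta, stacked row-wise."""
    std = 1.0 / d_head ** 0.5
    blocks = [
        torch.randn(d_head, d_head) * std,         # output sub-matrix
        torch.randn(d_head, d_head) * std,         # key sub-matrix
        # Query rows use a 100x smaller std: the generated query is
        # immediately multiplied with the same SRWM to produce a value
        # vector, so small queries keep early self-updates stable.
        torch.randn(d_head, d_head) * 0.01 * std,  # query sub-matrix
        torch.randn(1, d_head) * std,              # learning-rate row
    ]
    return torch.cat(blocks, dim=0)
```

For d_head = 16 (hidden size 256 with 16 heads, as in Table 5), the matrix has shape (3 * 16 + 1, 16), with rows 32-47 forming the small-std query block.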