Learning Continually by Spectral Regularization

Authors: Alex Lewandowski, Michał Bortkiewicz, Saurabh Kumar, Andras Gyorgy, Dale Schuurmans, Mateusz Ostaszewski, Marlos C. Machado

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present an experimental analysis that shows how the proposed spectral regularizer can sustain trainability and performance across a range of model architectures in continual supervised and reinforcement learning settings. Our experiments show that spectral regularization is more performant and less sensitive to hyperparameters than other regularizers across datasets, nonstationarities, and architectures.
Researcher Affiliation | Collaboration | 1. University of Alberta, 2. Warsaw University of Technology, 3. Stanford University, 4. Google DeepMind, 5. Amii, 6. Canada CIFAR AI Chair
Pseudocode | No | The paper describes the methodology using mathematical equations and textual descriptions, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | Our main results use all commonly used image classification datasets for continual supervised learning: tiny-ImageNet (Le and Yang, 2015), CIFAR10, and CIFAR100 (Krizhevsky, 2009). Experiments in the appendix also use smaller-scale datasets, like MNIST (LeCun et al., 1998), Fashion MNIST (Xiao et al., 2017), EMNIST (Cohen et al., 2017), and SVHN2 (Netzer et al., 2011). In addition to supervised learning, we evaluate spectral regularization in reinforcement learning. Specifically, we investigate the control tasks from the DMC benchmark (Tassa et al., 2020).
Dataset Splits | Yes | The first 50,000 images were used for training, 5,000 images from the test set were used for validation, and the rest were used for testing. (SVHN2) All of the 50,000 images were used for training, 1,000 images from the test set were used for validation, and the rest were used for testing. (CIFAR10, CIFAR100) All of the 100,000 images were used for training, 10,000 images were used for validation, and 10,000 images were used for testing according to the predetermined split. (tiny-ImageNet)
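The splits described above can be sketched as index ranges. This is a minimal illustration; the helper name is ours, and which test images form the validation set (here, the first ones) is an assumption not stated in the paper.

```python
def carve_val_from_test(n_train, n_test, n_val):
    """Build (train, val, test) index lists where the validation set is
    carved out of the official test set, as described for SVHN2 and the
    CIFAR datasets. Illustrative helper, not taken from the paper's code."""
    train = list(range(n_train))
    val = list(range(n_val))           # assumed: first n_val test images
    test = list(range(n_val, n_test))  # remaining test images
    return train, val, test

# CIFAR10/CIFAR100: 50,000 training images; 1,000 of the 10,000
# test images held out for validation, 9,000 kept for testing.
train, val, test = carve_val_from_test(50_000, 10_000, 1_000)
print(len(train), len(val), len(test))  # 50000 1000 9000
```

For SVHN2 the same helper would be called with `n_val=5_000`; tiny-ImageNet instead uses its predetermined 100,000/10,000/10,000 split directly.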
Hardware Specification | Yes | On a 1080 Ti, training with spectral regularization is approximately 14% slower.
Software Dependencies | No | The paper mentions using the 'Adam' optimizer and the 'soft actor-critic (SAC)' method, but does not provide specific version numbers for any software libraries, frameworks, or solvers used in the implementation.
Experiment Setup | Yes | All of our experiments used Adam (Kingma and Ba, 2015), where the default step size of 0.001 was selected after an initial sweep over [0.005, 0.001, 0.0005]. For all of our results, we use 10 random seeds and provide a shaded region corresponding to the standard error of the mean. For experiments on tiny-ImageNet, SVHN2, CIFAR10, and CIFAR100, we used 4 seeds to sweep over the regularization strengths [0.01, 0.001, 0.0001], and found that 0.0001 worked well on tiny-ImageNet, CIFAR10, and CIFAR100 for all regularizers, whereas 0.001 worked best on SVHN2 for all regularizers. For random-label nonstationarity, we used a batch size of 500 and 100 epochs per task, with a total of 50 tasks. For both label-flipping and pixel-permutation nonstationarities, we used a batch size of 500 and 20 epochs per task, with a total of 200 tasks. (EMNIST details are given.) The batch size used was 500. (SVHN2) We used a batch size of 250 and found this to be effective. (CIFAR10, CIFAR100, tiny-ImageNet) We choose this setup because a high replay-ratio (RR) regime leads to significant primacy bias (Nikishin et al., 2022), defined as a tendency to overfit initial experiences that damages the rest of the learning process. We use spectral regularization with a coefficient of 1e-4 for both the actor and critic. For every method, we use a single critic and an architecture of 2 layers with 256 neurons per layer for both actor and critic. Using a random policy, we prefill a replay buffer with 10,000 transitions before starting the training. The replay buffer's maximum size is 1 million transitions.
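As a rough illustration of the kind of penalty whose strength is being swept (0.01/0.001/0.0001 above, and 1e-4 in the RL experiments), the sketch below penalizes the deviation of each weight matrix's largest singular value from a target of 1, estimated by power iteration. This is a minimal NumPy sketch under our own assumptions (function name, target value `k`, iteration count); the paper's exact regularizer may differ.

```python
import numpy as np

def spectral_penalty(weights, k=1.0, n_iter=10, seed=0):
    """Sum of (sigma_max(W) - k)^2 over 2-D weight matrices, with the top
    singular value sigma_max estimated by power iteration. Hedged sketch:
    names and defaults are illustrative, not the paper's implementation."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for W in weights:
        v = rng.standard_normal(W.shape[1])
        for _ in range(n_iter):       # power iteration on W^T W
            u = W @ v
            u /= np.linalg.norm(u) + 1e-12
            v = W.T @ u
            v /= np.linalg.norm(v) + 1e-12
        sigma = u @ W @ v             # Rayleigh-quotient estimate of sigma_max
        total += (sigma - k) ** 2
    return total

# A matrix with singular values {2.0, 0.5} has penalty (2 - 1)^2 = 1.
print(spectral_penalty([np.diag([2.0, 0.5])]))  # -> 1.0 (up to iteration error)
```

In training, such a penalty would be scaled by the swept coefficient (e.g. 1e-4) and added to the task loss; an autodiff framework would then differentiate through it.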