Self-Normalized Resets for Plasticity in Continual Learning

Authors: Vivek Farias, Adam Jozefiak

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that SNR consistently attains superior performance compared to its competitor algorithms. We also demonstrate that SNR is robust to its sole hyperparameter, its rejection percentile threshold, while competitor algorithms show significant sensitivity. ... We evaluate the efficacy and robustness of SNR on a series of benchmark problems from the continual learning literature, measuring regret with respect to prediction accuracy ... Table 1 shows that across all four problems (PM, RM, RC, CI) and both SGD and Adam, SNR consistently attains the largest average accuracy on the final 10% of tasks.
Researcher Affiliation | Academia | Vivek F. Farias, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Adam D. Jozefiak, Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Pseudocode | Yes | Algorithm 1 (SNR: Self-Normalized Resets). Input: reset percentile threshold η. Initialize: initialize weights θ_0 randomly; set inter-firing time a_i = 0 for each neuron i. For each training example x_t: Forward pass: evaluate f(x_t; θ_t) and record the activation z_{t,i} of each neuron i. Update inter-firing times: a_i ← a_i + 1 if z_{t,i} = 0, otherwise a_i ← 0. Optimize: θ_{t+1} ← O_t(H_t). Resets: for each neuron i, reset if P(A_i^{μ_t} ≥ a_i) ≤ η.
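The reset rule in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming (hypothetically) that the reference inter-firing-time distribution A_i^{μ_t} is geometric with the neuron's empirical firing rate; the paper's exact reference distribution may differ, and the function names `snr_should_reset` and `snr_step` are illustrative, not from the released code:

```python
def snr_should_reset(a_i, firing_rate, eta):
    """Reset test for one neuron: reset when P(A >= a_i) <= eta.

    Modeling assumption (not from the paper): inter-firing times are
    geometric on {1, 2, ...} with success probability `firing_rate`,
    so the tail probability is P(A >= a) = (1 - firing_rate)**(a - 1).
    """
    if a_i <= 0:
        return False  # neuron fired on the most recent example
    tail = (1.0 - firing_rate) ** (a_i - 1)
    return tail <= eta


def snr_step(activations, counters, firing_rates, eta):
    """One SNR bookkeeping step after a forward pass.

    activations[i] is neuron i's activation z_{t,i}; counters[i] is its
    inter-firing time a_i. Returns the indices of neurons to reset and
    updates the counters in place.
    """
    to_reset = []
    for i, z in enumerate(activations):
        # a_i <- a_i + 1 if z_{t,i} = 0, otherwise a_i <- 0
        counters[i] = counters[i] + 1 if z == 0 else 0
        if snr_should_reset(counters[i], firing_rates[i], eta):
            to_reset.append(i)
            counters[i] = 0  # a reset neuron starts its count afresh
    return to_reset
```

Under these assumptions, with a firing rate of 0.5 and η = 0.01, a neuron is flagged once it has been inactive for 8 consecutive examples, since (0.5)^7 ≈ 0.0078 ≤ 0.01 while (0.5)^6 ≈ 0.0156 is not.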
Open Source Code | Yes | Code: https://github.com/ajozefiak/SelfNormalizedResets
Open Datasets | Yes | Permuted MNIST (PM) (Goodfellow et al., 2013; Dohare et al., 2021; Kumar et al., 2023b): A subset of 10000 image-label pairs from the MNIST dataset is sampled for an experiment. ... Random Label CIFAR (RC) (Kumar et al., 2023b; Lyle et al., 2023): A subset of 128 images from the CIFAR-10 dataset is sampled for an experiment. ... Continual ImageNet (CI) (Dohare et al., 2023; Kumar et al., 2023b): An experiment consists of all 1000 classes of images from the ImageNet-32 dataset (Chrabaszcz et al., 2017), containing 600 images from each class.
Dataset Splits | No | The paper specifies counts of images/tokens sampled per task and the structure of tasks (e.g., "A subset of 10000 image-label pairs from the MNIST dataset are sampled for an experiment"), the number of tasks, and how images are used within tasks. It also mentions a "test loss on a holdout set for each task" in Section A.2 for generalization experiments. However, it does not provide explicit training, validation, and test splits (e.g., percentages or exact counts) for the *overall base datasets* (MNIST, CIFAR-10, ImageNet-32) that would be needed for a standard, reproducible dataset partitioning before the continual learning tasks begin.
Hardware Specification | No | The paper does not explicitly mention any specific hardware used for running its experiments, such as GPU models, CPU types, or memory specifications. It only refers to general network architectures such as MLPs, CNNs, and Transformers.
Software Dependencies | No | The paper mentions optimizers such as SGD and Adam, and a "GPT-2 BPE tokenizer", but does not provide specific version numbers for any software libraries, frameworks, or tools used in the implementation or execution of the experiments.
Experiment Setup | Yes | With SGD we train with learning rate 10^-2 on all problems except Random Label MNIST, for which we train with learning rate 10^-1. With Adam we train with learning rate 10^-3 on all problems, including Permuted Shakespeare, and we use the standard parameters of β1 = 0.9, β2 = 0.999, and ϵ = 10^-7. ... We perform an initial hyperparameter sweep over 5 seeds to determine the optimal choice of hyperparameters (see Appendix C). For each algorithm and problem, we select the hyperparameters that attain the lowest average loss and repeat the experiment on 5 new random seeds. ... Each task consists of a single epoch and the network receives data in batches of size 16. ... An agent is trained for 400 epochs on each task with a batch size of 16.
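The quoted settings can be collected into a small configuration sketch for reference. The constant names below are illustrative and do not come from the authors' released code; the values are exactly those reported in the row above:

```python
# Hedged summary of the training settings quoted above (names are
# illustrative, not from the authors' code).
SGD_LR = {
    "Permuted MNIST": 1e-2,
    "Random Label MNIST": 1e-1,  # the one problem trained at 10^-1
    "Random Label CIFAR": 1e-2,
    "Continual ImageNet": 1e-2,
}
ADAM = {"lr": 1e-3, "betas": (0.9, 0.999), "eps": 1e-7}
BATCH_SIZE = 16
SWEEP_SEEDS = 5  # seeds for the initial hyperparameter sweep
EVAL_SEEDS = 5   # fresh seeds for the reported runs
```

These dictionaries mirror, for instance, the arguments one would pass to a standard Adam implementation (learning rate, (β1, β2), and ϵ).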