Improving Continual Learning Performance and Efficiency with Auxiliary Classifiers

Authors: Filip Szatkowski, Yaoyue Zheng, Fei Yang, Tomasz Trzcinski, Bartłomiej Twardowski, Joost van de Weijer

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach on CIFAR100 and ImageNet100, each split into 5 and 10 equally sized, disjoint tasks, and present the main results in Table 1. Across all methods and settings, adding ACs consistently improves final performance, with the average relative improvement exceeding 10% of the baseline accuracy in every scenario tested. Interestingly, naive finetuning and EWC exhibit particularly strong gains, highlighting the potential of our simple yet effective idea.
Researcher Affiliation | Collaboration | 1 Warsaw University of Technology; 2 IDEAS NCBR; 3 Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, China; 4 Computer Vision Center, Barcelona; 5 VCIP, College of Computer Science, Nankai University; 6 NKIARI, Shenzhen Futian; 7 IDEAS Research Institute; 8 Tooploox; 9 Universitat Autònoma de Barcelona. Correspondence to: Fei Yang <EMAIL>.
Pseudocode | No | The paper describes the methodology in prose, including concepts like the 'dynamic inference rule' (Equation 1) and 'AC-enhanced CL methods', but it does not present any formal pseudocode blocks or algorithm listings.
Open Source Code | Yes | Reproducibility. The code used to run experiments in this paper is publicly available at https://github.com/fszatkowski/cl-auxiliary-classifiers.
Open Datasets | Yes | We perform experiments on CIFAR100 (Krizhevsky, 2009) and ImageNet100 (the first 100 classes from ImageNet (Deng et al., 2009)), split into tasks containing different numbers of classes.
Dataset Splits | Yes | We perform experiments on CIFAR100 (Krizhevsky, 2009) and ImageNet100 (the first 100 classes from ImageNet (Deng et al., 2009)), split into tasks containing different numbers of classes. Unless stated otherwise, we report average accuracy across all tasks at the end of the training. [...] For all exemplar-based methods (BiC, DER++, ER, GDumb, LODE, and SSIL), we maintain a fixed-size memory budget of 2000 exemplars, updated after each task. We report results averaged over three random seeds. [...] For the 50-task split, we use a growing memory of 20 exemplars instead of a constant memory of 2000 because the early tasks contain fewer samples than the memory limit.
Hardware Specification | No | The paper acknowledges the 'Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2024/017385' and reports 'GPU memory usage (in GB)' in Appendix B, but does not provide specific hardware details such as the GPU models or CPU types used for the experiments.
Software Dependencies | No | All our experiments are conducted with the FACIL (Masana et al., 2022) framework. [...] We train the ResNet32 models on CIFAR100 for 200 epochs on each task, using the SGD optimizer with a batch size of 128 and a learning rate initialized to 0.1 and decayed by a rate of 0.1 at the 60th, 120th, and 160th epochs. [...] For ViT, we use AdamW and train each task for 100 epochs with a learning rate of 0.01 and a batch size of 64.
Experiment Setup | Yes | We train the ResNet32 models on CIFAR100 for 200 epochs on each task, using the SGD optimizer with a batch size of 128 and a learning rate initialized to 0.1 and decayed by a rate of 0.1 at the 60th, 120th, and 160th epochs. For training ResNet18 on ImageNet100, we change the scheduler to cosine with a linear warmup and train for 100 epochs with 5 epochs of warmup... For ViT, we use AdamW and train each task for 100 epochs with a learning rate of 0.01 and a batch size of 64. We also use a cosine scheduler with a linear warmup for 5 epochs. We use a fixed memory of 2000 exemplars selected with herding (Rebuffi et al., 2017). For ER each batch is balanced between old and new data, and for SSIL we use a 4:1 ratio of new to old data. [...] We report results averaged over three random seeds. [...] Similar to (Kaya et al., 2019), to prevent overfitting the network to the early-layer classifiers, we scale the total loss of each classifier according to its position, so that the losses from early classifiers are weighted less than the loss of the final classifier.
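The 'dynamic inference rule' (Equation 1) noted in the Pseudocode row is only described in prose in the paper. A minimal sketch of confidence-thresholded early exiting over a stack of auxiliary classifiers is given below; the max-softmax confidence criterion and the threshold value are assumptions, not details taken from the paper:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_inference(ac_logits, threshold=0.9):
    """Return (classifier_index, predicted_class), exiting at the first
    auxiliary classifier whose max softmax confidence reaches the
    threshold, and falling back to the final classifier otherwise.

    ac_logits: list of logit vectors, ordered from the earliest AC
    to the final classifier.
    """
    for i, logits in enumerate(ac_logits[:-1]):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold:
            return i, probs.index(conf)
    # No early classifier was confident enough: use the final one.
    final = softmax(ac_logits[-1])
    return len(ac_logits) - 1, final.index(max(final))
```

Easy inputs exit at an early classifier and skip the remaining computation, which is where the efficiency gains of AC-based inference come from; hard inputs fall through to the final classifier.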
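The growing memory described in the Dataset Splits row (20 exemplars added per task for the 50-task split) can be sketched as below; random sampling stands in for the herding selection of Rebuffi et al. (2017), which the paper actually uses:

```python
import random


def update_growing_memory(memory, task_data, per_task_budget=20, seed=0):
    """Append up to `per_task_budget` exemplars from the new task to the
    replay memory. Sketch of a growing memory (vs. a fixed 2000-exemplar
    budget); herding selection is replaced by random sampling here.
    """
    rng = random.Random(seed)
    k = min(per_task_budget, len(task_data))
    return memory + rng.sample(task_data, k)
```

With 50 small tasks, the memory grows by at most 20 exemplars per task, which avoids the degenerate case of a 2000-exemplar budget exceeding the number of samples seen in the early tasks.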
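The step-decay schedule quoted in the Software Dependencies and Experiment Setup rows (learning rate 0.1, decayed by a factor of 0.1 at epochs 60, 120, and 160) matches the behavior of PyTorch's `MultiStepLR` and can be computed directly:

```python
def step_lr(epoch, base_lr=0.1, milestones=(60, 120, 160), gamma=0.1):
    """Learning rate at a given epoch under the quoted step-decay
    schedule: multiply the base rate by `gamma` once for every
    milestone the epoch has reached.
    """
    decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * (gamma ** decays)
```

For example, epochs 0–59 train at 0.1, epochs 60–119 at 0.01, and the last 40 epochs at 0.0001.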