Hyperparameters in Continual Learning: A Reality Check

Authors: Sungmin Cha, Kyunghyun Cho

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Across more than 8,000 experiments, our results show that most state-of-the-art algorithms fail to replicate their reported performance, highlighting that their CL capacity has been significantly overestimated in the conventional evaluation protocol."
Researcher Affiliation | Collaboration | Sungmin Cha (New York University); Kyunghyun Cho (New York University & Genentech)
Pseudocode | Yes | Algorithm 1: The Generalizable Two-phase Evaluation Protocol; Algorithm 2: Pseudo algorithm of the hyperparameter tuning phase; Algorithm 3: Pseudo algorithm of the evaluation phase
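The two-phase protocol named above (tune hyperparameters on one dataset, then evaluate the chosen configuration on a disjoint dataset) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper callables `make_scenario`, `train_cl`, and `evaluate` are hypothetical placeholders for the steps in Algorithms 1-3.

```python
def two_phase_protocol(D_HT, D_E, search_space, make_scenario, train_cl, evaluate):
    """Sketch of the generalizable two-phase evaluation protocol.

    D_HT: dataset used only for hyperparameter tuning (phase 1).
    D_E:  disjoint dataset used only for evaluation (phase 2).
    The helper callables are hypothetical stand-ins, not the paper's API.
    """
    # Phase 1: select hyperparameters by validation score on D_HT.
    best_hp, best_score = None, float("-inf")
    for hp in search_space:
        scenario = make_scenario(D_HT)  # CL tasks with train/val splits
        score = evaluate(train_cl(scenario, hp), scenario, split="val")
        if score > best_score:
            best_hp, best_score = hp, score
    # Phase 2: retrain and report test performance on D_E with the frozen hp.
    scenario = make_scenario(D_E)
    return best_hp, evaluate(train_cl(scenario, best_hp), scenario, split="test")
```

The key design point is that the hyperparameters never see `D_E`: phase 2 only consumes the single configuration selected in phase 1.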
Open Source Code | No | The paper states, "We conduct experiments using the implementation code proposed in PyCIL (Zhou et al., 2023a)" and "All experiments are conducted using code implemented in PILOT (Sun et al., 2023)", referring to third-party tools, but there is no explicit statement or link indicating the release of the authors' own code for the methodology described in this paper.
Open Datasets | Yes | We conduct the hyperparameter tuning and evaluation phases using benchmark datasets, as shown in Table 1. From ImageNet-1k (Deng et al., 2009), we derive two subsets, ImageNet-100-1 and ImageNet-100-2, each containing 100 randomly selected non-overlapping classes. To account for varying dataset similarities, we further divide CIFAR-100 (Krizhevsky et al., 2009) and ImageNet-100-1 into disjoint classes, generating CIFAR-50-1, CIFAR-50-2, ImageNet-50-1, and ImageNet-50-2. [...] using widely used datasets in class-incremental learning (class-IL) with pretrained models, including CUB-200 (Wah et al., 2011), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-A (Hendrycks et al., 2021b)
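The subset construction described above (e.g., ImageNet-100-1/-2 as two non-overlapping 100-class subsets, or CIFAR-50-1/-2 as disjoint halves of CIFAR-100) amounts to shuffling the class labels and carving out disjoint slices. A minimal sketch under that assumption — not the authors' exact sampling code:

```python
import random

def disjoint_class_subsets(classes, sizes, seed=0):
    """Partition class labels into random, non-overlapping subsets.

    Illustrative sketch: shuffle the label list once, then slice off
    consecutive, therefore disjoint, blocks of the requested sizes.
    """
    assert sum(sizes) <= len(classes), "requested subsets must fit without overlap"
    rng = random.Random(seed)
    shuffled = classes[:]
    rng.shuffle(shuffled)
    subsets, start = [], 0
    for n in sizes:
        subsets.append(set(shuffled[start:start + n]))
        start += n
    return subsets

# e.g. two disjoint 100-class subsets drawn from the 1,000 ImageNet classes:
# in100_1, in100_2 = disjoint_class_subsets(list(range(1000)), [100, 100])
```

Fixing the seed makes the split reproducible, which matters when the tuning and evaluation phases must agree on which classes belong to which subset.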
Dataset Splits | Yes | First, a CL scenario is constructed using a benchmark dataset, where each task has its own training, validation, and test sets. [...] Both phases share the same CL scenario configuration (e.g., the number of tasks and the number of classes in each task), but they are generated from distinct datasets (D_HT ≠ D_E). [...] Algorithm 2: Pseudo algorithm of the hyperparameter tuning phase: [...] D_HT_tr, D_HT_val ← F(Shuffle(D_HT)) [...] Algorithm 3: Pseudo algorithm of the evaluation phase: [...] D_E_tr, D_E_val ← F(Shuffle(D_E))
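The split step quoted from Algorithms 2-3 — obtaining per-task train/validation sets from a shuffled dataset via some split function F — can be sketched as follows. This is an assumption-laden illustration (the class-incremental grouping and the 10% validation fraction are my choices, not specified here), with `dataset` modeled as a class-to-examples mapping:

```python
import random

def shuffle_and_split(dataset, num_tasks, val_frac=0.1, seed=0):
    """Sketch of D_tr, D_val <- F(Shuffle(D)):

    shuffle the class order, partition the classes into tasks, then
    split each task's examples into train and validation subsets.
    `dataset` maps class label -> list of examples. The val_frac
    value is illustrative, not taken from the paper.
    """
    rng = random.Random(seed)
    classes = sorted(dataset)
    rng.shuffle(classes)                       # Shuffle(D): randomize class order
    per_task = len(classes) // num_tasks
    tasks_tr, tasks_val = [], []
    for t in range(num_tasks):
        task_classes = classes[t * per_task:(t + 1) * per_task]
        tr, val = {}, {}
        for c in task_classes:
            examples = dataset[c][:]
            rng.shuffle(examples)
            k = max(1, int(len(examples) * val_frac))
            val[c], tr[c] = examples[:k], examples[k:]
        tasks_tr.append(tr)
        tasks_val.append(val)
    return tasks_tr, tasks_val
```

Running F over D_HT and D_E separately, as in Algorithms 2 and 3, then yields structurally identical scenarios built from disjoint data.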
Hardware Specification | No | The paper mentions "GPU usage," "NYU IT High Performance Computing resources," and CUDA 11.7, but does not specify concrete hardware models (e.g., specific GPU models like NVIDIA A100, CPU types, or memory amounts).
Software Dependencies | Yes | We conduct all experiments using PyCIL (Zhou et al., 2023a) in the following environment: Python 3.8, PyTorch 1.13.1, and CUDA 11.7. We use ResNet-18 and ResNet-32 architectures for our experiments. [...] The experimental setup closely followed PILOT's environment, using Python 3.8, PyTorch 2.0.1, and CUDA 11.7.
Experiment Setup | Yes | Table 3: Hyperparameters for training the first task — init epochs: 200; init learning rate: 0.1; init milestones: [60, 120, 170] (only applied when StepLR is selected); init learning rate decay: 0.1; init weight decay: 0.0005. [...] Table 4: The predefined set of hyperparameters for class-IL without a pretrained model. [...] (followed by detailed lists of hyperparameters for various algorithms and scenarios)
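The Table 3 values above define a standard step-decay schedule: the learning rate starts at 0.1 and is multiplied by the decay factor 0.1 at epochs 60, 120, and 170 over 200 epochs. A plain-Python sketch of what a scheduler such as PyTorch's `MultiStepLR` computes for these settings (the function name here is mine, not from the paper):

```python
def step_lr_schedule(init_lr=0.1, milestones=(60, 120, 170), decay=0.1, epochs=200):
    """Per-epoch learning rates for a StepLR/MultiStepLR-style schedule.

    Defaults mirror the first-task hyperparameters in Table 3: the rate
    is multiplied by `decay` each time a milestone epoch is reached.
    """
    lrs, lr = [], init_lr
    for epoch in range(epochs):
        if epoch in milestones:
            lr *= decay
        lrs.append(lr)
    return lrs
```

With these defaults the schedule runs at 0.1 for epochs 0-59, 0.01 for 60-119, 0.001 for 120-169, and 0.0001 from epoch 170 on.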