Hyperparameters in Continual Learning: A Reality Check
Authors: Sungmin Cha, Kyunghyun Cho
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across more than 8,000 experiments, our results show that most state-of-the-art algorithms fail to replicate their reported performance, highlighting that their CL capacity has been significantly overestimated in the conventional evaluation protocol. |
| Researcher Affiliation | Collaboration | Sungmin Cha, New York University; Kyunghyun Cho, New York University & Genentech |
| Pseudocode | Yes | Algorithm 1: The Generalizable Two-phase Evaluation Protocol; Algorithm 2: Pseudo algorithm of the hyperparameter tuning phase; Algorithm 3: Pseudo algorithm of the evaluation phase |
| Open Source Code | No | The paper states, "We conduct experiments using the implementation code proposed in PyCIL (Zhou et al., 2023a)" and "All experiments are conducted using code implemented in PILOT (Sun et al., 2023)," referring to third-party tools, but there is no explicit statement or link indicating the release of the authors' own code for the methodology described in this paper. |
| Open Datasets | Yes | We conduct the hyperparameter tuning and evaluation phases using benchmark datasets, as shown in Table 1. From ImageNet-1k (Deng et al., 2009), we derive two subsets, ImageNet-100-1 and ImageNet-100-2, each containing 100 randomly selected non-overlapping classes. To account for varying dataset similarities, we further divide CIFAR-100 (Krizhevsky et al., 2009) and ImageNet-100-1 into disjoint classes, generating CIFAR-50-1, CIFAR-50-2, ImageNet-50-1, and ImageNet-50-2. [...] using widely used datasets in class-incremental learning (class-IL) with pretrained models, including CUB-200 (Wah et al., 2011), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-A (Hendrycks et al., 2021b) |
| Dataset Splits | Yes | First, a CL scenario is constructed using a benchmark dataset, where each task has its own training, validation, and test sets. [...] Both phases share the same CL scenario configuration (e.g., the number of tasks and number of classes in each task), but they are generated from distinct datasets (D_HT ≠ D_E). [...] Algorithm 2: Pseudo algorithm of the hyperparameter tuning phase: [...] D_HT_tr, D_HT_val ← F(Shuffle(D_HT)) [...] Algorithm 3: Pseudo algorithm of the evaluation phase: [...] D_E_tr, D_E_val ← F(Shuffle(D_E)) |
| Hardware Specification | No | The paper mentions "GPU usage," "NYU IT High Performance Computing resources," and CUDA 11.7, but does not specify concrete hardware models (e.g., specific GPU models such as NVIDIA A100, CPU types, or memory amounts). |
| Software Dependencies | Yes | We conduct all experiments using PyCIL (Zhou et al., 2023a) in the following environment: Python 3.8, PyTorch 1.13.1, and CUDA 11.7. We use ResNet-18 and ResNet-32 architectures for our experiments. [...] The experimental setup closely followed PILOT's environment, using Python 3.8, PyTorch 2.0.1, and CUDA 11.7. |
| Experiment Setup | Yes | Table 3: Hyperparameters for training the first task: Init epochs: 200; Init learning rate: 0.1; Init milestones: [60, 120, 170] (only applied when StepLR is selected); Init learning rate decay: 0.1; Init weight decay: 0.0005. [...] Table 4: The predefined set of hyperparameters for class-IL without a pretrained model. [...] (followed by detailed lists of hyperparameters for various algorithms and scenarios) |
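The two-phase protocol quoted above (Algorithms 1-3) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`train_cl`, `accuracy`) and the toy scoring logic are hypothetical stand-ins; the key point shown is that hyperparameters are selected on one dataset (D_HT) and the chosen configuration is then evaluated on a disjoint dataset (D_E).

```python
import itertools

def train_cl(train_tasks, hparams):
    """Stand-in for running a CL algorithm over a task sequence.

    Returns a toy score so the sketch is runnable; a real
    implementation would return a trained model.
    """
    return sum(len(t) for t in train_tasks) * hparams["lr"]

def accuracy(model, eval_tasks):
    """Stand-in metric: a score meant to mimic average accuracy
    over all tasks seen so far."""
    return model / (1.0 + len(eval_tasks))

def two_phase_protocol(d_ht, d_e, search_space):
    """Phase 1: select hparams on D_HT; Phase 2: evaluate on D_E."""
    assert d_ht is not d_e, "tuning and evaluation datasets must be distinct"
    # Phase 1 (hyperparameter tuning): grid search over the predefined
    # set, scoring each configuration on the validation split of D_HT.
    best_hp = max(
        (dict(zip(search_space, vals))
         for vals in itertools.product(*search_space.values())),
        key=lambda hp: accuracy(train_cl(d_ht["train"], hp), d_ht["val"]),
    )
    # Phase 2 (evaluation): retrain on D_E with the selected
    # hyperparameters and report test performance.
    model = train_cl(d_e["train"], best_hp)
    return best_hp, accuracy(model, d_e["test"])
```

The `assert` mirrors the paper's requirement that both phases share the same scenario configuration while using distinct datasets, which is what distinguishes this protocol from the conventional practice of tuning and evaluating on the same benchmark.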