Hyperparameters in Continual Learning: A Reality Check
Authors: Sungmin Cha, Kyunghyun Cho
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across more than 8,000 experiments, our results show that most state-of-the-art algorithms fail to replicate their reported performance, highlighting that their CL capacity has been significantly overestimated in the conventional evaluation protocol. |
| Researcher Affiliation | Collaboration | Sungmin Cha, New York University; Kyunghyun Cho, New York University & Genentech |
| Pseudocode | Yes | Algorithm 1: The Generalizable Two-phase Evaluation Protocol; Algorithm 2: Pseudo algorithm of the hyperparameter tuning phase; Algorithm 3: Pseudo algorithm of the evaluation phase |
| Open Source Code | No | The paper states, "We conduct experiments using the implementation code proposed in PyCIL (Zhou et al., 2023a)" and "All experiments are conducted using code implemented in PILOT (Sun et al., 2023)," referring to third-party tools, but there is no explicit statement or link indicating the release of the authors' own code for the methodology described in this paper. |
| Open Datasets | Yes | We conduct the hyperparameter tuning and evaluation phases using benchmark datasets, as shown in Table 1. From ImageNet-1k (Deng et al., 2009), we derive two subsets, ImageNet-100-1 and ImageNet-100-2, each containing 100 randomly selected non-overlapping classes. To account for varying dataset similarities, we further divide CIFAR-100 (Krizhevsky et al., 2009) and ImageNet-100-1 into disjoint classes, generating CIFAR-50-1, CIFAR-50-2, ImageNet-50-1, and ImageNet-50-2. [...] using widely used datasets in class-incremental learning (class-IL) with pretrained models, including CUB-200 (Wah et al., 2011), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-A (Hendrycks et al., 2021b) |
| Dataset Splits | Yes | First, a CL scenario is constructed using a benchmark dataset, where each task has its own training, validation, and test sets. [...] Both phases share the same CL scenario configuration (e.g., the number of tasks and number of classes in each task), but they are generated from distinct datasets (D_HT ≠ D_E). [...] Algorithm 2: Pseudo algorithm of the hyperparameter tuning phase: [...] D_HT_tr, D_HT_val ← F(Shuffle(D_HT)) [...] Algorithm 3: Pseudo algorithm of the evaluation phase: [...] D_E_tr, D_E_val ← F(Shuffle(D_E)) |
| Hardware Specification | No | The paper mentions "GPU usage," "NYU IT High Performance Computing resources," and CUDA 11.7, but does not specify concrete hardware models (e.g., specific GPU models such as NVIDIA A100, CPU types, or memory amounts). |
| Software Dependencies | Yes | We conduct all experiments using PyCIL (Zhou et al., 2023a) in the following environment: Python 3.8, PyTorch 1.13.1, and CUDA 11.7. We use ResNet-18 and ResNet-32 architectures for our experiments. [...] The experimental setup closely followed PILOT's environment, using Python 3.8, PyTorch 2.0.1, and CUDA 11.7. |
| Experiment Setup | Yes | Table 3: Hyperparameters for training the first task: Init epochs: 200; Init learning rate: 0.1; Init milestones: [60, 120, 170] (only applied when StepLR is selected); Init learning rate decay: 0.1; Init weight decay: 0.0005. [...] Table 4: The predefined set of hyperparameters for class-IL without a pretrained model. [...] (followed by detailed lists of hyperparameters for various algorithms and scenarios) |
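The two-phase protocol quoted above (Algorithms 1-3) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`train_cl`, `accuracy`) and the toy scoring logic are hypothetical stand-ins; the key point shown is that hyperparameters are selected on one dataset (D_HT) and the chosen configuration is then evaluated on a disjoint dataset (D_E).

```python
import itertools

def train_cl(train_tasks, hparams):
    """Stand-in for running a CL algorithm over a task sequence.

    Returns a toy score so the sketch is runnable; a real
    implementation would return a trained model.
    """
    return sum(len(t) for t in train_tasks) * hparams["lr"]

def accuracy(model, eval_tasks):
    """Stand-in metric: a score meant to mimic average accuracy
    over all tasks seen so far."""
    return model / (1.0 + len(eval_tasks))

def two_phase_protocol(d_ht, d_e, search_space):
    """Phase 1: select hparams on D_HT; Phase 2: evaluate on D_E."""
    assert d_ht is not d_e, "tuning and evaluation datasets must be distinct"
    # Phase 1 (hyperparameter tuning): grid search over the predefined
    # set, scoring each configuration on the validation split of D_HT.
    best_hp = max(
        (dict(zip(search_space, vals))
         for vals in itertools.product(*search_space.values())),
        key=lambda hp: accuracy(train_cl(d_ht["train"], hp), d_ht["val"]),
    )
    # Phase 2 (evaluation): retrain on D_E with the selected
    # hyperparameters and report test performance.
    model = train_cl(d_e["train"], best_hp)
    return best_hp, accuracy(model, d_e["test"])
```

The `assert` mirrors the paper's requirement that both phases share the same scenario configuration while using distinct datasets, which is what distinguishes this protocol from the conventional practice of tuning and evaluating on the same benchmark.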