Continual Learning in Open-vocabulary Classification with Complementary Memory Systems

Authors: Zhen Zhu, Weijie Lyu, Yao Xiao, Derek Hoiem

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test in data incremental, class incremental, and task incremental settings, as well as the ability to perform flexible inference on varying subsets of zero-shot and learned categories. Our proposed method achieves a good balance of learning speed, target task effectiveness, and zero-shot effectiveness. Code is available at https://github.com/jessemelpolio/TreeProbe. We evaluate our system on target and zero-shot classification tasks.
Researcher Affiliation | Academia | Zhen Zhu (EMAIL), Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign; Weijie Lyu (EMAIL), University of California, Merced; Yao Xiao (EMAIL), Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign; Derek Hoiem (EMAIL), Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign
Pseudocode | Yes | S-1 Algorithmic descriptions of TreeProbe

Algorithm 1 Training Procedure of TreeProbe
Require: Training set X, Tree T, leaf capacity ψ
Ensure: Trained classifiers in each leaf node of T
1: for all v_i ∈ X do
2:   l = NearestLeaf(v_i, T)
3:   if Count(l) < ψ then
4:     l = InsertData(v_i, l)
5:     TrainClassifier(l)
6:   else
7:     SplitNode(l, v_i)
8:     l = NearestLeaf(v_i, T)
9:     l = InsertData(v_i, l)
10:    TrainClassifier(l)
11:  end if
12: end for

Algorithm 2 Inference Procedure of TreeProbe
Require: Image embedding v_I, Tree T, number of nearest nodes k, exemplar set M
Ensure: Exemplar embedding v_e for v_I
1: v_K = FindNearestSamples(v_I, M)
2: v = list()
3: for all v_i ∈ v_K do
4:   l = NearestLeaf(v_i, T)
5:   c = GetClassifier(l)
6:   v.append(Classify(v_I, c))
7: end for
8: v_e = ComputeEmbedding(v, v_I)
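The pseudocode above can be sketched in runnable form. This is a minimal sketch under stated assumptions: the leaf split rule (a one-shot 2-means-style partition) and the per-leaf classifier (nearest class mean) are stand-ins, since the paper trains logistic-regression probes on CLIP embeddings and the quote does not specify the split or embedding-combination rules. Majority voting stands in for ComputeEmbedding.

```python
import numpy as np

# Minimal sketch of Algorithms 1-2. Split rule and leaf classifier are
# stand-in assumptions, not the paper's exact implementation.

class Leaf:
    def __init__(self):
        self.X, self.y = [], []      # embeddings and labels stored in this leaf
        self.means = {}              # per-class mean: the stand-in classifier

    def centroid(self):
        return np.mean(self.X, axis=0)

    def train(self):                 # TrainClassifier(l)
        X, y = np.array(self.X), np.array(self.y)
        self.means = {c: X[y == c].mean(axis=0) for c in np.unique(y)}

    def classify(self, v):           # Classify(v_I, c)
        return max(self.means, key=lambda c: float(v @ self.means[c]))

class TreeProbe:
    def __init__(self, capacity):
        self.capacity = capacity     # leaf capacity psi from Algorithm 1
        self.leaves = [Leaf()]

    def nearest_leaf(self, v):       # NearestLeaf(v, T), by centroid similarity
        filled = [l for l in self.leaves if l.X]
        if not filled:
            return self.leaves[0]
        return max(filled, key=lambda l: float(v @ l.centroid()))

    def insert(self, v, label):      # Algorithm 1 body for one sample
        leaf = self.nearest_leaf(v)
        if len(leaf.X) >= self.capacity:
            self._split(leaf)        # SplitNode(l, v)
            leaf = self.nearest_leaf(v)
        leaf.X.append(v)
        leaf.y.append(label)
        leaf.train()

    def _split(self, leaf):
        # Seed two children with the two least-aligned samples, then assign
        # every sample to the closer seed (a single 2-means step).
        X = np.array(leaf.X)
        a = X[0]
        b = X[int(np.argmin(X @ a))]
        kids = (Leaf(), Leaf())
        for v, lab in zip(leaf.X, leaf.y):
            child = kids[0] if float(v @ a) >= float(v @ b) else kids[1]
            child.X.append(v)
            child.y.append(lab)
        self.leaves.remove(leaf)
        for child in kids:
            if child.X:
                child.train()
                self.leaves.append(child)

    def predict(self, v, k=3):       # Algorithm 2, majority vote as the combiner
        pairs = [(x, leaf) for leaf in self.leaves for x in leaf.X]
        order = np.argsort([-float(x @ v) for x, _ in pairs])[:k]
        votes = [pairs[i][1].classify(v) for i in order]
        return max(set(votes), key=votes.count)
```

With a large capacity this degenerates to a single probe over all exemplars; the split path only activates once stored memory exceeds ψ, which is what keeps per-leaf retraining cheap as data accumulates.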
Open Source Code | Yes | Code is available at https://github.com/jessemelpolio/TreeProbe.
Open Datasets | Yes | We utilize general tasks such as ImageNet (Russakovsky et al., 2015), SUN397 (Xiao et al., 2010), CIFAR100 (Krizhevsky & Hinton, 2009), and fine-grained tasks like EuroSAT (Helber et al., 2019), Oxford-IIIT Pets (Parkhi et al., 2012), DTD (Cimpoi et al., 2014), Flower102 (Nilsback & Zisserman, 2008), FGVC-Aircraft (Maji et al., 2013), Stanford Cars (Krause et al., 2013), Food101 (Bossard et al., 2014), and UCF101 (Soomro et al., 2012).
Dataset Splits | Yes | Data incremental learning includes seven stages, comprising 2%, 4%, 8%, 16%, 32%, 64%, and 100% of task data, respectively. Class incremental learning divides a task into five stages, each containing 20% of classes. In task incremental learning, each task is considered a stage. In data and class incremental experiments, models are built separately for each target task. A target task is fully evaluated if there is at least one training sample for that task, even if there are no training samples for some classes. In task incremental learning, one model is built spanning all accumulated labels in each stage. In all cases, results are reported as the average accuracy of target and zero-shot tasks at each stage. We additionally compute the averaged accuracy on seen and unseen classes of each target task for class incremental learning to give a more detailed result analysis. An intuitive demonstration of these continual learning scenarios is shown in Fig. 3. SUN397 (Xiao et al., 2010) consists of scene images, containing 108,754 images across 397 scene categories, with each category having between 100 and 500 images. This dataset is commonly used for scene understanding tasks. Since there is no official dataset split for this dataset, we randomly select 60% of images as training data, 20% as validation data, and the rest as test data. We use NumPy random permutation to split with the seed set to 0. For EuroSAT (Helber et al., 2019) ... Since there is no official dataset split for this dataset, we randomly select 70% of images as training data and the rest as validation data. We use NumPy random permutation to perform splitting with the seed set to 0. For UCF101 (Soomro et al., 2012) ... Since there is no official dataset split for this dataset, we randomly select 70% of images as training data and the rest as validation data. We use NumPy random permutation to perform splitting with the seed set to 0.
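The quoted protocol (NumPy random permutation with seed 0; 60/20/20 for SUN397, 70/30 for EuroSAT and UCF101) can be reproduced along these lines. This is one plausible reading of the quote, since the authors' exact call order is not shown; `split_indices` is a hypothetical helper name.

```python
import numpy as np

def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Seeded random-permutation split into train/val/test index arrays."""
    perm = np.random.RandomState(seed).permutation(n)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

# SUN397: 108,754 images, 60/20/20 split with seed 0 (counts from the quote).
train, val, test = split_indices(108754)

# EuroSAT/UCF101-style 70/30 train/val split; 1000 is a placeholder count.
tr, va, _ = split_indices(1000, fractions=(0.7, 0.3, 0.0))
```

Because the permutation is seeded, the same call always yields the same partition, which is what makes the quoted splits reproducible without a published index file.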
Hardware Specification | Yes | We conduct our experiments on a setup featuring an RTX 3090 GPU and an AMD Ryzen 9 5950X CPU, using PyTorch as our primary framework.
Software Dependencies | No | We conduct our experiments on a setup featuring an RTX 3090 GPU and an AMD Ryzen 9 5950X CPU, using PyTorch as our primary framework. We adhere to the CLIP code example, using sklearn LogisticRegression to implement linear classifiers and setting the sklearn regularization strength to 0.316. The maximum iteration count is set to 5k. Our TreeProbe's node capacity is set at 50k. For efficient retrieval from large-scale exemplar sets, we use FAISS (Johnson et al., 2019), specifically the IndexFlatIP class for its precision and performance.
Experiment Setup | Yes | We adhere to the CLIP code example, using sklearn LogisticRegression to implement linear classifiers and setting the sklearn regularization strength to 0.316. The maximum iteration count is set to 5k. Our TreeProbe's node capacity is set at 50k. For efficient retrieval from large-scale exemplar sets, we use FAISS (Johnson et al., 2019), specifically the IndexFlatIP class for its precision and performance. Model performance is gauged via Top-1 accuracy, with the officially released ViT-B/32 CLIP checkpoint serving as our memory or zero-shot model. We select k = 9 based on a hyperparameter sweep. Our approach is not sensitive to k, with very similar performance in a range from 6 to 30. We choose the learning rate for both methods as the most frequently used learning rate in ZSCL experiments, i.e., 1e-5. For each task, following ZSCL, we warm up training for 100 iterations and proceed to train for 1K iterations. For the weight ensemble technique used in ZSCL, we also use an update interval of 100 iterations. Batch size is kept at 64 for both methods, with the AdamW (Loshchilov & Hutter, 2019) optimizer and beta set to 0.9. The base network backbone we use for both methods is CLIP ViT-B/32, identical to our approach. After sweeping for a set of hyperparameters that leads to a smaller number of training epochs while maintaining good performance on a held-out classification dataset (Caltech101), we set the learning rate to 0.005, the optimizer to SGD, weight decay to 0.1, and use 20 epochs for training each stage. We use the cross-entropy loss for training the probes.
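The quoted probe recipe maps directly onto scikit-learn, where the regularization strength 0.316 is the `C` parameter (inverse regularization), as in the CLIP repository's linear-probe example. A sketch with stand-in random features (not real CLIP embeddings) might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Linear probe configured per the quote: regularization strength C=0.316 and a
# 5k iteration cap. The features below are random stand-ins; in the paper they
# would be frozen CLIP ViT-B/32 image embeddings.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32)).astype(np.float32)
y = (X[:, 0] > 0).astype(int)          # toy labels, separable on one axis

probe = LogisticRegression(C=0.316, max_iter=5000)
probe.fit(X, y)
train_acc = probe.score(X, y)          # Top-1 accuracy on the toy data
```

In the paper's setting, one such probe is trained per TreeProbe leaf on the exemplars that leaf holds, which bounds each refit to at most the 50k-sample node capacity.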