TSVD: Bridging Theory and Practice in Continual Learning with Pre-trained Models
Authors: Liangzu Peng, Juan Elenter, Joshua Agterberg, Alejandro Ribeiro, René Vidal
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we aim to bridge this gap between theory and practice by designing a simple CL method that is theoretically sound and highly performant. Specifically, we lift pre-trained features into a higher dimensional space and formulate an over-parametrized minimum-norm least-squares problem. We find that the lifted features are highly ill-conditioned, potentially leading to large training errors (numerical instability) and increased generalization errors. We address these challenges by continually truncating the singular value decomposition of the lifted features. Our approach, termed LoRanPAC, is stable with respect to the choice of hyperparameters, can handle hundreds of tasks, and outperforms state-of-the-art CL methods on multiple datasets. |
| Researcher Affiliation | Collaboration | Liangzu Peng (University of Pennsylvania) EMAIL; Juan Elenter Litwin (Spotify) EMAIL; Joshua Agterberg (University of Illinois Urbana-Champaign) EMAIL; Alejandro Ribeiro (University of Pennsylvania) EMAIL; René Vidal (University of Pennsylvania) EMAIL |
| Pseudocode | Yes | Algorithm 1: Continual Solver of LoRanPAC (detailed version in Algorithm 4, Appendix C) |
| Open Source Code | Yes | Code available: https://github.com/liangzu/loranpac. |
| Open Datasets | Yes | We run CIL experiments with B-q1, Inc-q2 on continual learning versions of the following datasets: CIFAR100 (Krizhevsky et al., 2009), ImageNet-R (Hendrycks et al., 2021a), ImageNet-A (Hendrycks et al., 2021b), CUB-200 (Wah et al., 2011), ObjectNet (Barbu et al., 2019), OmniBenchmark (Zhang et al., 2022), VTAB (Zhai et al., 2019), and Stanford Cars (Krause et al., 2013). |
| Dataset Splits | Yes | Given m_t images of task t, we feed them to pre-trained ViTs, obtaining the output features X_t ∈ ℝ^{d×m_t}. Here, d is the feature dimension (d = 768 in the ViTs used). Corresponding to X_t is the label matrix Y_t ∈ ℝ^{c_t×m_t}. Every column of Y_t is a one-hot vector, that is, some standard basis vector in ℝ^{c_t}, where c_t is the total number of classes observed so far. We thus have c_1 ≤ ⋯ ≤ c_i ≤ ⋯ ≤ c_t. Let M_t := m_1 + ⋯ + m_t. While Y_i ∈ ℝ^{c_i×m_i} might have a different number of rows as c_i varies, one can pad c_t − c_i zero rows to Y_i when new class information is revealed; so, with a slight abuse of notation, Y_i is viewed as having c_t rows. We denote by Y_{1:t} the label matrix of the first t tasks: Y_{1:t} = [Y_1, …, Y_t] ∈ ℝ^{c_t×M_t}. ... For example, in Table 7, CIFAR100 has 50,000 (Training Set Size) and 10,000 (Test Set Size). |
| Hardware Specification | No | The major cost is the SVD of (12), which takes O((k_{t−1} + m_t)^3) time. While in principle the QR orthogonalization for the post-processing of Ũ_{1:t} takes O(E(k_{t−1} + m_t)^2) time, it is significantly faster than the SVD because the constants hidden in its O(·) are very small. Therefore, one would expect the SVD on the matrix of (12), which takes O((k_{t−1} + m_t)^3) time, to be much faster than the SVD on the matrix [Ũ_{1:t−1} Σ̃_{1:t−1}, H_t], which needs O(E(k_{t−1} + m_t)^2) time, where E is far larger than k_{t−1} + m_t (e.g., E = 10^5 and k_{t−1} + m_t = 10^4). This is true on a sequential machine, but in our experience the running-time difference is not significant for highly parallel GPU implementations (e.g., computing the inner product between two E-dimensional vectors takes a similar time to computing the inner product between two (k_{t−1} + m_t)-dimensional vectors, due to parallelism). |
| Software Dependencies | No | In more detail, for every task t and every candidate choice of λ, RanPAC maintains the covariances H_{1:t} H_{1:t}^⊤ and Y_{1:t} H_{1:t}^⊤ to solve the normal equations W (H_{1:t} H_{1:t}^⊤ + λI_E) = Y_{1:t} H_{1:t}^⊤ in the variable W using off-the-shelf solvers implemented in PyTorch, which in general takes O(E^3) time. |
| Experiment Setup | Yes | We use vision transformers (ViTs) of Dosovitskiy et al. (2021) as pre-trained models. ... we use E = 10^5 unless otherwise specified ... Given ζ, we set the number k_t of top SVD factors preserved at task t to k_t = (1 − ζ)·min{E, M_t}. ... For joint linear classifiers, that is LC(X_{1:T}) or LC(H_{1:T}), we train for 20 epochs using the cross-entropy loss, batch size 48, weight decay 0.0005, and SGD with the cosine annealing schedule. We run LC(X_{1:T}) and LC(H_{1:T}) with different initial learning rates {0.001, 0.005, 0.01, 0.02, 0.03} ... The main hyperparameters of RanPAC used for first-session adaptation are as follows: {'tuned_epoch': 20, 'init_lr': 0.01, 'batch_size': 48, 'weight_decay': 0.0005} ... Table 4 (excerpt): for CIFAR100, truncation percentage ζ = 25%, embedding dimension E = 10^5, maximum allowable rank r_max = 10000. |
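The core recipe summarized in the abstract (lift pre-trained features, then fit a minimum-norm least-squares classifier on a truncated SVD of the lifted features) can be sketched in a few lines of NumPy. This is a single-task toy illustration under assumed details: the random ReLU lifting follows the RanPAC style, the toy sizes are far smaller than the paper's d = 768 and E = 10^5, and the function names `lift` and `tsvd_min_norm_classifier` are ours, not from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def lift(X, W_rand):
    """Lift backbone features with a fixed random ReLU projection (assumed RanPAC-style)."""
    return np.maximum(W_rand @ X, 0.0)

def tsvd_min_norm_classifier(H, Y, k):
    """Minimum-norm least-squares classifier for W H ≈ Y, restricted to the
    top-k singular subspace of H: W = Y V_k diag(1/s_k) U_k^T."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k]
    return (Y @ Vt_k.T) @ np.diag(1.0 / s_k) @ U_k.T

# Toy sizes (the paper uses d = 768 and E = 10^5).
d, E, m, c = 16, 64, 40, 5
X = rng.standard_normal((d, m))           # pre-trained features, one column per image
W_rand = rng.standard_normal((E, d))      # fixed random lifting matrix
Y = np.eye(c)[rng.integers(0, c, m)].T    # one-hot labels, shape (c, m)

H = lift(X, W_rand)                       # lifted features, shape (E, m)
W = tsvd_min_norm_classifier(H, Y, k=30)  # truncation drops the smallest singular values
pred = np.argmax(W @ H, axis=0)           # predicted class per image
```

Truncation is the stabilizing step: the lifted features are ill-conditioned, and dropping the smallest singular values avoids dividing by near-zero `s_k` entries.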
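For contrast, the RanPAC-style baseline described in the Software Dependencies row solves the ridge normal equations W (H H^⊤ + λI_E) = Y H^⊤ directly, at O(E^3) cost per candidate λ. A minimal sketch with assumed toy sizes (in the paper's setting E = 10^5, which is what makes this solve expensive):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes; lam plays the role of the ridge parameter λ.
E, m, c, lam = 64, 40, 5, 1e-2
H = np.maximum(rng.standard_normal((E, m)), 0.0)  # lifted features
Y = np.eye(c)[rng.integers(0, c, m)].T            # one-hot labels, shape (c, m)

# Normal equations W (H H^T + λ I_E) = Y H^T. The Gram matrix G is symmetric
# positive definite (λ > 0), so we solve G W^T = H Y^T and transpose back.
G = H @ H.T + lam * np.eye(E)
W = np.linalg.solve(G, H @ Y.T).T                 # shape (c, E)
```

The O(E^3) factor comes from factorizing the E×E matrix `G`; the truncated-SVD route above works with (k_{t−1} + m_t)-sized factors instead.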