Efficient Few-Shot Continual Learning in Vision-Language Models
Authors: Aristeidis Panos, Rahaf Aljundi, Daniel Olmeda Reino, Richard E. Turner
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that updating the image encoder is essential for improving the performance of the VLM that relies on it. More importantly, this approach is computationally efficient, since the image encoder has significantly fewer parameters than the language model, especially when updated separately. We conduct a series of experiments under three few-shot continual learning (FSCL) settings (CL-5, CL-20, and CL-50 shots) to thoroughly investigate the performance of LoRSU on ten VQA datasets. |
| Researcher Affiliation | Collaboration | Aristeidis Panos (University of Cambridge); Rahaf Aljundi (Toyota Motor Europe); Daniel Olmeda Reino (Toyota Motor Europe); Richard E. Turner (University of Cambridge) |
| Pseudocode | No | The paper describes the proposed method mathematically using equations (1) through (5) and explains the process in paragraph form. However, it does not include a dedicated, clearly labeled pseudocode or algorithm block with structured steps. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We introduce two novel datasets, TSI and DALLE, created to expose the limitations of pre-trained image encoders in VLMs: TSI (Das et al., 2019), a classification dataset of 10K training and 5K test images spanning 27 activity classes, and DALLE, generated by querying DALL-E 2, with 660 images from 22 of TSI's activity classes. We also use VSR (Liu et al., 2023), HM (Kiela et al., 2020), MMVP (Tong et al., 2024), VisOnly (Kamoi et al., 2024), GTS (Stallkamp et al., 2012), CAn (Wang et al., 2024b), AIR (Maji et al., 2013), and ESAT (Helber et al., 2019). |
| Dataset Splits | Yes | For FSCL, we split each dataset into 5 sets of disjoint classes/categories and use 5/20/50-shot settings for model fine-tuning; the splits are detailed in Appendix C. For example, the 43 classes of GTS (Stallkamp et al., 2012) are split as follows: Session 1: [25, 2, 11, 1, 40, 27, 5, 9, 17]; Session 2: [32, 29, 20, 39, 21, 15, 23, 10, 3]; Session 3: [18, 38, 42, 14, 22, 35, 34, 19, 33]; Session 4: [12, 26, 41, 0, 37, 6, 13, 24]; Session 5: [30, 28, 31, 7, 16, 4, 36, 8]. |
| Hardware Specification | Yes | All the experiments are conducted on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper states: 'We use PyTorch (Paszke et al., 2019) to implement all the algorithms.' but does not provide a specific version number for PyTorch. It also mentions 'Adam (Kingma, 2014)' and 'AdamW (Loshchilov, 2017)' as optimizers, but these are optimization algorithms, not software libraries with specific version numbers. |
| Experiment Setup | Yes | We set the learning rate to 1 × 10^−5 and 2 × 10^−5 for LoRSU and LoRSU-Ppl, respectively. We set the batch size to 16 for all methods that fine-tune the vision encoder through the CLIP loss, and reduce it to 8 for methods that fine-tune the vision encoder through the perplexity loss or that fine-tune the LLM. All methods run for 20, 15, and 10 epochs for the CL-5, CL-20, and CL-50 settings, respectively. For LoRA (-Ppl), we set rank r = 64, while LoRA-L and LoRA-F use r = 8 in all experiments. For AdaLoRA, we set the initial rank to 70 and the final average rank to 64. For SPU, we use sparsity = 15% for all experiments. For LoRSU (-Ppl) we use sparsity = 10%, rank = 64, and pick the top-2 attention heads for all experiments. |
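The splitting protocol quoted in the Dataset Splits row (5 sessions of disjoint classes, with 5/20/50 shots sampled per class) can be sketched as follows. This is a minimal illustration, not the authors' code (none is released per the Open Source Code row); the function name, the random shuffling of class order, and the seed are all assumptions.

```python
import random

def make_fscl_sessions(labels, num_sessions=5, shots=5, seed=0):
    """Partition classes into disjoint sessions and sample `shots` examples per class.

    labels: list of class ids, one per training example (index = example id).
    Returns (sessions, per_session_indices).
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    rng.shuffle(classes)  # assumed: session membership chosen at random
    # Split the shuffled classes into num_sessions nearly equal chunks,
    # e.g. GTS's 43 classes become sessions of sizes 9/9/9/8/8.
    base, extra = divmod(len(classes), num_sessions)
    sessions, start = [], 0
    for s in range(num_sessions):
        size = base + (1 if s < extra else 0)
        sessions.append(classes[start:start + size])
        start += size
    # Within each session, sample `shots` training examples for every class.
    per_session_indices = []
    for sess in sessions:
        idxs = []
        for c in sess:
            pool = [i for i, y in enumerate(labels) if y == c]
            idxs.extend(rng.sample(pool, min(shots, len(pool))))
        per_session_indices.append(idxs)
    return sessions, per_session_indices
```

Varying `shots` over {5, 20, 50} reproduces the three CL-5/CL-20/CL-50 settings described in the report.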
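The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration table, which makes the per-method differences easier to scan. The dictionary layout and key names below are assumptions for illustration; the values are taken from the report.

```python
# Per-method fine-tuning hyperparameters (key names are hypothetical).
CONFIGS = {
    "LoRSU":     {"lr": 1e-5, "batch_size": 16, "rank": 64, "sparsity": 0.10, "top_heads": 2},
    "LoRSU-Ppl": {"lr": 2e-5, "batch_size": 8,  "rank": 64, "sparsity": 0.10, "top_heads": 2},
    "LoRA":      {"rank": 64},
    "LoRA-L":    {"rank": 8},
    "LoRA-F":    {"rank": 8},
    "AdaLoRA":   {"init_rank": 70, "final_avg_rank": 64},
    "SPU":       {"sparsity": 0.15},
}

# Epochs per few-shot continual learning setting (shared by all methods).
EPOCHS = {"CL-5": 20, "CL-20": 15, "CL-50": 10}
```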