Efficient Few-Shot Continual Learning in Vision-Language Models
Authors: Aristeidis Panos, Rahaf Aljundi, Daniel Olmeda Reino, Richard E. Turner
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that updating the image encoder is essential for improving the performance of the VLM that relies on it. More importantly, this approach is computationally efficient, since the image encoder has significantly fewer parameters than the language model, especially when updated separately. We conduct a series of experiments under three few-shot continual learning (FSCL) settings (CL-5, CL-20, and CL-50 shots) to thoroughly investigate the performance of LoRSU on ten VQA datasets. |
| Researcher Affiliation | Collaboration | Aristeidis Panos (University of Cambridge); Rahaf Aljundi (Toyota Motor Europe); Daniel Olmeda Reino (Toyota Motor Europe); Richard E. Turner (University of Cambridge) |
| Pseudocode | No | The paper describes the proposed method mathematically using equations (1) through (5) and explains the process in paragraph form. However, it does not include a dedicated, clearly labeled pseudocode or algorithm block with structured steps. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We introduce two novel datasets, TSI and DALLE, created to expose the limitations of pre-trained image encoders in VLMs: TSI (Das et al., 2019), a classification dataset of 10K training and 5K test images spanning 27 activity classes, and DALLE, generated by querying DALL-E 2, with 660 images from 22 of TSI's activity classes. We also use VSR (Liu et al., 2023), HM (Kiela et al., 2020), MMVP (Tong et al., 2024), VisOnly (Kamoi et al., 2024), GTS (Stallkamp et al., 2012), CAn (Wang et al., 2024b), AIR (Maji et al., 2013), and ESAT (Helber et al., 2019). |
| Dataset Splits | Yes | For FSCL, we split each dataset into 5 sets of disjoint classes/categories and use 5/20/50-shot settings for model fine-tuning; the splits are detailed in Appendix C. For example, the 43 classes of GTS (Stallkamp et al., 2012) are split as follows: Session 1: [25, 2, 11, 1, 40, 27, 5, 9, 17]; Session 2: [32, 29, 20, 39, 21, 15, 23, 10, 3]; Session 3: [18, 38, 42, 14, 22, 35, 34, 19, 33]; Session 4: [12, 26, 41, 0, 37, 6, 13, 24]; Session 5: [30, 28, 31, 7, 16, 4, 36, 8]. |
| Hardware Specification | Yes | All the experiments are conducted on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper states: 'We use PyTorch (Paszke et al., 2019) to implement all the algorithms.' but does not provide a specific version number for PyTorch. It also mentions 'Adam (Kingma, 2014)' and 'AdamW (Loshchilov, 2017)' as optimizers, but these are optimization algorithms, not software libraries with specific version numbers. |
| Experiment Setup | Yes | We set the learning rate to 1 × 10^−5 and 2 × 10^−5 for LoRSU and LoRSU-Ppl, respectively. We set the batch size to 16 for all methods that fine-tune the vision encoder through the CLIP loss, and reduce it to 8 for methods that fine-tune the vision encoder through the perplexity loss or that fine-tune the LLM. All methods run for 20, 15, and 10 epochs for the CL-5, CL-20, and CL-50 settings, respectively. For LoRA (-Ppl), we set rank r = 64, while LoRA-L and LoRA-F use r = 8 in all experiments. For AdaLoRA, we set the initial rank to 70 and the final average rank to 64. For SPU, we use sparsity = 15% for all experiments. For LoRSU (-Ppl) we use sparsity = 10%, rank = 64, and pick the top-2 attention heads for all experiments. |
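The splitting protocol quoted in the Dataset Splits row (5 sessions of disjoint classes, with 5/20/50 shots sampled per class) can be sketched as follows. This is a minimal illustration, not the authors' code (none is released per the Open Source Code row); the function name, the random shuffling of class order, and the seed are all assumptions.

```python
import random

def make_fscl_sessions(labels, num_sessions=5, shots=5, seed=0):
    """Partition classes into disjoint sessions and sample `shots` examples per class.

    labels: list of class ids, one per training example (index = example id).
    Returns (sessions, per_session_indices).
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    rng.shuffle(classes)  # assumed: session membership chosen at random
    # Split the shuffled classes into num_sessions nearly equal chunks,
    # e.g. GTS's 43 classes become sessions of sizes 9/9/9/8/8.
    base, extra = divmod(len(classes), num_sessions)
    sessions, start = [], 0
    for s in range(num_sessions):
        size = base + (1 if s < extra else 0)
        sessions.append(classes[start:start + size])
        start += size
    # Within each session, sample `shots` training examples for every class.
    per_session_indices = []
    for sess in sessions:
        idxs = []
        for c in sess:
            pool = [i for i, y in enumerate(labels) if y == c]
            idxs.extend(rng.sample(pool, min(shots, len(pool))))
        per_session_indices.append(idxs)
    return sessions, per_session_indices
```

Varying `shots` over {5, 20, 50} reproduces the three CL-5/CL-20/CL-50 settings described in the report.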
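The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration table, which makes the per-method differences easier to scan. The dictionary layout and key names below are assumptions for illustration; the values are taken from the report.

```python
# Per-method fine-tuning hyperparameters (key names are hypothetical).
CONFIGS = {
    "LoRSU":     {"lr": 1e-5, "batch_size": 16, "rank": 64, "sparsity": 0.10, "top_heads": 2},
    "LoRSU-Ppl": {"lr": 2e-5, "batch_size": 8,  "rank": 64, "sparsity": 0.10, "top_heads": 2},
    "LoRA":      {"rank": 64},
    "LoRA-L":    {"rank": 8},
    "LoRA-F":    {"rank": 8},
    "AdaLoRA":   {"init_rank": 70, "final_avg_rank": 64},
    "SPU":       {"sparsity": 0.15},
}

# Epochs per few-shot continual learning setting (shared by all methods).
EPOCHS = {"CL-5": 20, "CL-20": 15, "CL-50": 10}
```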