C-CLIP: Multimodal Continual Learning for Vision-Language Model
Authors: Wenzhuo Liu, Fei Zhu, Longhui Wei, Qi Tian
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments demonstrate that our method has strong continual learning ability across diverse image-text datasets, maintaining zero-shot prediction capabilities with minimal forgetting and significantly outperforming existing methods. |
| Researcher Affiliation | Collaboration | Wenzhuo Liu (1,2), Fei Zhu (3), Longhui Wei (4), Qi Tian (4) — (1) School of Artificial Intelligence, UCAS; (2) State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; (3) Centre for Artificial Intelligence and Robotics, HKISI-CAS; (4) Huawei Inc. |
| Pseudocode | No | The paper describes the proposed method C-CLIP using textual explanations, mathematical equations (Eq. 1, 2, 3, 4), and diagrams (Figure 2, Figure 4), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | Code available at https://github.com/SmallPigPeppa/C-CLIP |
| Open Datasets | Yes | Eight image-caption datasets are used in this track. Among them, Flickr30K (Plummer et al., 2015) and COCO (Chen et al., 2015) are general real-world datasets. Other datasets, including Pets (Parkhi et al., 2012), Lexica (Shen et al., 2024), Simpsons, WikiArt (Saleh & Elgammal, 2015), Kream, and Sketch (Chowdhury et al., 2022)... One held-out image-caption dataset, i.e., HAVG (Abdulmumin et al., 2022)... ImageNet (Deng et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), Stanford Cars (Krause et al., 2013), Flowers (Nilsback & Zisserman, 2008), DTD (Cimpoi et al., 2014), and Food101 (Bossard et al., 2014). |
| Dataset Splits | Yes | Some image caption datasets have predefined splits, such as Flickr30K and COCO-caption, with test sets of 1K and 5K, respectively. For Pet, Lexica, and Hausa VG, we evaluate their test sets. For other datasets like Simpsons, Sketch, and Wikiart, we randomly split 80% for training and 20% for testing. For Kream, the training and test sets are evenly divided. |
| Hardware Specification | Yes | C-CLIP is implemented in PyTorch Lightning and trained on 8 NVIDIA 4090 GPUs with a batch size of 1024. |
| Software Dependencies | No | C-CLIP is implemented in PyTorch Lightning and trained on 8 NVIDIA 4090 GPUs with a batch size of 1024. (Only mentions "PyTorch Lightning" without a specific version number, and no other software dependencies with versions are provided.) |
| Experiment Setup | Yes | trained for 40 epochs on each dataset. The initial learning rate is set to 1×10⁻⁶ with a 5-epoch warm-up using a cosine-decay learning rate scheduler. The low-rank decomposition (R) of LoRA is set to 16, with a scaling factor of 2R and dropout of 0.1. We use the AdamW optimizer with β1 = 0.9, β2 = 0.99, and a weight decay of 0.2. Learning rates are adjusted per dataset; for example, on COCO-caption (Chen et al., 2015), the image encoder's learning rate is 5×10⁻⁷, and the text encoder's is 4×10⁻⁵. During training, all images are resized to 224x224, and the maximum text length is set to 77. For COCO-caption, we set the learning rate for the text encoder to 80 times that of the image encoder, while for other datasets, it was set to 10 times. The base learning rate was 5e-7 for COCO-caption, 1e-5 for Flickr30K, and 3e-5 for other datasets, from Pet to Sketch. |
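
The learning-rate schedule quoted above (5-epoch linear warm-up to a peak of 1×10⁻⁶, then cosine decay over 40 total epochs) can be sketched as a small standalone function. This is a minimal illustration of the described schedule, not the authors' code; the function name `warmup_cosine_lr` and the decay-to-zero endpoint are assumptions.

```python
import math

def warmup_cosine_lr(epoch, peak_lr=1e-6, warmup_epochs=5, total_epochs=40):
    """Linear warm-up to peak_lr, then cosine decay toward zero.

    Hypothetical helper mirroring the setup quoted in the table
    (5-epoch warm-up, cosine decay, 40 epochs); the paper does not
    specify the exact warm-up shape or the final decayed value.
    """
    if epoch < warmup_epochs:
        # Linear ramp from 0 up to peak_lr over the warm-up epochs.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(warmup_cosine_lr(4))   # last warm-up epoch: reaches the 1e-6 peak
print(warmup_cosine_lr(20))  # partway through the cosine decay
```

Per-dataset peaks (e.g. 5e-7 for COCO-caption, 1e-5 for Flickr30K) would be passed via `peak_lr`; the reported 80× and 10× text-encoder multipliers would then scale this value per parameter group.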