C-CLIP: Multimodal Continual Learning for Vision-Language Model
Authors: Wenzhuo Liu, Fei Zhu, Longhui Wei, Qi Tian
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments demonstrate that our method has strong continual learning ability across diverse image-text datasets, maintaining zero-shot prediction capabilities with minimal forgetting and significantly outperforming existing methods. |
| Researcher Affiliation | Collaboration | Wenzhuo Liu (1,2), Fei Zhu (3), Longhui Wei (4), Qi Tian (4) — (1) School of Artificial Intelligence, UCAS; (2) State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; (3) Centre for Artificial Intelligence and Robotics, HKISI-CAS; (4) Huawei Inc. |
| Pseudocode | No | The paper describes the proposed method C-CLIP using textual explanations, mathematical equations (Eq. 1, 2, 3, 4), and diagrams (Figure 2, Figure 4), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | Code available at https://github.com/SmallPigPeppa/C-CLIP |
| Open Datasets | Yes | Eight image-caption datasets are used in this track. Among them, Flickr30K (Plummer et al., 2015) and COCO (Chen et al., 2015) are general real-world datasets. Other datasets, including Pets (Parkhi et al., 2012), Lexica (Shen et al., 2024), Simpsons, WikiArt (Saleh & Elgammal, 2015), Kream, and Sketch (Chowdhury et al., 2022)... One held-out image-caption dataset, i.e., HAVG (Abdulmumin et al., 2022)... ImageNet (Deng et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), Stanford Cars (Krause et al., 2013), Flowers (Nilsback & Zisserman, 2008), DTD (Cimpoi et al., 2014), and Food101 (Bossard et al., 2014). |
| Dataset Splits | Yes | Some image caption datasets have predefined splits, such as Flickr30K and COCO-caption, with test sets of 1K and 5K, respectively. For Pet, Lexica, and Hausa VG, we evaluate their test sets. For other datasets like Simpsons, Sketch, and Wikiart, we randomly split 80% for training and 20% for testing. For Kream, the training and test sets are evenly divided. |
| Hardware Specification | Yes | C-CLIP is implemented in PyTorch Lightning and trained on 8 NVIDIA 4090 GPUs with a batch size of 1024. |
| Software Dependencies | No | C-CLIP is implemented in PyTorch Lightning and trained on 8 NVIDIA 4090 GPUs with a batch size of 1024. (Only mentions "PyTorch Lightning" without a specific version number, and no other software dependencies with versions are provided.) |
| Experiment Setup | Yes | trained for 40 epochs on each dataset. The initial learning rate is set to 1×10⁻⁶ with a 5-epoch warm-up using a cosine-decay learning rate scheduler. The low-rank decomposition (R) of LoRA is set to 16, with a scaling factor of 2R and dropout of 0.1. We use the AdamW optimizer with β1 = 0.9, β2 = 0.99, and a weight decay of 0.2. Learning rates are adjusted per dataset; for example, on COCO-caption (Chen et al., 2015), the image encoder's learning rate is 5×10⁻⁷, and the text encoder's is 4×10⁻⁵. During training, all images are resized to 224x224, and the maximum text length is set to 77. For COCO-caption, we set the learning rate for the text encoder to 80 times that of the image encoder, while for other datasets, it was set to 10 times. The base learning rate was 5e-7 for COCO-caption, 1e-5 for Flickr30K, and 3e-5 for other datasets, from Pet to Sketch. |
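
The learning-rate schedule quoted above (5-epoch linear warm-up to a peak of 1×10⁻⁶, then cosine decay over 40 total epochs) can be sketched as a small standalone function. This is a minimal illustration of the described schedule, not the authors' code; the function name `warmup_cosine_lr` and the decay-to-zero endpoint are assumptions.

```python
import math

def warmup_cosine_lr(epoch, peak_lr=1e-6, warmup_epochs=5, total_epochs=40):
    """Linear warm-up to peak_lr, then cosine decay toward zero.

    Hypothetical helper mirroring the setup quoted in the table
    (5-epoch warm-up, cosine decay, 40 epochs); the paper does not
    specify the exact warm-up shape or the final decayed value.
    """
    if epoch < warmup_epochs:
        # Linear ramp from 0 up to peak_lr over the warm-up epochs.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(warmup_cosine_lr(4))   # last warm-up epoch: reaches the 1e-6 peak
print(warmup_cosine_lr(20))  # partway through the cosine decay
```

Per-dataset peaks (e.g. 5e-7 for COCO-caption, 1e-5 for Flickr30K) would be passed via `peak_lr`; the reported 80× and 10× text-encoder multipliers would then scale this value per parameter group.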