Semi-Supervised CLIP Adaptation by Enforcing Semantic and Trapezoidal Consistency
Authors: Kai Gan, Bo Ye, Min-Ling Zhang, Tong Wei
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our approach significantly improves the adaptability of CLIP in target tasks with limited labeled data, achieving gains ranging from 1.72% to 6.58% for zero-shot classification accuracy and 2.32% to 3.23% for image-text retrieval performance on standard benchmarks. The source code is available at https://github.com/Gank0078/SemiCLIP. |
| Researcher Affiliation | Academia | ¹School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; ²Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China |
| Pseudocode | No | The paper describes the methodology using textual descriptions and mathematical equations (e.g., Equation 1, 2, 3, 4, 5) but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code is available at https://github.com/Gank0078/SemiCLIP. |
| Open Datasets | Yes | We conduct extensive experiments on four publicly available datasets to evaluate the performance of SEMICLIP. Following the previous method S-CLIP (Mo et al., 2023), the datasets include remote sensing datasets (Yang & Newsam, 2010; Zhang et al., 2014; Lu et al., 2017), fashion datasets (Han et al., 2017; Rostamzadeh et al., 2018; Vasileva et al., 2018), the SciCap dataset (Hsu et al., 2021), and the Simpsons dataset (Attia, 2018; Adler, 2023). ...we also incorporate the RESISC45 dataset (Cheng et al., 2017) as unlabeled data (L = U). ...we conducted comparative experiments on the COCO (Lin et al., 2014) dataset. |
| Dataset Splits | Yes | Under the default setting, we randomly subsample 10% of the image-text pairs of the training dataset as labeled data, leaving the rest as unlabeled data. The models are evaluated on zero-shot classification and image-text retrieval tasks, with performance measured by Top-1 classification accuracy (%) and recall at k (R@k). ...We utilize the validation sets from the classification variants of the RSICD and UCM datasets, referred to as RSICD-CLS and UCM-CLS, respectively. ...We conduct comparative experiments on the COCO (Lin et al., 2014) dataset. The results in Tab. 16 indicate that SEMICLIP achieves significant performance improvements on general benchmarks over CLIP (fine-tuned) and S-CLIP. It is worth noting that S-CLIP's performance shows an average decrease of 4.5% compared to CLIP (fine-tuned), aligning with the claim (Mo et al., 2023) that S-CLIP experiences performance drops when trained on a small number of image-text pairs from common datasets like COCO. However, the superior performance of our proposed SEMICLIP is unaffected by the type of dataset, achieving significant improvements on both commonly used datasets and task-specific datasets. |
| Hardware Specification | Yes | All experiments are conducted on four NVIDIA A6000 GPUs with a batch size of 64 per GPU. |
| Software Dependencies | No | The paper mentions the use of NLTK (Bird et al., 2009) for concept extraction, AdamW (Loshchilov, 2017) as an optimizer, and the CLIP model (Ilharco et al., 2021) as the backbone, but it does not specify version numbers for general software dependencies like programming languages or libraries (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | We utilize ViT-B-16 as the default vision encoder in our experiments, and experiments with other vision encoders are shown in Appendix A.1. We train the model for 25 epochs during supervised pre-training and 15 epochs for semi-supervised fine-tuning. We employ AdamW (Loshchilov, 2017) with a weight decay of 5×10⁻⁴ and apply the default cosine learning rate schedule with warmup for the first 10 steps. The learning rate is set to 5×10⁻⁵ for the remote sensing and fashion datasets and 1×10⁻⁶ for the SciCap and Simpsons datasets. We establish default values of 30 for P and 4 for k. |
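The dataset split described above (randomly subsampling 10% of image-text pairs as labeled data, leaving the rest unlabeled) can be sketched as follows. This is a minimal illustration, not code from the SEMICLIP repository; the function and parameter names are hypothetical.

```python
import random

def split_labeled(pairs, labeled_frac=0.1, seed=0):
    """Randomly subsample a fraction of image-text pairs as labeled data,
    leaving the rest as unlabeled data (10% labeled by default, matching
    the paper's default setting)."""
    rng = random.Random(seed)
    idx = list(range(len(pairs)))
    rng.shuffle(idx)
    n_labeled = int(len(pairs) * labeled_frac)
    labeled = [pairs[i] for i in idx[:n_labeled]]
    unlabeled = [pairs[i] for i in idx[n_labeled:]]
    return labeled, unlabeled
```

With 100 training pairs and the default 10% setting, this yields 10 labeled and 90 unlabeled pairs.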
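The learning-rate schedule in the setup (cosine decay with warmup for the first 10 steps) can be sketched as a plain function. This is an illustrative reconstruction under common conventions (linear warmup, cosine decay to zero), not the authors' exact implementation; the names are hypothetical.

```python
import math

def lr_at_step(step, total_steps, base_lr=5e-5, warmup_steps=10):
    """Cosine learning-rate schedule with linear warmup.

    base_lr=5e-5 matches the rate reported for the remote sensing and
    fashion datasets; warmup_steps=10 matches the reported warmup length.
    """
    if step < warmup_steps:
        # Linear warmup from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to zero over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

In practice one would use an equivalent built-in scheduler from the training framework; this standalone version just makes the shape of the schedule explicit.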