DiffCLIP: Few-shot Language-driven Multimodal Classifier

Authors: Jiaqing Zhang, Mingxiang Cao, Xue Yang, Kai Jiang, Yunsong Li

AAAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We evaluate DiffCLIP on widely used high-dimensional multimodal datasets, demonstrating its effectiveness in addressing few-shot annotated classification tasks. DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP, while utilizing only 2-shot image-text pairs.
Researcher Affiliation | Academia | (1) The State Key Laboratory of Integrated Services Networks, Xidian University; (2) Shanghai AI Laboratory
Pseudocode | No | The paper describes the methodology using textual explanations and mathematical formulations in sections like "Unsupervised Mask Diffusion" and "Few-shot Language-Driven Classification" but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code: https://github.com/icey-zhang/DiffCLIP
Open Datasets | Yes | The experiments are conducted on four widely recognized benchmarks to assess the performance of our proposed method: Houston (Debes et al. 2014), Trento (Rasti, Ghamisi, and Gloaguen 2017), MUUFL (Gader et al. 2013) and MRNet dataset (Bien et al. 2018).
Dataset Splits | Yes | For fair comparison, we randomly sample 40 samples per class for training with labels, and the remaining samples for evaluation. ... using 10 samples of MRNet data to train and the rest to test.
Hardware Specification | Yes | The experiments are conducted on a system with an NVIDIA GeForce RTX A100 GPU.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages or libraries. It only mentions general tools like the Adam optimizer and ViT.
Experiment Setup | Yes | For optimization in both unsupervised and few-shot learning, the Adam optimizer is used with an initial learning rate of 1e-4 and weight decay of 1e-5. Two schedulers are employed: a cosine scheduler for unsupervised learning and a step scheduler for few-shot learning. The training consists of 100 epochs for unsupervised learning and 150 epochs for few-shot learning. To ensure optimal performance in comparative experiments, the batch size is set to 256 for unsupervised learning and 64 for few-shot learning, with consistent parameter settings across all datasets.
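The dataset-split protocol quoted above (40 labeled samples per class for training, the remainder for evaluation) can be sketched as a small helper. This is an illustrative reconstruction, not the authors' code; the function name and seed handling are our own assumptions.

```python
import random
from collections import defaultdict

def per_class_split(labels, n_train=40, seed=0):
    """Randomly sample n_train indices per class for training;
    all remaining indices go to the evaluation set.

    Mirrors the protocol described in the paper's experiment setup;
    the seed parameter is an assumption added for reproducibility.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        train.extend(idxs[:n_train])
        test.extend(idxs[n_train:])
    return train, test

# Example: 50 samples of class 0 and 45 of class 1 yield
# 80 training indices and 15 evaluation indices.
labels = [0] * 50 + [1] * 45
train_idx, test_idx = per_class_split(labels, n_train=40)
```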
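The optimization settings reported in the Experiment Setup row map directly onto standard PyTorch components. The sketch below wires up the stated Adam hyperparameters and the two schedulers; the placeholder model, the StepLR step size, and its decay factor are assumptions, since the paper does not report them.

```python
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR, StepLR

# Placeholder model standing in for DiffCLIP; the real architecture
# is available in the authors' repository.
model = nn.Linear(16, 4)

# Reported settings: Adam with initial lr 1e-4 and weight decay 1e-5.
optimizer = Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)

# Cosine scheduler over the 100 unsupervised-learning epochs.
unsup_scheduler = CosineAnnealingLR(optimizer, T_max=100)

# Step scheduler for the 150 few-shot epochs; step_size and gamma
# are assumptions -- the paper does not specify them.
fewshot_scheduler = StepLR(optimizer, step_size=50, gamma=0.1)
```

Batch sizes of 256 (unsupervised) and 64 (few-shot) would then be set on the corresponding `DataLoader`s.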