DiffCLIP: Differential Attention Meets CLIP

Authors: Hasan Abed Al Kader Hammoud, Bernard Ghanem

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models.
Researcher Affiliation | Academia | Hasan Abed Al Kader Hammoud EMAIL King Abdullah University of Science and Technology (KAUST); Bernard Ghanem EMAIL King Abdullah University of Science and Technology (KAUST)
Pseudocode | No | The paper describes the Transformer attention mechanism, differential attention, and CLIP training using mathematical formulations and conceptual explanations in Sections 3.1, 3.2, and 3.3, but does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code | Yes | Code and models can be found at https://github.com/hammoudhasan/DiffCLIP.
Open Datasets | Yes | We pretrain on Conceptual Captions 3M (CC3M) (Sharma et al., 2018) and Conceptual Captions 12M (CC12M) (Changpinyo et al., 2021). We measure zero-shot robustness on ImageNet (Russakovsky et al., 2015) and its variants (ImageNet-V2 (Recht et al., 2019), ImageNet-A (Hendrycks et al., 2021b), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-Sketch (Wang et al., 2019)). For retrieval (image-to-text and text-to-image) on Flickr8k (Rashtchian et al., 2010), Flickr30k (Young et al., 2014), and MSCOCO (Lin et al., 2014), we use the LAION CLIP Benchmark framework (Schuhmann et al., 2022). The LAION-CC-SBU dataset (558K image-text pairs) is used in the LLaVA training setup. For instruction fine-tuning, we adopted the COCO (Lin et al., 2014) subset (approximately 350K pairs) also used by LLaVA.
Dataset Splits | Yes | We follow established practices for linear probing and few-shot evaluation (El Banani et al., 2023) on nine image-classification datasets. For retrieval (image-to-text and text-to-image) on Flickr8k (Rashtchian et al., 2010), Flickr30k (Young et al., 2014), and MSCOCO (Lin et al., 2014), we use the LAION CLIP Benchmark framework (Schuhmann et al., 2022).
Hardware Specification | Yes | For CC3M, we train on four A100 GPUs, while CC12M uses eight A100 GPUs to reduce training time. All experiments were conducted using the TinyLLaVA repository on 4 A100-80GB GPUs.
Software Dependencies | No | The paper mentions downloading data using img2dataset (Beaumont, 2021) and using Qwen-2.5-Instruct-0.5B (Yang et al., 2024) as a language encoder, but does not specify version numbers for these or any other software libraries, programming languages, or frameworks used for implementation.
Experiment Setup | Yes | All models train for 40 epochs, using one epoch of linear warmup, a global batch size of 4096, and the AdamW optimizer (Loshchilov & Hutter, 2017). We set the base learning rate to 5×10⁻⁴ with weight decay of 0.5. For DiffCLIP, every attention layer in both the vision and text encoders is replaced with differential attention. We initialize each layer's λ at 0.8 unless stated otherwise. The hyperparameters for fine-tuning included a batch size of 48 samples per GPU, a learning rate of 2×10⁻⁵, zero weight decay, a warm-up ratio of 0.03, and cosine decay scheduling. The projection pretraining similarly employed 48 samples per GPU, a learning rate of 1×10⁻³, no weight decay, a warm-up ratio of 0.03, and cosine decay scheduling.
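The setup above says DiffCLIP replaces every attention layer with differential attention and initializes each layer's λ at 0.8. A minimal NumPy sketch of that attention map follows, assuming the standard formulation (the difference of two softmax maps scaled by λ); it is single-head and omits the learned λ re-parameterization and per-head normalization of the original Differential Transformer, and all array names are illustrative, not the authors' code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.8):
    """Differential attention: subtract a second, lambda-scaled softmax
    attention map from the first, then aggregate the values."""
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))  # first attention map, rows sum to 1
    a2 = softmax(q2 @ k2.T / np.sqrt(d))  # second ("noise") attention map
    return (a1 - lam * a2) @ v            # each row of the map sums to 1 - lam

# Toy example: 4 tokens, head dimension 8.
rng = np.random.default_rng(0)
n, d = 4, 8
q1, k1, q2, k2 = (rng.standard_normal((n, d)) for _ in range(4))
v = rng.standard_normal((n, d))
out = differential_attention(q1, k1, q2, k2, v, lam=0.8)  # shape (4, 8)
```

In DiffCLIP, per the paper's summary, this module replaces standard attention in both the vision and text encoders, with λ learnable and initialized to 0.8; the two query/key pairs would come from splitting each layer's projections, which this sketch simply draws at random.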