DiffCLIP: Differential Attention Meets CLIP

Authors: Hasan Abed Al Kader Hammoud, Bernard Ghanem

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models.
Researcher Affiliation | Academia | Hasan Abed Al Kader Hammoud EMAIL King Abdullah University of Science and Technology (KAUST); Bernard Ghanem EMAIL King Abdullah University of Science and Technology (KAUST)
Pseudocode | No | The paper describes the Transformer attention mechanism, differential attention, and CLIP training using mathematical formulations and conceptual explanations in Sections 3.1, 3.2, and 3.3, but does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code | Yes | Code and models can be found at https://github.com/hammoudhasan/DiffCLIP.
Open Datasets | Yes | We pretrain on Conceptual Captions 3M (CC3M) (Sharma et al., 2018) and Conceptual Captions 12M (CC12M) (Changpinyo et al., 2021). We measure zero-shot robustness on ImageNet (Russakovsky et al., 2015) and its variants (ImageNet-V2 (Recht et al., 2019), ImageNet-A (Hendrycks et al., 2021b), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-Sketch (Wang et al., 2019)). For retrieval (image-to-text and text-to-image) on Flickr8k (Rashtchian et al., 2010), Flickr30k (Young et al., 2014), and MSCOCO (Lin et al., 2014), we use the LAION CLIP Benchmark framework (Schuhmann et al., 2022). The LAION-CC-SBU dataset (558K image-text pairs) is used in the LLaVA training setup. For instruction fine-tuning, we adopted the COCO (Lin et al., 2014) subset (approximately 350K pairs) also used by LLaVA.
Dataset Splits | Yes | We follow established practices for linear probing and few-shot evaluation (El Banani et al., 2023) on nine image-classification datasets. For retrieval (image-to-text and text-to-image) on Flickr8k (Rashtchian et al., 2010), Flickr30k (Young et al., 2014), and MSCOCO (Lin et al., 2014), we use the LAION CLIP Benchmark framework (Schuhmann et al., 2022).
Hardware Specification | Yes | For CC3M, we train on four A100 GPUs, while CC12M uses eight A100 GPUs to reduce training time. All experiments were conducted using the TinyLLaVA repository on 4 A100-80GB GPUs.
Software Dependencies | No | The paper mentions downloading data using img2dataset (Beaumont, 2021) and using Qwen-2.5-Instruct-0.5B (Yang et al., 2024) as a language encoder, but does not specify version numbers for these or any other software libraries, programming languages, or frameworks used for implementation.
Experiment Setup | Yes | All models train for 40 epochs, using one epoch of linear warmup, a global batch size of 4096, and the AdamW optimizer (Loshchilov & Hutter, 2017). We set the base learning rate to 5×10⁻⁴ with weight decay of 0.5. For DiffCLIP, every attention layer in both the vision and text encoders is replaced with differential attention. We initialize each layer's λ at 0.8 unless stated otherwise. The hyperparameters for fine-tuning included a batch size of 48 samples per GPU, a learning rate of 2×10⁻⁵, zero weight decay, a warm-up ratio of 0.03, and cosine decay scheduling. The projection pretraining similarly employed 48 samples per GPU, a learning rate of 1×10⁻³, no weight decay, a warm-up ratio of 0.03, and cosine decay scheduling.
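The setup above says DiffCLIP replaces every attention layer with differential attention and initializes each layer's λ at 0.8. A minimal NumPy sketch of that attention map follows, assuming the standard formulation (the difference of two softmax maps scaled by λ); it is single-head and omits the learned λ re-parameterization and per-head normalization of the original Differential Transformer, and all array names are illustrative, not the authors' code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.8):
    """Differential attention: subtract a second, lambda-scaled softmax
    attention map from the first, then aggregate the values."""
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))  # first attention map, rows sum to 1
    a2 = softmax(q2 @ k2.T / np.sqrt(d))  # second ("noise") attention map
    return (a1 - lam * a2) @ v            # each row of the map sums to 1 - lam

# Toy example: 4 tokens, head dimension 8.
rng = np.random.default_rng(0)
n, d = 4, 8
q1, k1, q2, k2 = (rng.standard_normal((n, d)) for _ in range(4))
v = rng.standard_normal((n, d))
out = differential_attention(q1, k1, q2, k2, v, lam=0.8)  # shape (4, 8)
```

In DiffCLIP, per the paper's summary, this module replaces standard attention in both the vision and text encoders, with λ learnable and initialized to 0.8; the two query/key pairs would come from splitting each layer's projections, which this sketch simply draws at random.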