Mitigate the Gap: Improving Cross-Modal Alignment in CLIP

Authors: Sedigheh Eslami, Gerard de Melo

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experimental results support that the two aforementioned refinements yield substantial improvements in cross-modal alignment while also improving performance across a wide variety of downstream tasks. We conduct zero-shot classification experiments on ImageNet-1K (Russakovsky et al., 2015), CIFAR-100, CIFAR-10 (Krizhevsky et al., 2009), Flowers-102 (Nilsback & Zisserman, 2008), and Stanford Cars (Krause et al., 2013).
Researcher Affiliation | Academia | Sedigheh Eslami, Hasso Plattner Institute / University of Potsdam, Potsdam, Germany (EMAIL); Gerard de Melo, Hasso Plattner Institute / University of Potsdam, Potsdam, Germany (EMAIL)
Pseudocode | No | The paper describes methods and equations but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | The source code and model checkpoints for reproducing our experiments are available at https://github.com/sarahESL/AlignCLIP.
Open Datasets | Yes | We used the Conceptual 12M (CC12M) dataset (Changpinyo et al., 2021) for pre-training the models. We conduct zero-shot classification experiments on ImageNet-1K (Russakovsky et al., 2015), CIFAR-100, CIFAR-10 (Krizhevsky et al., 2009), Flowers-102 (Nilsback & Zisserman, 2008), and Stanford Cars (Krause et al., 2013). We use the ImageNet-V2, ImageNet-R, ImageNet-A, and ImageNet-Sketch datasets for these evaluations. In addition to classification, we evaluate SharedCLIP and AlignCLIP in the applications of zero-shot and fine-tuned image-to-text and text-to-image retrieval using the MSCOCO (Lin et al., 2014) and Flickr30K (Plummer et al., 2015) datasets.
Dataset Splits | Yes | We start by reporting and comparing the alignment scores when using the CLIP, SharedCLIP, and AlignCLIP models on the validation sets from CC3M and MSCOCO, as well as the ImageNet-1K, CIFAR-100, and CIFAR-10 test sets. For all datasets, we train the linear classifier layer with a batch size of 128, for 30 epochs, with AdamW optimization and a cosine scheduler with a starting learning rate of 5×10⁻⁴.
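As a rough illustration of the linear-probe schedule quoted above (AdamW, batch size 128, 30 epochs, cosine decay from 5×10⁻⁴), the learning-rate curve could be sketched as below. The steps-per-epoch count and the zero final learning rate are assumptions for illustration; the excerpt does not state them.

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 5e-4, final_lr: float = 0.0) -> float:
    """Cosine-decayed learning rate for the linear-probe training.

    base_lr matches the reported starting LR of 5e-4; final_lr = 0 is an
    assumption (the excerpt does not give the end value).
    """
    progress = step / max(1, total_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical training horizon: 30 epochs over an assumed dataset size.
steps_per_epoch = 1000  # assumption for illustration only
total_steps = 30 * steps_per_epoch

lr_start = cosine_lr(0, total_steps)           # 5e-4 at the first step
lr_end = cosine_lr(total_steps, total_steps)   # decays toward final_lr
```

In practice this shape is what e.g. PyTorch's `CosineAnnealingLR` produces; the sketch just makes the reported starting value explicit.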
Hardware Specification | Yes | Each model was trained using an NVIDIA H100 GPU with batch size 512 for 30 epochs.
Software Dependencies | No | The paper mentions the 'SBERT all-mpnet-base-v2 model' as a pre-trained semantic encoder and the 'OpenCLIP implementation', but does not provide specific version numbers for general software libraries or frameworks such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We adopted a transformer encoder consisting of 12 layers and 12 heads in CLIP, SharedCLIP, and AlignCLIP. The image patch size for encoding visual data is set to 16×16. When encoding texts, the maximum sequence length is set to 77 tokens and the vocabulary size for the embedding layer is set to 49,408. The output embedding dimensionality for both the vision and language modalities is set to 768. For all models, we used AdamW optimization with a starting learning rate of 1×10⁻³, a cosine scheduler, 10,000 warmup steps, and a weight decay of 0.1. The initial temperature value for all models was set to 0.07. Each model was trained using an NVIDIA H100 GPU with batch size 512 for 30 epochs. In AlignCLIP, we set α = 0.5. For all datasets, we train the linear classifier layer with a batch size of 128, for 30 epochs, with AdamW optimization and a cosine scheduler with a starting learning rate of 5×10⁻⁴. When fine-tuning, the batch size was set to 128 and the AdamW optimizer with learning rate 5×10⁻⁶ and a weight decay of 0.2 was used. For the fine-tuning experiments, each model was fine-tuned for 8 and 20 epochs on MSCOCO and Flickr30K, respectively.
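The initial temperature of 0.07 is the standard initialization of CLIP's symmetric contrastive (InfoNCE) objective. A minimal, dependency-free sketch of that loss is given below; it shows only the generic CLIP objective, not the AlignCLIP-specific terms (the exact form of the α-weighted refinement is not given in this excerpt), and the toy similarity values are invented for illustration.

```python
import math

def clip_contrastive_loss(sim: list[list[float]], temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch similarity matrix.

    sim[i][j] is the cosine similarity between image i and text j; the
    diagonal holds the matched pairs. temperature = 0.07 mirrors the
    reported initial value (in CLIP it is a learnable parameter).
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]

    def cross_entropy(rows: list[list[float]]) -> float:
        # Mean cross-entropy with the matched pair (index i) as the target.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions, as in CLIP.
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))

# Toy batch of 2: matched pairs (diagonal) are more similar than mismatches.
sim = [[0.9, 0.1],
       [0.2, 0.8]]
loss = clip_contrastive_loss(sim)  # small positive value for well-aligned pairs
```

Because the loss is averaged over both retrieval directions, a batch whose diagonal dominates yields a near-zero loss, while swapped pairings are heavily penalized.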