Mitigate the Gap: Improving Cross-Modal Alignment in CLIP

Authors: Sedigheh Eslami, Gerard de Melo

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experimental results support that the two aforementioned refinements yield substantial improvements in cross-modal alignment while also improving performance across a wide variety of downstream tasks. We conduct zero-shot classification experiments on ImageNet-1K (Russakovsky et al., 2015), CIFAR-100, CIFAR-10 (Krizhevsky et al., 2009), Flowers-102 (Nilsback & Zisserman, 2008), and Stanford Cars (Krause et al., 2013).
Researcher Affiliation | Academia | Sedigheh Eslami, Hasso Plattner Institute / University of Potsdam, Potsdam, Germany (EMAIL); Gerard de Melo, Hasso Plattner Institute / University of Potsdam, Potsdam, Germany (EMAIL)
Pseudocode | No | The paper describes methods and equations but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | The source code and model checkpoints for reproducing our experiments are available at https://github.com/sarahESL/AlignCLIP.
Open Datasets | Yes | We used the Conceptual 12M (CC12M) dataset (Changpinyo et al., 2021) for pre-training the models. We conduct zero-shot classification experiments on ImageNet-1K (Russakovsky et al., 2015), CIFAR-100, CIFAR-10 (Krizhevsky et al., 2009), Flowers-102 (Nilsback & Zisserman, 2008), and Stanford Cars (Krause et al., 2013). We use the ImageNet-V2, ImageNet-R, ImageNet-A, and ImageNet-Sketch datasets for these evaluations. In addition to classification, we evaluate SharedCLIP and AlignCLIP in the applications of zero-shot and fine-tuned image-to-text and text-to-image retrieval using the MSCOCO (Lin et al., 2014) and Flickr30K (Plummer et al., 2015) datasets.
Dataset Splits | Yes | We start by reporting and comparing the alignment scores when using the CLIP, SharedCLIP, and AlignCLIP models on the validation sets from CC3M and MSCOCO, as well as the ImageNet-1K, CIFAR-100, and CIFAR-10 test sets. For all datasets, we train the linear classifier layer with a batch size of 128, for 30 epochs, with AdamW optimization and a cosine scheduler with a starting learning rate of 5×10⁻⁴.
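As a rough illustration of the linear-probe schedule quoted above (AdamW, batch size 128, 30 epochs, cosine decay from 5×10⁻⁴), the learning-rate curve could be sketched as below. The steps-per-epoch count and the zero final learning rate are assumptions for illustration; the excerpt does not state them.

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 5e-4, final_lr: float = 0.0) -> float:
    """Cosine-decayed learning rate for the linear-probe training.

    base_lr matches the reported starting LR of 5e-4; final_lr = 0 is an
    assumption (the excerpt does not give the end value).
    """
    progress = step / max(1, total_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical training horizon: 30 epochs over an assumed dataset size.
steps_per_epoch = 1000  # assumption for illustration only
total_steps = 30 * steps_per_epoch

lr_start = cosine_lr(0, total_steps)           # 5e-4 at the first step
lr_end = cosine_lr(total_steps, total_steps)   # decays toward final_lr
```

In practice this shape is what e.g. PyTorch's `CosineAnnealingLR` produces; the sketch just makes the reported starting value explicit.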
Hardware Specification | Yes | Each model was trained using an NVIDIA H100 GPU with batch size 512 for 30 epochs.
Software Dependencies | No | The paper mentions the 'SBERT all-mpnet-base-v2 model' as a pre-trained semantic encoder and the 'OpenCLIP implementation', but does not provide specific version numbers for general software libraries or frameworks such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We adopted a transformer encoder consisting of 12 layers and 12 heads in CLIP, SharedCLIP, and AlignCLIP. The image patch size for encoding visual data is set to 16×16. When encoding texts, the maximum sequence length is set to 77 tokens and the vocabulary size for the embedding layer is set to 49,408. The output embedding dimensionality for both the vision and language modalities is set to 768. For all models, we used AdamW optimization with a starting learning rate of 1×10⁻³, a cosine scheduler, 10,000 warmup steps, and a weight decay of 0.1. The initial temperature value for all models was set to 0.07. Each model was trained using an NVIDIA H100 GPU with batch size 512 for 30 epochs. In AlignCLIP, we set α = 0.5. For all datasets, we train the linear classifier layer with a batch size of 128, for 30 epochs, with AdamW optimization and a cosine scheduler with a starting learning rate of 5×10⁻⁴. When fine-tuning, the batch size was set to 128 and the AdamW optimizer with learning rate 5×10⁻⁶ and a weight decay of 0.2 was used. For the fine-tuning experiments, each model was fine-tuned for 8 and 20 epochs on MSCOCO and Flickr30K, respectively.
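The initial temperature of 0.07 is the standard initialization of CLIP's symmetric contrastive (InfoNCE) objective. A minimal, dependency-free sketch of that loss is given below; it shows only the generic CLIP objective, not the AlignCLIP-specific terms (the exact form of the α-weighted refinement is not given in this excerpt), and the toy similarity values are invented for illustration.

```python
import math

def clip_contrastive_loss(sim: list[list[float]], temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch similarity matrix.

    sim[i][j] is the cosine similarity between image i and text j; the
    diagonal holds the matched pairs. temperature = 0.07 mirrors the
    reported initial value (in CLIP it is a learnable parameter).
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]

    def cross_entropy(rows: list[list[float]]) -> float:
        # Mean cross-entropy with the matched pair (index i) as the target.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions, as in CLIP.
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))

# Toy batch of 2: matched pairs (diagonal) are more similar than mismatches.
sim = [[0.9, 0.1],
       [0.2, 0.8]]
loss = clip_contrastive_loss(sim)  # small positive value for well-aligned pairs
```

Because the loss is averaged over both retrieval directions, a batch whose diagonal dominates yields a near-zero loss, while swapped pairings are heavily penalized.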