Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion

Authors: Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Andrew Bagdanov

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct an extensive study of the behavior of intra-modal similarities on the intra-modal tasks of image-to-image and text-to-text retrieval. We perform this analysis by transforming intra-modal tasks into inter-modal ones to leverage CLIP's inter-modal alignment. Our experiments show that tackling intra-modal tasks inter-modally via modality inversion, as illustrated on the right side of Fig. 1, outperforms intra-modal baselines on more than fifteen datasets. To further support our claim that this performance improvement stems from inter-modal alignment rather than from the modality inversion process itself, we transform inter-modal tasks into intra-modal ones. Specifically, we show that applying modality inversion to the inherently inter-modal zero-shot image classification task yields worse performance than the inter-modal baseline.
Researcher Affiliation Academia ¹University of Florence, Media Integration and Communication Center (MICC), Italy; ²University of Pisa, Italy; {name.surname}@unifi.it
Pseudocode Yes Algorithm 1 Optimization-based Textual Inversion (OTI) ... Algorithm 2 Optimization-based Visual Inversion (OVI)
Open Source Code Yes The code is publicly available at: https://github.com/miccunifi/Cross-the-Gap.
Open Datasets Yes We consider a total of 15 datasets commonly employed for image-to-image retrieval and image classification. ... ImageNet (Deng et al., 2009) ... Caltech101 (Fei-Fei et al., 2004) ... EuroSAT (Helber et al., 2019) ... Food101 (Bossard et al., 2014) ... FGVCAircraft (Maji et al., 2013) ... Oxford Pets (Parkhi et al., 2012) ... Flowers102 (Nilsback & Zisserman, 2008) ... Stanford Cars (Krause et al., 2013) ... UCF101 (Soomro et al., 2012) ... Describable Textures Dataset (DTD) (Cimpoi et al., 2014) ... CUB-200-2011 (CUB) (Wah et al., 2011), Stanford Online Products (SOP) (Oh Song et al., 2016), ROxford (Radenović et al., 2018), and RParis (Radenović et al., 2018)... COCO (Lin et al., 2014), Flickr30K (Plummer et al., 2015), and nocaps (Agrawal et al., 2019).
Dataset Splits Yes In the 11 datasets used for zero-shot image classification, we use the test set as the query set and the training set as the gallery. For CUB, the entire dataset is used as both the query and gallery sets. In SOP, both the query and gallery sets are taken from the test set. ... We use the Karpathy split (Karpathy & Fei-Fei, 2015) for both COCO and Flickr30K and report results using captions from the test split. For nocaps, we report results on the validation split.
Hardware Specification Yes On average, when using the CLIP ViT-B/32 model, OTI takes approximately 0.2 seconds per image, while OVI takes around 0.5 seconds per text prompt on a single A100 GPU (40 GB) with a batch size of 2048.
Software Dependencies No The paper mentions the AdamW optimizer (Loshchilov & Hutter, 2019) and the Llama-3.2-1B-Instruct Large Language Model (Dubey et al., 2024), but it does not explicitly list software dependencies with specific version numbers for the libraries or frameworks used in the implementation.
Experiment Setup Yes Unless stated otherwise, we use the same hyperparameters for OTI and OVI. We employ the AdamW (Loshchilov & Hutter, 2019) optimizer with a learning rate of 0.02, β1 = 0.9, β2 = 0.999, and weight decay 0.01. We perform 150 optimization steps for OTI and 1000 steps for OVI. For OTI, we consistently use a single pseudo-token (R = 1). ... we fine-tune the CLIP ViT-B/32 model on the COCO dataset (Lin et al., 2014) for 30k steps, using a batch size of 512 and a learning rate of 1e-6. As an optimizer we employ AdamW with β1 = 0.9, β2 = 0.98 and a weight decay of 0.2.
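The reported OTI/OVI setup reduces to optimizing a single pseudo-token embedding with AdamW so that its encoding aligns with a frozen feature from the other modality. Below is a minimal NumPy sketch of that loop, assuming a cosine-similarity objective, an identity "encoder" in place of CLIP's frozen text/image encoder, and the hyperparameters quoted above (lr 0.02, β1 = 0.9, β2 = 0.999, weight decay 0.01, 150 steps); the function name is illustrative, not taken from the released code.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


def optimization_based_inversion(target_feature, dim=512, steps=150, lr=0.02,
                                 beta1=0.9, beta2=0.999, weight_decay=0.01,
                                 seed=0):
    """Sketch of OTI-style inversion: optimize one pseudo-token embedding
    with AdamW (decoupled weight decay) to maximize cosine similarity with
    a frozen target feature. The real method passes the pseudo-token
    through CLIP's frozen encoder; here the encoder is the identity."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim) * 0.02          # pseudo-token embedding
    m = np.zeros(dim)                            # AdamW first moment
    s = np.zeros(dim)                            # AdamW second moment
    tn = target_feature / np.linalg.norm(target_feature)
    for t in range(1, steps + 1):
        # Loss = 1 - cos(v, target); analytic gradient w.r.t. v.
        vn = np.linalg.norm(v)
        cos = (v @ tn) / vn
        grad = -(tn / vn - cos * v / vn**2)
        # AdamW update with bias correction and decoupled weight decay.
        m = beta1 * m + (1 - beta1) * grad
        s = beta2 * s + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)
        s_hat = s / (1 - beta2**t)
        v -= lr * (m_hat / (np.sqrt(s_hat) + 1e-8) + weight_decay * v)
    return v


# Usage: invert a random "image feature" into a pseudo-token embedding.
rng = np.random.default_rng(42)
target = rng.standard_normal(512)
pseudo_token = optimization_based_inversion(target)
print(f"cosine similarity after inversion: "
      f"{cosine_similarity(pseudo_token, target):.3f}")
```

After 150 steps the pseudo-token is strongly aligned with the target direction, which is the mechanism the paper relies on when mapping intra-modal queries into the other modality.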