Interpreting CLIP: Insights on the Robustness to ImageNet Distribution Shifts

Authors: Jonathan Crabbé, Pau Rodriguez, Vaishaal Shankar, Luca Zappella, Arno Blaas

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we bridge this gap by probing the representation spaces of 16 robust zero-shot CLIP vision encoders with various backbones (ResNets and ViTs) and pretraining sets (OpenAI, LAION-400M, LAION-2B, YFCC15M, CC12M and DataComp), and comparing them to the representation spaces of less robust models with identical backbones, but different (pre)training sets or objectives (CLIP pretraining on ImageNet-Captions, and supervised training or finetuning on ImageNet). Through this analysis, we generate three novel insights.
Researcher Affiliation | Collaboration | Jonathan Crabbé (EMAIL), University of Cambridge (work done while at Apple); Pau Rodríguez (EMAIL), Apple; Vaishaal Shankar (EMAIL), Apple; Luca Zappella (EMAIL), Apple; Arno Blaas (EMAIL), Apple
Pseudocode | No | The paper describes mathematical formulas for the contrastive loss and zero-shot classifier logits, but does not provide structured pseudocode or algorithm blocks. Methods are described within the regular text and equations.
Open Source Code | No | The paper mentions leveraging models from the OpenCLIP repository (Ilharco et al., 2021) and checkpoints provided by Fang et al. (2022). However, it does not explicitly state that the authors are releasing their own code for the methodology described in this paper.
Open Datasets | Yes | The paper makes extensive use of well-known public datasets and cites them: "ImageNet (Deng et al., 2009)", "ImageNet-V2 (Recht et al., 2019)", "ImageNet-R (Hendrycks et al., 2021a)", "ImageNet-Sketch (Wang et al., 2019)", "ObjectNet (Barbu et al., 2019)", "ImageNet-A (Hendrycks et al., 2021b)", "YFCC-15M (Thomee et al., 2016; Radford et al., 2021)", "CC-12M (Changpinyo et al., 2021)", "LAION-400M, LAION-2B (Schuhmann et al., 2022)", "DataComp (Cherti et al., 2023)", and the "Broden dataset (Bau et al., 2017)".
Dataset Splits | Yes | We use the ImageNet test set to produce activation vectors h^(n) = f_v(x^(n)) ∈ R^{d_H} for each image x^(n) ∈ R^{d_X} fed to the encoder. We report the average kurtosis over the ImageNet test set... To obtain the finetuned CLIP models... finetune these models for 10 epochs on the ImageNet training set... We train these modified ResNet models from scratch for 90 epochs on the ImageNet training set...
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments. It only generally refers to training models.
Software Dependencies | No | The paper mentions the "PyTorch-Model-Compare package (Subramanian, 2021)" and "torchvision (TorchVision maintainers and contributors, 2016)" but does not specify version numbers for these or other key software components like PyTorch or Python itself.
Experiment Setup | Yes | To obtain the finetuned CLIP models... we then finetune these models for 10 epochs on the ImageNet training set, using a batch size of 256 and a learning rate of 3×10⁻⁵ with a cosine annealing learning rate scheduler and a warm-up of 500 steps. We use the AdamW optimizer and set the weight decay to 0.1. For the supervised ImageNet models... We train these modified ResNet models from scratch for 90 epochs on the ImageNet training set, using a batch size of 1024. We use AdamW, and a learning rate schedule decaying from 10⁻³ to 10⁻⁴ after 30 epochs and to 10⁻⁵ after 60 epochs (with a warm-up period of 5,000 steps). We set weight decay to 10⁻². We use the standard augmentations of horizontal flip with random crop as well as label smoothing.
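The two learning-rate schedules quoted above can be sketched in pure Python. This is an illustrative reconstruction, not the authors' code: the linear warm-up shape and the zero floor of the cosine annealing are assumptions, and the total step count is left as a parameter since the paper gives epochs and batch size rather than steps.

```python
import math

def finetune_lr(step, total_steps, peak_lr=3e-5, warmup_steps=500):
    """CLIP finetuning schedule as described in the paper:
    linear warm-up over 500 steps to a peak of 3e-5, then cosine
    annealing. Decay to zero is an assumption."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

def supervised_lr(epoch):
    """Step decay for the from-scratch ResNet runs: 1e-3 initially,
    1e-4 after epoch 30, 1e-5 after epoch 60 (the 5,000-step warm-up
    is omitted from this sketch)."""
    if epoch < 30:
        return 1e-3
    if epoch < 60:
        return 1e-4
    return 1e-5
```

In practice these correspond to `torch.optim.lr_scheduler.CosineAnnealingLR` (with a separate warm-up) and a step/multi-step scheduler driving AdamW with the quoted weight decays.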