Boosting the visual interpretability of CLIP via adversarial fine-tuning
Authors: Shizhan Gong, Haoyu Lei, Qi Dou, Farzan Farnia
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations by both feature attribution techniques and network dissection offer convincing evidence that the visual interpretability of CLIP improves significantly. With AFT, the image encoder prioritizes pertinent input features, and neurons within the encoder align better with human-understandable concepts. Moreover, these effects generalize to out-of-distribution datasets and transfer to downstream tasks. Additionally, AFT enhances the visual interpretability of derived large vision-language models that incorporate the pre-trained CLIP as an integral component. |
| Researcher Affiliation | Academia | Shizhan Gong, Haoyu Lei, Qi Dou & Farzan Farnia Department of Computer Science and Engineering The Chinese University of Hong Kong EMAIL |
| Pseudocode | No | The paper describes the AFT algorithm and its theoretical underpinnings in Section 3, but it does not present a formal pseudocode block or algorithm box. |
| Open Source Code | Yes | The code of this paper is available at https://github.com/peterant330/CLIP_AFT. |
| Open Datasets | Yes | We fine-tuned the CLIP using the ImageNet (Deng et al., 2009) training set for 2 epochs. ... We also evaluate the interpretability of the image encoder through CLIP-dissect (Oikarinen & Weng, 2022) and network dissection (Bau et al., 2017). Our findings reveal that AFT enhances the alignment of neural activations with human-understandable concepts and promotes more object-centric activations. ... We visualize the SG for several images sourced from fine-grained classification and medical image datasets (Fig. 2 (c)). ... We conduct Remove and Retrain (ROAR) analysis (Hooker et al., 2019) on the saliency maps. ... We perform the analysis on both in-distribution datasets (Imagenette, a ten-class subset of ImageNet) and out-of-distribution datasets (CUB-200 (Wah et al., 2011) and Caltech-256 (Griffin et al., 2007)). ... We assess the localization capabilities of the saliency maps using the ImageNet-Segmentation (Gao et al., 2022) validation set, which includes segmentation annotations for 12,419 images across 919 categories from ImageNet. ... We apply CLIP-dissect (Oikarinen & Weng, 2022) to discover the concept detectors. Specifically, we use the broadly and densely labeled (Broden) dataset (Bau et al., 2017). ... We show several results from the COCO dataset (Lin et al., 2014) in Fig. 7. ... The evaluation datasets include: Caltech101 (Fei-Fei et al., 2004), Stanford Cars (Krause et al., 2013), CIFAR10, CIFAR100 (Krizhevsky et al., 2009), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), FGVC Aircrafts (Maji et al., 2013), Flowers (Nilsback & Zisserman, 2008), ImageNet-R (Hendrycks et al., 2021), ImageNet-Sketch (Wang et al., 2019), PCAM (Veeling et al., 2018), Oxford Pets (Parkhi et al., 2012), and STL-10 (Coates et al., 2011). We also test performance on the validation set of ImageNet (Deng et al., 2009). ... We perform AFT on these models with the training data from ... The ISIC 2024 Challenge Dataset (Kurtansky et al., 2024), and 3) MIMIC-CXR (Johnson et al., 2019). |
| Dataset Splits | Yes | We fine-tuned the CLIP using the ImageNet (Deng et al., 2009) training set for 2 epochs. ... We selected test images from the Imagenette dataset, which consists of 10 easily classified categories from ImageNet. ... We assess the localization capabilities of the saliency maps using the ImageNet-Segmentation (Gao et al., 2022) validation set, which includes segmentation annotations for 12,419 images across 919 categories from ImageNet. ... We collected 10 validation cases from ImageNet for both the original CLIP and CLIP with AFT, where each set included 5 cases with correct zero-shot predictions and 5 with incorrect predictions. For each case, we displayed 4 GradCAM maps generated by the network corresponding to the top predicted classes: for correct cases, these included the correct class, and for incorrect cases, we showed the top three wrong classes alongside the true class. The 20 cases were also shuffled randomly. |
| Hardware Specification | Yes | All experiments were conducted on NVIDIA GeForce RTX 4090 GPUs. |
| Software Dependencies | No | The paper mentions using 'AdamW (Loshchilov et al., 2017)' as an optimizer but does not specify version numbers for any programming languages, libraries, or other software components. |
| Experiment Setup | Yes | We fine-tuned the CLIP using the ImageNet (Deng et al., 2009) training set for 2 epochs. We experimented with both ViT (Dosovitskiy et al., 2020) and ResNet (He et al., 2016) architectures. Unless stated otherwise, the results are based on ViT-B/16 with σ = η = 1/255 and ϵ = 4/255. For AFT, we applied PGD (Madry et al., 2017) for 10 steps. All experiments were conducted on NVIDIA GeForce RTX 4090 GPUs. ... We use the AdamW (Loshchilov et al., 2017) optimizer with momentum coefficients β1 and β2 set to 0.9 and 0.95, respectively. The training was done with a cosine decaying learning rate schedule with a linear warm-up and a peak learning rate of 1e-5. The weight decay is set to 1e-4 and the batch size is 128 for RN50 and ViT-B and 64 for ViT-L, respectively. We sample only 1 image per iteration from the Gaussian distribution as a simplified implementation of Gaussian smoothing. |
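The setup above (PGD with 10 steps, ϵ = 4/255, step size η = 1/255, a single Gaussian noise sample with σ = 1/255, and AdamW with betas 0.9/0.95, peak lr 1e-5, weight decay 1e-4) can be sketched as one adversarial fine-tuning iteration. This is a minimal illustration, not the paper's implementation: the exact AFT loss is not quoted in the excerpt, so the embedding-consistency objective, the `pgd_perturb` helper, and the tiny stand-in encoder (in place of the real CLIP ViT-B/16 image encoder) are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pgd_perturb(encoder, images, eps=4 / 255, step=1 / 255, n_steps=10):
    """Illustrative PGD inner loop (Madry et al., 2017): search an L-inf
    ball of radius eps for a perturbation that shifts the embedding most."""
    clean_emb = encoder(images).detach()
    delta = torch.zeros_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(n_steps):
        adv_emb = encoder(images + delta)
        # Assumed objective: maximize cosine distance from the clean embedding.
        loss = 1.0 - F.cosine_similarity(adv_emb, clean_emb).mean()
        loss.backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()  # signed-gradient ascent step
            delta.clamp_(-eps, eps)            # project back into the eps-ball
        delta.grad = None
    return delta.detach()

# Tiny stand-in for the CLIP image encoder (the paper uses ViT-B/16).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16))

# AdamW with the quoted hyperparameters: betas (0.9, 0.95), lr 1e-5, wd 1e-4.
opt = torch.optim.AdamW(encoder.parameters(), lr=1e-5,
                        betas=(0.9, 0.95), weight_decay=1e-4)

images = torch.rand(4, 3, 8, 8)
# Simplified Gaussian smoothing: one noise sample per iteration, sigma = 1/255.
noisy = images + torch.randn_like(images) * (1 / 255)
delta = pgd_perturb(encoder, noisy)

# Outer step: pull adversarial embeddings back toward the clean ones.
opt.zero_grad()
loss = 1.0 - F.cosine_similarity(encoder(noisy + delta),
                                 encoder(noisy).detach()).mean()
loss.backward()
opt.step()
```

In a full run this iteration would be wrapped in a 2-epoch loop over the ImageNet training set with the cosine learning rate schedule and linear warm-up described above.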