Explaining Caption-Image Interactions in CLIP Models with Second-Order Attributions
Authors: Lucas Moeller, Pascal Tilli, Ngoc Thang Vu, Sebastian Padó
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we apply our feature-interaction attributions to CLIP models. We focus on evaluating the interactions between mentioned objects in captions and corresponding regions in images by selecting token ranges in captions and analyzing their interactions with image patches. In the first series of experiments, we compare our attributions against baselines (Section 4.2). The second series in Section 4.3 then utilizes our method and analyzes CLIP models. |
| Researcher Affiliation | Academia | Lucas Moeller EMAIL Pascal Tilli EMAIL Ngoc Thang Vu EMAIL Sebastian Padó EMAIL University of Stuttgart |
| Pseudocode | Yes | Algorithm 1: PyTorch-like pseudocode sketching the computation of our attributions. |
| Open Source Code | Yes | Code is publicly available: https://github.com/lucasmllr/exCLIP |
| Open Datasets | Yes | We base our evaluation on three image-caption datasets that additionally contain object bounding-box annotations in images: Microsoft's Common Objects in Context (COCO) (Lin et al., 2014), the Flickr30k collection (Young et al., 2014) with entity annotations (Plummer et al., 2015), and the Hard Negative Captions (HNC) dataset by Dönmez et al. (2023). |
| Dataset Splits | Yes | On Flickr30k we use the test split, and on COCO we use the validation split for our analysis, as the test split does not contain captions (footnote: https://www.kaggle.com/datasets/shtvkumar/karpathy-splits). |
| Hardware Specification | Yes | Weight decay is set to 1×10⁻⁴ and the batch size is 64 on a single 50GB Nvidia A6000. |
| Software Dependencies | No | The paper mentions "PyTorch-like pseudocode" and uses `from torch import Tensor`, indicating the use of PyTorch. However, no specific version numbers for PyTorch, Python, or other libraries are provided. |
| Experiment Setup | Yes | We run all trainings for five epochs using AdamW (Loshchilov & Hutter, 2018), starting with an initial learning rate of 1×10⁻⁷ that exponentially increases to 1×10⁻⁵. Weight decay is set to 1×10⁻⁴ and the batch size is 64 on a single 50GB Nvidia A6000. |
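
The paper's Algorithm 1 (PyTorch-like pseudocode for the attributions) is not reproduced in the table above. As a rough, hypothetical illustration of what a second-order interaction attribution between caption features and image features means, the sketch below uses a bilinear similarity as a toy stand-in for CLIP's embedding dot product: the interaction between text feature `x_i` and image feature `y_j` is the mixed partial derivative of the score, which for this toy model has the closed form `(A.T @ B)[i, j]`. The matrices `A` and `B` are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Toy "encoders": two random linear maps into a shared embedding space.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # stand-in text encoder (3 token features)
B = rng.standard_normal((4, 5))   # stand-in image encoder (5 patch features)

def s(x, y):
    # Bilinear similarity score, a toy stand-in for CLIP's dot product.
    return (A @ x) @ (B @ y)

# Second-order attributions: mixed partials d^2 s / dx_i dy_j.
# For this bilinear score they have the closed form A^T B.
interactions = A.T @ B            # shape (3, 5)

# Finite-difference check of one entry (exact up to float error,
# since the score is bilinear in x and y).
x, y, eps = np.zeros(3), np.zeros(5), 1e-5
e_i, e_j = np.eye(3)[0], np.eye(5)[1]
fd = (s(x + eps * e_i, y + eps * e_j)
      - s(x + eps * e_i, y)
      - s(x, y + eps * e_j)
      + s(x, y)) / eps**2
```

In the paper's setting the encoders are deep networks, so the mixed partials have no closed form and are computed with autodiff; this sketch only conveys the quantity being attributed.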
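
The experiment setup quotes a learning rate that exponentially increases from 1×10⁻⁷ to 1×10⁻⁵ over five epochs. A minimal sketch of that schedule, assuming one multiplicative step per epoch (the quote does not state the step granularity):

```python
def exponential_lr_schedule(lr_start, lr_end, num_epochs):
    """Per-epoch learning rates growing geometrically from lr_start to lr_end."""
    # Constant growth factor so that lr_start * gamma**(num_epochs - 1) == lr_end.
    gamma = (lr_end / lr_start) ** (1.0 / (num_epochs - 1))
    return [lr_start * gamma ** epoch for epoch in range(num_epochs)]

# Values from the quoted setup: 1e-7 -> 1e-5 over five epochs.
lrs = exponential_lr_schedule(1e-7, 1e-5, 5)
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.ExponentialLR` with `gamma > 1`; whether the paper steps per epoch or per batch is not stated, so the granularity here is an assumption.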