Explaining Caption-Image Interactions in CLIP Models with Second-Order Attributions
Authors: Lucas Moeller, Pascal Tilli, Ngoc Thang Vu, Sebastian Padó
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we apply our feature-interaction attributions to CLIP models. We focus on evaluating the interactions between mentioned objects in captions and corresponding regions in images by selecting token ranges in captions and analyzing their interactions with image patches. In the first series of experiments, we compare our attributions against baselines (Section 4.2). The second series in Section 4.3 then utilizes our method and analyzes CLIP models. |
| Researcher Affiliation | Academia | Lucas Moeller EMAIL Pascal Tilli EMAIL Ngoc Thang Vu EMAIL Sebastian Padó EMAIL University of Stuttgart |
| Pseudocode | Yes | Algorithm 1: PyTorch-like pseudocode sketching the computation of our attributions. |
| Open Source Code | Yes | Code is publicly available: https://github.com/lucasmllr/exCLIP |
| Open Datasets | Yes | We base our evaluation on three image-caption datasets that additionally contain object bounding-box annotations in images: Microsoft's Common Objects in Context (COCO) (Lin et al., 2014), the Flickr30k collection (Young et al., 2014) with entity annotations (Plummer et al., 2015), and the Hard Negative Captions (HNC) dataset by Dönmez et al. (2023). |
| Dataset Splits | Yes | On Flickr30k we use the test split, and on COCO we use the validation split for our analysis, as the test split does not contain captions (footnote: https://www.kaggle.com/datasets/shtvkumar/karpathy-splits). |
| Hardware Specification | Yes | Weight decay is set to 1×10⁻⁴ and the batch size is 64 on a single 50GB Nvidia A6000. |
| Software Dependencies | No | The paper mentions "PyTorch-like pseudocode" and uses `from torch import Tensor`, indicating the use of PyTorch. However, no specific version numbers for PyTorch, Python, or other libraries are provided. |
| Experiment Setup | Yes | We run all trainings for five epochs using AdamW (Loshchilov & Hutter, 2018), starting with an initial learning rate of 1×10⁻⁷ that exponentially increases to 1×10⁻⁵. Weight decay is set to 1×10⁻⁴ and the batch size is 64 on a single 50GB Nvidia A6000. |
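
The paper's Algorithm 1 (PyTorch-like pseudocode for the attributions) is not reproduced in the table above. As a rough, hypothetical illustration of what a second-order interaction attribution between caption features and image features means, the sketch below uses a bilinear similarity as a toy stand-in for CLIP's embedding dot product: the interaction between text feature `x_i` and image feature `y_j` is the mixed partial derivative of the score, which for this toy model has the closed form `(A.T @ B)[i, j]`. The matrices `A` and `B` are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Toy "encoders": two random linear maps into a shared embedding space.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # stand-in text encoder (3 token features)
B = rng.standard_normal((4, 5))   # stand-in image encoder (5 patch features)

def s(x, y):
    # Bilinear similarity score, a toy stand-in for CLIP's dot product.
    return (A @ x) @ (B @ y)

# Second-order attributions: mixed partials d^2 s / dx_i dy_j.
# For this bilinear score they have the closed form A^T B.
interactions = A.T @ B            # shape (3, 5)

# Finite-difference check of one entry (exact up to float error,
# since the score is bilinear in x and y).
x, y, eps = np.zeros(3), np.zeros(5), 1e-5
e_i, e_j = np.eye(3)[0], np.eye(5)[1]
fd = (s(x + eps * e_i, y + eps * e_j)
      - s(x + eps * e_i, y)
      - s(x, y + eps * e_j)
      + s(x, y)) / eps**2
```

In the paper's setting the encoders are deep networks, so the mixed partials have no closed form and are computed with autodiff; this sketch only conveys the quantity being attributed.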
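
The experiment setup quotes a learning rate that exponentially increases from 1×10⁻⁷ to 1×10⁻⁵ over five epochs. A minimal sketch of that schedule, assuming one multiplicative step per epoch (the quote does not state the step granularity):

```python
def exponential_lr_schedule(lr_start, lr_end, num_epochs):
    """Per-epoch learning rates growing geometrically from lr_start to lr_end."""
    # Constant growth factor so that lr_start * gamma**(num_epochs - 1) == lr_end.
    gamma = (lr_end / lr_start) ** (1.0 / (num_epochs - 1))
    return [lr_start * gamma ** epoch for epoch in range(num_epochs)]

# Values from the quoted setup: 1e-7 -> 1e-5 over five epochs.
lrs = exponential_lr_schedule(1e-7, 1e-5, 5)
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.ExponentialLR` with `gamma > 1`; whether the paper steps per epoch or per batch is not stated, so the granularity here is an assumption.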