Finetuning CLIP to Reason about Pairwise Differences
Authors: Dylan Sam, Devin Willmott, João D. Semedo, J Zico Kolter
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute (e.g., elephants are larger than cats), which is useful in retrieval or constructing attribute-based classifiers, and improved zeroshot classification performance on many downstream image classification tasks. In addition, our approach enables a new mechanism for inference that we refer to as comparative prompting, where we leverage prior knowledge of text descriptions of differences between classes of interest, achieving even larger performance gains in classification. |
| Researcher Affiliation | Collaboration | Dylan Sam EMAIL Carnegie Mellon University Devin Willmott EMAIL Bosch Center for AI Joao D. Semedo EMAIL Bosch Center for AI J. Zico Kolter EMAIL Carnegie Mellon University |
| Pseudocode | No | The paper describes methods in prose and equations (e.g., Equation (1), (2), (3), (4), (5)) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We defer more details to Appendix A, and our code can be found here 1. 1https://github.com/dsam99/pc_clip |
| Open Datasets | Yes | We generate comparatives on two datasets, COCO (Lin et al., 2014) and CUB-200-2011 (Reed et al., 2016). |
| Dataset Splits | No | To generate our PC-CLIP finetuning dataset of pairwise comparisons, we use LLaMA2-13B-chat-hf (Touvron et al., 2023)... As the number of pairs scales quadratically in the dataset size, we create pairs (and their corresponding language differences) from 1000 randomly sampled images. |
| Hardware Specification | Yes | We compute our LLM-generated comparatives using a single A100 GPU or 2 A6000 GPUs, and the total process requires approximately 30 GPU hours. In our finetuning of the text encoder of PC-CLIP, we use a single A100 or A6000 GPU, which takes roughly 12 GPU hours to train for 20 epochs over our set of roughly 560,000 comparatives and pairs of images on COCO. |
| Software Dependencies | Yes | To generate our PC-CLIP finetuning dataset of pairwise comparisons, we use LLaMA2-13B-chat-hf (Touvron et al., 2023). |
| Experiment Setup | Yes | PC-CLIP COCO Finetuning: We finetune CLIP with our comparative-based objective on COCO using the following hyperparameter values: τ = 1.0 as the temperature value in the contrastive loss function; learning rate of 10^-8 with an exponential scheduler with γ = 0.9; 20 epochs of finetuning; batch size of 512. |
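The Dataset Splits row notes that the number of pairs scales quadratically in the dataset size, which is why the authors subsample 1000 images before forming pairs. A minimal sketch of that subsampling step (function name, seed, and dataset size here are illustrative assumptions, not the authors' code):

```python
import itertools
import random

def sample_pairs(num_images, subset_size, seed=0):
    """Sample a subset of image indices and form all unordered pairs.

    Pair count grows quadratically: k images yield k * (k - 1) / 2 pairs,
    which motivates subsampling 1000 images rather than pairing the
    full dataset.
    """
    rng = random.Random(seed)
    subset = rng.sample(range(num_images), subset_size)
    return list(itertools.combinations(subset, 2))

# 1000 images -> 1000 * 999 / 2 = 499,500 unordered pairs
pairs = sample_pairs(num_images=118_287, subset_size=1000)
print(len(pairs))  # 499500
```

Each pair would then be passed to the LLM to generate a language description of the difference between the two images' captions.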
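The Experiment Setup row lists the reported finetuning hyperparameters; a hedged sketch collecting them into a config, with the exponential learning-rate decay written out explicitly (the dict layout and function names are assumptions for illustration only):

```python
# Reported PC-CLIP COCO finetuning hyperparameters, gathered into a config.
CONFIG = {
    "temperature": 1.0,   # tau in the contrastive loss
    "base_lr": 1e-8,      # initial learning rate
    "gamma": 0.9,         # exponential scheduler decay factor
    "epochs": 20,
    "batch_size": 512,
}

def lr_at_epoch(epoch, base_lr=CONFIG["base_lr"], gamma=CONFIG["gamma"]):
    """Learning rate after `epoch` decay steps: base_lr * gamma ** epoch.

    This mirrors the behavior of a standard exponential LR scheduler
    (e.g., torch.optim.lr_scheduler.ExponentialLR stepped once per epoch).
    """
    return base_lr * gamma ** epoch

# Decay over the 20 reported epochs: 1e-8 at epoch 0 down to ~1.35e-9.
schedule = [lr_at_epoch(e) for e in range(CONFIG["epochs"])]
```

With γ = 0.9 stepped per epoch, the learning rate falls to roughly 13.5% of its initial value by epoch 19, so the 10^-8 base rate stays within an order of magnitude throughout finetuning.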