Finetuning CLIP to Reason about Pairwise Differences
Authors: Dylan Sam, Devin Willmott, João D. Semedo, J Zico Kolter
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute (e.g., elephants are larger than cats), which is useful in retrieval or constructing attribute-based classifiers, and improved zeroshot classification performance on many downstream image classification tasks. In addition, our approach enables a new mechanism for inference that we refer to as comparative prompting, where we leverage prior knowledge of text descriptions of differences between classes of interest, achieving even larger performance gains in classification. |
| Researcher Affiliation | Collaboration | Dylan Sam EMAIL Carnegie Mellon University Devin Willmott EMAIL Bosch Center for AI Joao D. Semedo EMAIL Bosch Center for AI J. Zico Kolter EMAIL Carnegie Mellon University |
| Pseudocode | No | The paper describes methods in prose and equations (e.g., Equation (1), (2), (3), (4), (5)) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We defer more details to Appendix A, and our code can be found here 1. 1https://github.com/dsam99/pc_clip |
| Open Datasets | Yes | We generate comparatives on two datasets, COCO (Lin et al., 2014) and CUB-200-2011 (Reed et al., 2016). |
| Dataset Splits | No | To generate our PC-CLIP finetuning dataset of pairwise comparisons, we use LLaMA2-13B-chat-hf (Touvron et al., 2023)... As the number of pairs scales quadratically in the dataset size, we create pairs (and their corresponding language differences) from 1000 randomly sampled images. |
| Hardware Specification | Yes | We compute our LLM-generated comparatives using a single A100 GPU or 2 A6000 GPUs, and the total process requires approximately 30 GPU hours. In our finetuning of the text encoder of PC-CLIP, we use a single A100 or A6000 GPU, which takes roughly 12 GPU hours to train for 20 epochs over our set of roughly 560,000 comparatives and pairs of images on COCO. |
| Software Dependencies | Yes | To generate our PC-CLIP finetuning dataset of pairwise comparisons, we use LLaMA2-13B-chat-hf (Touvron et al., 2023). |
| Experiment Setup | Yes | PC-CLIP COCO Finetuning: We finetune CLIP with our comparative-based objective on COCO using the following hyperparameter values: τ = 1.0 as the temperature value in the contrastive loss function; learning rate of 10^-8 with an exponential scheduler with γ = 0.9; 20 epochs of finetuning; batch size of 512. |
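The Dataset Splits row notes that the number of pairs scales quadratically in the dataset size, which is why the authors subsample 1000 images before forming pairs. A minimal sketch of that subsampling step (function name, seed, and dataset size here are illustrative assumptions, not the authors' code):

```python
import itertools
import random

def sample_pairs(num_images, subset_size, seed=0):
    """Sample a subset of image indices and form all unordered pairs.

    Pair count grows quadratically: k images yield k * (k - 1) / 2 pairs,
    which motivates subsampling 1000 images rather than pairing the
    full dataset.
    """
    rng = random.Random(seed)
    subset = rng.sample(range(num_images), subset_size)
    return list(itertools.combinations(subset, 2))

# 1000 images -> 1000 * 999 / 2 = 499,500 unordered pairs
pairs = sample_pairs(num_images=118_287, subset_size=1000)
print(len(pairs))  # 499500
```

Each pair would then be passed to the LLM to generate a language description of the difference between the two images' captions.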
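The Experiment Setup row lists the reported finetuning hyperparameters; a hedged sketch collecting them into a config, with the exponential learning-rate decay written out explicitly (the dict layout and function names are assumptions for illustration only):

```python
# Reported PC-CLIP COCO finetuning hyperparameters, gathered into a config.
CONFIG = {
    "temperature": 1.0,   # tau in the contrastive loss
    "base_lr": 1e-8,      # initial learning rate
    "gamma": 0.9,         # exponential scheduler decay factor
    "epochs": 20,
    "batch_size": 512,
}

def lr_at_epoch(epoch, base_lr=CONFIG["base_lr"], gamma=CONFIG["gamma"]):
    """Learning rate after `epoch` decay steps: base_lr * gamma ** epoch.

    This mirrors the behavior of a standard exponential LR scheduler
    (e.g., torch.optim.lr_scheduler.ExponentialLR stepped once per epoch).
    """
    return base_lr * gamma ** epoch

# Decay over the 20 reported epochs: 1e-8 at epoch 0 down to ~1.35e-9.
schedule = [lr_at_epoch(e) for e in range(CONFIG["epochs"])]
```

With γ = 0.9 stepped per epoch, the learning rate falls to roughly 13.5% of its initial value by epoch 19, so the 10^-8 base rate stays within an order of magnitude throughout finetuning.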