Whitened CLIP as a Likelihood Surrogate of Images and Captions

Authors: Roy Betser, Meir Yossef Levi, Guy Gilboa

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores to images and captions. All the experiments in this section employ the CLIP ViT-L/14 model and utilize the MS-COCO validation set to compute the whitening matrix W."
Researcher Affiliation | Academia | "1 Viterbi Faculty of Electrical and Computer Engineering, Technion - Israel Institute of Technology, Haifa, Israel. Correspondence to: Roy Betser <EMAIL>, Meir Yossef Levi <EMAIL>, Guy Gilboa <EMAIL>."
Pseudocode | Yes | Algorithm 1 (Whitening Process).
Input: dataset X ∈ R^(N×d), correlation threshold τ. Output: whitening matrix W.
Step 1 (Compute correlation matrix): C_ij = Cov(X_i, X_j).
Step 2 (Remove highly correlated features): identify feature pairs (i, j) where |C_ij| > τ; for each pair, remove one feature (e.g., j) and replace it with random noise r ~ N(0, 0.1). Denote the updated dataset X'.
Step 3 (Compute covariance matrix): Σ = (1/N) X'^T X'.
Step 4 (Eigenvalue decomposition): decompose Σ into eigenvalues Λ and eigenvectors V, so that Σ = V Λ V^T.
Step 5 (Compute whitening matrix and transform data): W = Λ^(-1/2) V^T, where Λ^(-1/2) is a diagonal matrix whose entries are the inverse square roots of the eigenvalues: (Λ^(-1/2))_ii = 1/√λ_i.
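The five steps of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' released code: the function name, the default threshold, and the use of the uncentered second-moment matrix as Σ are assumptions based on the description above.

```python
# Sketch of Algorithm 1 (Whitening Process); illustrative, not the authors' code.
import numpy as np

def compute_whitening_matrix(X, tau=0.99, rng=None):
    """Return the whitening matrix W for a dataset X of shape (N, d)."""
    rng = np.random.default_rng(rng)
    X = X.copy()
    N, d = X.shape
    # Step 1: correlation matrix of the d features.
    C = np.corrcoef(X, rowvar=False)
    # Step 2: for each pair (i, j) with |C_ij| > tau, replace feature j
    # with random noise r ~ N(0, 0.1), as described in the algorithm.
    replaced = set()
    for i in range(d):
        for j in range(i + 1, d):
            if abs(C[i, j]) > tau and j not in replaced:
                X[:, j] = rng.normal(0.0, 0.1, size=N)
                replaced.add(j)
    # Step 3: covariance matrix Sigma = (1/N) X'^T X'.
    Sigma = X.T @ X / N
    # Step 4: eigendecomposition Sigma = V Lambda V^T.
    eigvals, V = np.linalg.eigh(Sigma)
    # Step 5: W = Lambda^{-1/2} V^T (assumes strictly positive eigenvalues).
    W = np.diag(1.0 / np.sqrt(eigvals)) @ V.T
    return W
```

Applying Z = X W^T then yields (1/N) Z^T Z ≈ I, i.e., decorrelated, unit-variance features.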
Open Source Code | Yes | "Our code, along with detailed instructions, is available HERE."
Open Datasets | Yes | "The 5000 embeddings of the MS-COCO validation set (Lin et al., 2014) are divided into 20 equal groups of 250 samples each. Fig. 3 evaluates a subset of ImageNet (Deng et al., 2009), as presented in Kan et al. (2018), in comparison to ImageNet-A (Hendrycks et al., 2021b), ImageNet-C (Hendrycks & Dietterich, 2019), and ImageNet-R (Hendrycks et al., 2021a). Flickr8k (Hodosh et al., 2013), similarly to MS-COCO, is a benchmark for image-captioning tasks that emphasizes real-world imagery and descriptive diversity. We sampled 5,000 sentences from OpenWebText (Gokaslan et al., 2019), a general text dataset..."
Dataset Splits | Yes | "For stability, the 5000 embeddings of the MS-COCO validation set (Lin et al., 2014) are divided into 20 equal groups of 250 samples each. For each size (1k, 2k, 3k, 4k), we randomly sampled 5 subsets of the MS-COCO validation set."
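As a minimal sketch, the grouping and subsampling described in this row could look as follows (the embedding dimension, seed, and random placeholder data are assumptions for illustration):

```python
# Illustrative sketch of the dataset splits described above (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
# Placeholder for the 5000 MS-COCO validation embeddings; dimension 768 is assumed.
embeddings = rng.normal(size=(5000, 768))

# Divide into 20 equal groups of 250 samples each (used for stability estimates).
groups = np.split(embeddings, 20)

# For each size in {1k, 2k, 3k, 4k}, draw 5 random subsets without replacement.
subsets = {
    size: [embeddings[rng.choice(len(embeddings), size=size, replace=False)]
           for _ in range(5)]
    for size in (1000, 2000, 3000, 4000)
}
```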
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cluster specifications) are mentioned in the paper for running experiments.
Software Dependencies | No | The paper mentions several models and frameworks (e.g., CLIP ViT-L/14, GPT-2, UnCLIP) but does not provide specific version numbers for any ancillary software dependencies like Python, PyTorch, or other libraries.
Experiment Setup | No | The paper describes the models used (e.g., CLIP ViT-L/14, UnCLIP) and the datasets (MS-COCO, ImageNet), along with specific processing steps like whitening and normalization, but it does not specify concrete hyperparameters like learning rates, batch sizes, or number of epochs for training any models used in the experiments.