The Double-Ellipsoid Geometry of CLIP

Authors: Meir Yossef Levi, Guy Gilboa

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our statistical analysis and many experimental results are based on the MS-COCO (Lin et al., 2014) validation set, a common standard image-text dataset. In Fig. 2, normalized histograms are shown for features 93, 134 and 494 of the CLIP latent vector. To empirically analyze the uniformity and alignment terms in Eq. 8 alongside the overall loss in Eq. 7, we use the MS-COCO validation set. The results show that the loss for correctly classified samples decreases monotonically with the shift toward the origin.
Researcher Affiliation | Academia | Viterbi Faculty of Electrical and Computer Engineering, Technion - Israel Institute of Technology, Haifa, Israel. Correspondence to: Meir Yossef Levi <EMAIL>, Guy Gilboa <EMAIL>.
Pseudocode | No | The paper describes methods and analyses using mathematical equations and prose, but it does not contain any explicitly labeled pseudocode blocks or algorithms in a structured, code-like format.
Open Source Code | No | The paper references existing frameworks and methods (e.g., "The unCLIP framework (Ramesh et al., 2022)", "text inversion (Han et al., 2024; Gal et al., 2022; Mokady et al., 2023)"), but it does not contain any statement from the authors about releasing their own source code for the methodology described in this paper, nor does it provide a link to a code repository.
Open Datasets | Yes | Our statistical analysis and many experimental results are based on the MS-COCO (Lin et al., 2014) validation set, a common standard image-text dataset. We provide additional visualizations of high- and low-conformity images across various datasets. Figure 19 illustrates examples of sketches from ImageNet-R, while Figure 20 showcases examples from ImageNet-A.
Dataset Splits | Yes | Our statistical analysis and many experimental results are based on the MS-COCO (Lin et al., 2014) validation set, a common standard image-text dataset. We treat the entire validation set (5k samples) as a single batch.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the experiments (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | No | The paper discusses the normalized temperature-scaled cross-entropy loss (NT-Xent) used in CLIP's training and analyzes its behavior, including varying a parameter 'alpha' for analytical purposes. However, it does not specify concrete hyperparameters such as learning rates, batch sizes, number of epochs, or other system-level training settings for its own experiments or for reproducing the analyzed CLIP model.
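For context on the loss discussed above: the NT-Xent (InfoNCE-style) objective used in CLIP-style training treats matched image-text pairs in a batch as positives and all other pairings as negatives. The following is a minimal NumPy sketch of the standard symmetric form, not the authors' code; the function name, temperature default, and array shapes are illustrative assumptions.

```python
import numpy as np

def nt_xent_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric NT-Xent loss over a batch of paired embeddings.

    image_feats, text_feats: (N, d) arrays; row i of each is a positive pair.
    Hypothetical sketch of the standard CLIP-style contrastive objective.
    """
    # L2-normalize so dot products become cosine similarities
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (N, N); diagonal holds positives
    labels = np.arange(logits.shape[0])

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

With orthonormal, perfectly matched features (e.g. `np.eye(4)` for both modalities) the loss is near zero, while mismatching the pairs drives it up, which is the contrastive behavior the paper's analysis builds on.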