Intriguing Properties of Hyperbolic Embeddings in Vision-Language Models
Authors: Sarah Ibrahimi, Mina Ghadimi Atigh, Nanne van Noord, Pascal Mettes, Marcel Worring
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we conduct a deeper study into the hyperbolic embeddings and find that they open new doors for vision-language models. In particular, we find that hyperbolic vision-language models provide spatial awareness that Euclidean vision-language models lack, are better capable of dealing with ambiguity, and effectively discriminate between distributions. Our findings shed light on the greater potential of hyperbolic embeddings in large-scale settings, reaching beyond conventional down-stream tasks. |
| Researcher Affiliation | Academia | Sarah Ibrahimi (University of Amsterdam); Mina Ghadimi Atigh (University of Amsterdam); Nanne van Noord (University of Amsterdam); Pascal Mettes (University of Amsterdam); Marcel Worring (University of Amsterdam) |
| Pseudocode | No | The paper includes mathematical definitions and equations related to hyperbolic geometry, such as Lorentzian inner product (Equation 1), Lorentzian norm (Equation 2), Geodesics (Equation 4), Tangent space (Equation 5, 6), Exponential map (Equation 7), and Logarithmic map (Equation 8). However, it does not contain any clearly labeled pseudocode or algorithm blocks describing a step-by-step procedure. |
| Open Source Code | Yes | Our code is available at https://github.com/saibr/hypvl |
| Open Datasets | Yes | VL-Checklist (Zhao et al., 2022), VG-Relations (Yüksekgönül et al., 2023), and CLIPbind-r (Lewis et al., 2023). Propaganda Memes: This benchmark dataset contains 950 ambiguous image-text memes spanning 22 propaganda techniques, including whataboutism, obfuscation, and glittering generalities (Dimitrov et al., 2021). My Reaction When: This benchmark dataset has 50K video-sentence pairs from social media depicting physical/emotional reactions to textual captions with high ambiguity (Song & Soleymani, 2019). Visual Word Sense Disambiguation: This benchmark evaluates associating ambiguous words, like mouse, with intended meanings from contextual images (Raganato et al., 2023). To quantify textual hierarchy encoding, we leverage label trees from six vision datasets: CIFAR-100, Animals with Attributes 2 (AWA2) (Xian et al., 2019), PASCAL-VOC (Everingham et al., 2010), UCF (Soomro et al., 2012), Kinetics (Kay et al., 2017), and ActivityNet (Heilbron et al., 2015). As In-Distribution (ID) datasets, we use CIFAR-10, ImageNet-200, and ImageNet-1k. The out-of-distribution datasets are divided into two categories: near-OOD (hard-OOD) and far-OOD (easy-OOD), based on image semantics and empirical difficulty. For CIFAR-10, we use CIFAR-100 and Tiny ImageNet (Le & Yang) as the near-OOD datasets and MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), Textures (Cimpoi et al., 2014), and Places365 (Zhou et al., 2017) as far-OOD datasets. For ImageNet-200 and ImageNet-1k, we use the following datasets. For near-OOD, we use SSB-hard (Vaze et al., 2022), a dataset composed of images from ImageNet-21K, and NINCO by Bitterwolf et al. (2023). For far-OOD, we use iNaturalist (Horn et al., 2018), Textures (Cimpoi et al., 2014), and OpenImage-O (Wang et al., 2022). |
| Dataset Splits | Yes | VL-Checklist: a spatial reasoning benchmark dataset that uses 30k Visual Genome (Krishna et al., 2017) images with two different descriptions. VG-Relations: a spatial reasoning benchmark dataset that consists of 48 relations with nearly 24k test cases. CLIPbind-r: a synthetic CLEVR-inspired benchmark dataset (Johnson et al., 2017) created with a tool for 3D modelling and rendering (Lewis et al., 2023), with 10k images for validation. My Reaction When: This benchmark dataset has 50K video-sentence pairs... We evaluate on video-text and text-video retrieval using the widely used mean pooling to aggregate per-frame CLIP embeddings into a single video representation (Luo et al., 2022). For hyperbolic CLIP, video-text similarity is the Lorentzian inner product between the mean-pooled video and sentence embeddings, ⟨E_video, E_T⟩_L. Euclidean CLIP uses cosine similarity. Evaluation is via recall@k (top-k accuracy), checking whether ground-truth pairs are retrieved in the closest neighborhoods. This tests semantic matching between ambiguous pairs of reactions and descriptive captions. Visual Word Sense Disambiguation... We evaluate the model on the full dataset, which consists of 13,332 samples. Evaluation uses Mean Reciprocal Rank at 5 and 10 (MRR@5, MRR@10) plus Hit Rate at 1 (HIT@1) to measure disambiguation capabilities given varying levels of contextual grounding. |
| Hardware Specification | No | The paper mentions that "hyperbolic operations increase the computational cost of a neural network" and provides a comparison of running time for experiments ("114.8 seconds for hyperbolic CLIP evaluation and 113.6 seconds for Euclidean CLIP"). However, it does not specify any particular hardware like GPU models (e.g., NVIDIA A100), CPU types, or other specific computational resources used for these experiments. |
| Software Dependencies | No | The paper mentions the use of CLIP (Radford et al., 2021) model, Vision Transformer (ViT) (Dosovitskiy et al., 2021), and Transformer (Vaswani et al., 2017), but does not provide specific version numbers for any software libraries, programming languages (e.g., Python), or frameworks (e.g., PyTorch, TensorFlow) used to implement the models or run the experiments. |
| Experiment Setup | Yes | In our study, we evaluate the Euclidean and hyperbolic ViT-L/16 backbones on all tasks, and for each property we add a comparison study including all backbone sizes. For fair assessment, we use the provided checkpoints from Desai et al. (2023) in a zero-shot evaluation mode. Thus, we do not train or finetune any model ourselves. All models output 512-dim embeddings; we use these embeddings directly for all our experiments and do not add any additional linear layers to the model. Tasks use the original similarity score from CLIP (Radford et al., 2021), which is the cosine similarity, and the Lorentzian inner product for hyperbolic CLIP, unless otherwise specified. Following Sanchez-Lengeling et al. (2019), we resample the test set with replacement 500 times, as is common practice. This results in a distribution of 500 data points, for which we report the mean and standard deviation for all experiments. We perform one-sided t-tests on the data distributions of the Euclidean and hyperbolic models, with the alternative hypothesis that the hyperbolic model's distribution has a higher mean than the Euclidean model's, at a significance level of 2.5%. |
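The Lorentz-model operations named in the Pseudocode row (Lorentzian inner product, exponential map) can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the standard Lorentz-model formulas with curvature c = 1, not the authors' code; the function names are ours:

```python
import numpy as np

def lorentz_inner(x, y):
    # <x, y>_L = -x_0 * y_0 + sum_i x_i * y_i  (the paper's Eq. 1)
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def exp_map_origin(v):
    # Exponential map at the hyperboloid origin (1, 0, ..., 0): lifts a
    # Euclidean tangent vector v onto the Lorentz model.
    norm = np.clip(np.linalg.norm(v, axis=-1, keepdims=True), 1e-9, None)
    return np.concatenate([np.cosh(norm), np.sinh(norm) * v / norm], axis=-1)

# Every lifted point satisfies the hyperboloid constraint <x, x>_L = -1.
x = exp_map_origin(np.array([0.3, -0.2, 0.5]))
print(float(lorentz_inner(x, x)))  # approximately -1.0
```

The logarithmic map (Eq. 8 in the paper) is the inverse of `exp_map_origin` and follows the same pattern.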
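The retrieval evaluation described in the Dataset Splits row (Lorentzian inner product as the video-text similarity, scored with recall@k against a diagonal ground truth) can be sketched as below. The embeddings are toy data lifted onto the hyperboloid, and all function names are illustrative, not from the released code:

```python
import numpy as np

def lorentz_inner(x, y):
    # Pairwise Lorentzian inner products between rows of x and rows of y.
    return -np.outer(x[:, 0], y[:, 0]) + x[:, 1:] @ y[:, 1:].T

def recall_at_k(sim, k):
    # Ground truth for query i is item i; count hits within the top-k.
    topk = np.argsort(-sim, axis=1)[:, :k]
    return float((topk == np.arange(len(sim))[:, None]).any(axis=1).mean())

# Toy "video" embeddings lifted onto the hyperboloid (curvature c = 1).
rng = np.random.default_rng(1)
tangent = rng.normal(scale=0.5, size=(8, 3))
norm = np.clip(np.linalg.norm(tangent, axis=1, keepdims=True), 1e-9, None)
pts = np.concatenate([np.cosh(norm), np.sinh(norm) * tangent / norm], axis=1)

# Self-retrieval sanity check: each point's nearest neighbour is itself,
# since <x, x>_L = -1 is the maximum the Lorentzian inner product attains
# between points on the hyperboloid.
sim = lorentz_inner(pts, pts)
print(recall_at_k(sim, 1))  # 1.0
```

For Euclidean CLIP the same `recall_at_k` applies, with `sim` replaced by a cosine-similarity matrix.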
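The statistical protocol in the Experiment Setup row (500 bootstrap resamples with replacement, then a one-sided t-test between the two models' score distributions at a 2.5% significance level) can be sketched as follows. The per-sample correctness arrays and accuracy levels are invented toy data, and `scipy.stats.ttest_ind` with `alternative="greater"` (available in SciPy ≥ 1.6) stands in for whatever test implementation the authors used:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def bootstrap_scores(correct, n_resamples=500):
    # Resample the test set with replacement and recompute accuracy each time.
    n = len(correct)
    idx = rng.integers(0, n, size=(n_resamples, n))
    return correct[idx].mean(axis=1)

# Toy per-sample correctness for two hypothetical models (invented numbers).
hyp = (rng.random(1000) < 0.62).astype(float)  # "hyperbolic" model
euc = (rng.random(1000) < 0.55).astype(float)  # "Euclidean" model

d_hyp = bootstrap_scores(hyp)
d_euc = bootstrap_scores(euc)

# One-sided t-test: alternative is that the hyperbolic mean is greater.
t, p = stats.ttest_ind(d_hyp, d_euc, alternative="greater")
print(f"hyperbolic: {d_hyp.mean():.3f} +/- {d_hyp.std():.3f}, p = {p:.2e}")
```

Each reported number in the paper corresponds to the mean and standard deviation of such a 500-point bootstrap distribution.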