Compositional Entailment Learning for Hyperbolic Vision-Language Models
Authors: Avik Pal, Max van Spengler, Guido D'Amely di Melendugno, Alessandro Flaborea, Fabio Galasso, Pascal Mettes
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluation of a hyperbolic vision-language model trained on millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning as well as recent hyperbolic alternatives, with better zero-shot and retrieval generalization and clearly stronger hierarchical performance. |
| Researcher Affiliation | Academia | 1: University of Amsterdam, 2: Sapienza University of Rome, 3: ItalAI, 4: Procederai |
| Pseudocode | No | The paper describes the methods using mathematical equations and textual explanations, but it does not include a distinct pseudocode or algorithm block. |
| Open Source Code | Yes | Code available at https://github.com/PalAvik/hycoclip. |
| Open Datasets | Yes | We train our models using the large-scale training corpus Grounded Image-Text Pairs (GRIT) dataset (Peng et al., 2023) containing 20.5 million grounded vision-language pairs which are processed from the even larger COYO-700M (Byeon et al., 2022) dataset. We similarly use the grounding procedure on the Red Caps dataset (Desai et al., 2021) originally used to train MERU. Additionally, we use the smaller-scale grounded Conceptual Captions 3M (CC3M) (Li et al., 2023; Sharma et al., 2018) dataset for hyperparameter search. |
| Dataset Splits | Yes | We perform this task zero-shot on the COCO validation set (Lin et al., 2014) and the Flickr30K test set (Young et al., 2014; Karpathy & Fei-Fei, 2015). We use the WordNet hierarchy (Miller, 1994) of the ImageNet class labels (Deng et al., 2009; Russakovsky et al., 2015) for the hierarchical classification task. We report the average precision (AP) on the 17 novel categories data split (Bansal et al., 2018). |
| Hardware Specification | Yes | We train our models on 4 A100 GPUs for 500k steps using a batch size of 768 on an internal cluster. |
| Software Dependencies | No | The paper mentions tools like spaCy (Honnibal et al., 2020) and optimizers like AdamW (Loshchilov & Hutter, 2019), but does not provide specific version numbers for these or other software libraries (e.g., Python, PyTorch). |
| Experiment Setup | Yes | The maximum input token size is set to 77 with a vocab size of 49,408. For the vision encoder, we use the small and base Vision Transformer (Dosovitskiy et al., 2021; Chen et al., 2021; Touvron et al., 2021) backbones with a patch size of 16. Images are resized using border padding and random cropping (with scale [0.5, 1.0]) to 224 × 224. We train HyCoCLIP with a fixed curvature value of the Lorentz model on the grounded CC3M dataset for 40k steps. ... We scale our batch of vectors before projecting it to the hyperboloid using learnable scalars c_img and c_txt in the image and text modes, respectively. These scalars are initialized to c_img = c_txt = 1/512. The adaptive softmax temperature of the contrastive loss is initialized at τ = 0.07 and clipped at 0.01. In the hCE loss (Equations 10, 11), we set separate values of the η parameter for inter-modality entailments (η_inter = 0.7) and intra-modality entailments (η_intra = 1.2). In the final hC loss, we set the weight of the hCE loss to γ = 0.1. We train our models on 4 A100 GPUs for 500k steps using a batch size of 768. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with hyperparameters β1 = 0.9, β2 = 0.98 and weight decay 0.2. We use a cosine learning rate scheduler (Loshchilov & Hutter, 2017) with a maximum learning rate of 5 × 10⁻⁴ and a linear warmup for the initial 4k steps. |
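The projection and temperature details reported in the setup row above can be sketched in PyTorch. This is a minimal illustrative sketch, not the authors' released code: it assumes a Lorentz model with curvature 1, uses the exponential map at the origin for the lift onto the hyperboloid, and all names (`LorentzProjector`, `log_tau`) are hypothetical.

```python
import math
import torch
import torch.nn as nn

class LorentzProjector(nn.Module):
    """Scale Euclidean embeddings by a learnable scalar (initialized to 1/512,
    as in the setup description), then lift them onto the Lorentz-model
    hyperboloid via the exponential map at the origin. Illustrative only."""

    def __init__(self, init_scale: float = 1 / 512, curvature: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_scale))
        self.c = curvature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale the batch of vectors before projecting, per the setup text.
        v = x * self.scale
        norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
        sq = math.sqrt(self.c)
        # Space component of exp map at the origin: sinh(√c‖v‖) · v / (√c‖v‖).
        space = torch.sinh(sq * norm) * v / (sq * norm)
        # Time component recovered from the constraint x0² − ‖x_s‖² = 1/c.
        time = torch.sqrt(1.0 / self.c + space.pow(2).sum(-1, keepdim=True))
        return torch.cat([time, space], dim=-1)

# Adaptive softmax temperature: initialized at τ = 0.07 and clipped at 0.01.
log_tau = nn.Parameter(torch.tensor(math.log(0.07)))
tau = log_tau.exp().clamp(min=0.01)
```

Parameterizing the temperature in log space keeps it positive during optimization; the clamp enforces the 0.01 floor mentioned in the setup.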