Compositional Entailment Learning for Hyperbolic Vision-Language Models
Authors: Avik Pal, Max van Spengler, Guido D'Amely di Melendugno, Alessandro Flaborea, Fabio Galasso, Pascal Mettes
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluation of a hyperbolic vision-language model trained on millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning as well as recent hyperbolic alternatives, with better zero-shot and retrieval generalization and clearly stronger hierarchical performance. |
| Researcher Affiliation | Academia | 1: University of Amsterdam, 2: Sapienza University of Rome, 3: ItalAI, 4: Procederai |
| Pseudocode | No | The paper describes the methods using mathematical equations and textual explanations, but it does not include a distinct pseudocode or algorithm block. |
| Open Source Code | Yes | Code available at https://github.com/PalAvik/hycoclip. |
| Open Datasets | Yes | We train our models using the large-scale training corpus Grounded Image-Text Pairs (GRIT) dataset (Peng et al., 2023) containing 20.5 million grounded vision-language pairs which are processed from the even larger COYO-700M (Byeon et al., 2022) dataset. We similarly use the grounding procedure on the Red Caps dataset (Desai et al., 2021) originally used to train MERU. Additionally, we use the smaller-scale grounded Conceptual Captions 3M (CC3M) (Li et al., 2023; Sharma et al., 2018) dataset for hyperparameter search. |
| Dataset Splits | Yes | We perform this task zero-shot on the COCO validation set (Lin et al., 2014) and the Flickr30K test set (Young et al., 2014; Karpathy & Fei-Fei, 2015). We use the WordNet hierarchy (Miller, 1994) of the ImageNet class labels (Deng et al., 2009; Russakovsky et al., 2015) for the hierarchical classification task. We report the average precision (AP) on the 17 novel categories data split (Bansal et al., 2018). |
| Hardware Specification | Yes | We train our models on 4 A100 GPUs for 500k steps using a batch size of 768 on an internal cluster. |
| Software Dependencies | No | The paper mentions tools like spaCy (Honnibal et al., 2020) and optimizers like AdamW (Loshchilov & Hutter, 2019), but does not provide specific version numbers for these or other software libraries (e.g., Python, PyTorch). |
| Experiment Setup | Yes | The maximum input token size is set to 77 with a vocab size of 49,408. For the vision encoder, we use the small and base Vision Transformer (Dosovitskiy et al., 2021; Chen et al., 2021; Touvron et al., 2021) backbones with a patch size of 16. Images are resized using border padding and random cropping (with scale [0.5, 1.0]) to 224 × 224. We train HyCoCLIP with a fixed curvature value of the Lorentz model on the grounded CC3M dataset for 40k steps. ... We scale our batch of vectors before projecting it to the hyperboloid using learnable scalars c_img and c_txt in the image and text modes, respectively. These scalars are initialized to c_img = c_txt = 1/512. The adaptive softmax temperature of the contrastive loss is initialized at τ = 0.07 and clipped at 0.01. In the hCE loss (Equations 10, 11), we set separate values of the η parameter for inter-modality entailments (η_inter = 0.7) and intra-modality entailments (η_intra = 1.2). In the final hC loss, we set the weight of the hCE loss to γ = 0.1. We train our models on 4 A100 GPUs for 500k steps using a batch size of 768. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with hyperparameters β1 = 0.9, β2 = 0.98 and weight decay 0.2. We use a cosine learning rate scheduler (Loshchilov & Hutter, 2017) with a maximum learning rate of 5 × 10⁻⁴ and a linear warmup for the initial 4k steps. |
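The projection and temperature details reported in the setup row above can be sketched in PyTorch. This is a minimal illustrative sketch, not the authors' released code: it assumes a Lorentz model with curvature 1, uses the exponential map at the origin for the lift onto the hyperboloid, and all names (`LorentzProjector`, `log_tau`) are hypothetical.

```python
import math
import torch
import torch.nn as nn

class LorentzProjector(nn.Module):
    """Scale Euclidean embeddings by a learnable scalar (initialized to 1/512,
    as in the setup description), then lift them onto the Lorentz-model
    hyperboloid via the exponential map at the origin. Illustrative only."""

    def __init__(self, init_scale: float = 1 / 512, curvature: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_scale))
        self.c = curvature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale the batch of vectors before projecting, per the setup text.
        v = x * self.scale
        norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
        sq = math.sqrt(self.c)
        # Space component of exp map at the origin: sinh(√c‖v‖) · v / (√c‖v‖).
        space = torch.sinh(sq * norm) * v / (sq * norm)
        # Time component recovered from the constraint x0² − ‖x_s‖² = 1/c.
        time = torch.sqrt(1.0 / self.c + space.pow(2).sum(-1, keepdim=True))
        return torch.cat([time, space], dim=-1)

# Adaptive softmax temperature: initialized at τ = 0.07 and clipped at 0.01.
log_tau = nn.Parameter(torch.tensor(math.log(0.07)))
tau = log_tau.exp().clamp(min=0.01)
```

Parameterizing the temperature in log space keeps it positive during optimization; the clamp enforces the 0.01 floor mentioned in the setup.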