Causal Graphical Models for Vision-Language Compositional Understanding

Authors: Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi, Rita Cucchiara

ICLR 2025

Reproducibility variables, with the assessed result and the supporting LLM response for each:
Research Type: Experimental. Evidence: "Using extensive experiments on five compositional benchmarks, we show that our method significantly outperforms all the state-of-the-art compositional approaches by a large margin, and it also improves over methods trained using much larger datasets. Our model weights and code are publicly available."
Researcher Affiliation: Academia. Evidence: "Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi & Rita Cucchiara, AImage Lab, Dipartimento di Ingegneria Enzo Ferrari, University of Modena and Reggio Emilia"
Pseudocode: No. The paper describes the methodology and model architecture in detail, including figures (e.g., Fig. 2 for the decoder illustration), but it does not present any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code: Yes. Evidence: "Our model weights and code are publicly available." https://github.com/aimagelab/COGT
Open Datasets: Yes. Evidence: "In our evaluation we use four common compositional benchmarks: ARO (Yuksekgonul et al., 2023), SugarCrepe (Hsieh et al., 2024), VL-CheckList (Zhao et al., 2022) and ColorSwap (Burapacheep et al., 2024), and an additional benchmark FG-OVD (Bianchi et al., 2024) which we propose in this paper. ... In the experiments of this section we follow a widely adopted protocol, first proposed in (Yuksekgonul et al., 2023), in which the VLM backbone is CLIP and the only training dataset is COCO (Lin et al., 2014). ... Additionally, we present additional results training on a combination of three datasets: COCO, CC3M (Sharma et al., 2018), and Visual Genome (Krishna et al., 2017)."
Dataset Splits: Yes. Evidence: "In the experiments of this section we follow a widely adopted protocol, first proposed in (Yuksekgonul et al., 2023), in which the VLM backbone is CLIP and the only training dataset is COCO (Lin et al., 2014). ... COGT-CLIP, COGT-XVLM and COGT-InstructBLIP are trained on COCO only (~100K training samples, see Sec. 4.1). Moreover, following (Zhang et al., 2024), we present additional results training on a combination of three datasets: COCO, CC3M (Sharma et al., 2018), and Visual Genome (Krishna et al., 2017). In this case, we use a decoder D with four blocks. Note that we use only 50K samples from Visual Genome because we removed those training data which overlap with ARO and VL-CheckList. ... Following Yuksekgonul et al. (2023), we select the best checkpoint using the validation set provided in (Yuksekgonul et al., 2023)."
Hardware Specification: Yes. Evidence: "COGT-CLIP and COGT-CLIP+ require respectively 8 and 72 hours to train on a single RTX A5000 GPU with a batch size of 128 using the datasets of Sec. 4.2. ... All times are computed using an RTX A5000 GPU with a batch size of 32."
Software Dependencies: No. The paper names software components such as the Adam optimizer, but it does not provide version numbers for any libraries or frameworks used (e.g., PyTorch, Python). Evidence: "We train in mixed precision (FP16) with batch size set to 128 on a GPU RTX A5000 with 24GB of VRAM for 10 epochs. ... In all the datasets and in all the experiments, we use the Adam optimizer with an initial learning rate set to 5e-4. Finally, we apply a Cosine Annealing Learning Rate Scheduler with 50 warmup steps."
Experiment Setup: Yes. Evidence: "We train in mixed precision (FP16) with batch size set to 128 on a GPU RTX A5000 with 24GB of VRAM for 10 epochs. Following Yuksekgonul et al. (2023), we select the best checkpoint using the validation set provided in (Yuksekgonul et al., 2023). In all the datasets and in all the experiments, we use the Adam optimizer with an initial learning rate set to 5e-4. Finally, we apply a Cosine Annealing Learning Rate Scheduler with 50 warmup steps."
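The reported optimizer settings can be written out as a learning-rate schedule. This is a minimal sketch, not the authors' code: the paper states only "Adam with an initial learning rate of 5e-4" and "a Cosine Annealing Learning Rate Scheduler with 50 warmup steps", so the linear warmup shape, the final learning rate of zero, and the names `lr_at`, `BASE_LR`, and `WARMUP_STEPS` are illustrative assumptions.

```python
import math

# Sketch of the schedule described in the report: linear warmup for
# 50 steps up to lr 5e-4, then cosine annealing. Warmup shape and the
# final lr of 0 are assumptions; the paper does not specify them.
BASE_LR = 5e-4
WARMUP_STEPS = 50

def lr_at(step, total_steps):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the base learning rate.
        return BASE_LR * (step + 1) / WARMUP_STEPS
    # Cosine annealing from BASE_LR down to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))

# Example: 10 epochs over ~100K COCO samples with batch size 128.
total = 10 * (100_000 // 128)
print(lr_at(0, total))             # first warmup step, well below 5e-4
print(lr_at(WARMUP_STEPS, total))  # → 0.0005 (full base lr after warmup)
print(lr_at(total, total))         # annealed to 0 at the last step
```

In PyTorch this behavior is commonly obtained by chaining a warmup scheduler with `torch.optim.lr_scheduler.CosineAnnealingLR`; the standalone function above just makes the resulting curve explicit.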