Gramian Multimodal Representation Learning and Alignment
Authors: Giordano Cicchetti, Eleonora Grassucci, Luigi Sigillo, Danilo Comminiello
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally show how rethinking the multimodal alignment process with GRAM brings consistent improvements over the state of the art in downstream tasks, validating our intuition that more modalities together provide richer semantic information. Specifically, the multimodal model pretrained with the GRAM contrastive loss outperforms SOTA models by 5 to 10 points in tasks such as video-audio-text retrieval and audio-video classification. Section 4, Experimental Evidences: "In this section, we present the main results of the proposed GRAM contrastive loss and model in downstream tasks. In addition, we show that the multimodal latent space built with GRAM is more meaningful and disentangled than others." Table 1: Zero-shot multimodal text-to-video (T2V) and video-to-text (V2T) retrieval results in terms of Recall at 1 (R@1). Table 2: Finetuning multimodal text-to-video (T2V) and video-to-text (V2T) retrieval results in terms of Recall at 1 (R@1). Figure 4: t-SNE visualization on VGGSound of VAST (cosine-based, left) and GRAM (right) latent spaces. |
| Researcher Affiliation | Academia | Giordano Cicchetti, Eleonora Grassucci, Luigi Sigillo, Danilo Comminiello. Dept. of Information Engineering, Electronics, and Telecomm., Sapienza University of Rome, Italy. {name.surname}@uniroma1.it |
| Pseudocode | No | The paper describes methods and mathematical formulations but does not contain explicitly labeled pseudocode or algorithm blocks. The methods are described narratively within the text. |
| Open Source Code | Yes | The project page, the code, and the pretrained models are available at https://ispamm.github.io/GRAM/. First, the source code implementing the multimodal representation learning model is available as part of the supplementary materials. The code includes all scripts necessary for training, evaluation, and data processing, while pretrained models will be released after the reviewing process. |
| Open Datasets | Yes | As downstream datasets, we consider several well-known multimodal benchmarks that can be divided into three categories: (i) three-modal video-based, such as DiDeMo and ActivityNet, in which the crucial modality is video and the two other modalities (audio and text) are supportive; (ii) four-modal video-based, such as MSR-VTT and VATEX, in which video is the main modality but audio, text, and subtitles are also supportive; and (iii) audio-based, like AudioCaps and VGGSound, in which the audio modality is the most relevant, while video and text also contain interesting information. Details about datasets, samples, and resolutions are in Appendix B. Datasets: MSR-VTT (Xu et al., 2016), VATEX (Wang et al., 2019), DiDeMo (Hendricks et al., 2017), ActivityNet (Caba Heilbron et al., 2015), AudioCaps (Kim et al., 2019), VGGSound (Chen et al., 2020). |
| Dataset Splits | Yes | For every dataset, we utilize the official split for retrieval tasks; the dataset splits and the number of frames for fine-tuning and evaluation on all datasets are given in Tab. 5. We pretrain the GRAM-based model on a subset of the VAST27M (Chen et al., 2023c) dataset comprising 150k random samples, with a learning rate of 1e-4, the AdamW optimizer with weight decay, and a batch size of 256. For finetuning, we reduce the batch size to 64 and change the number of epochs according to the specific dataset; the complete details are shown in Tab. 5. Table 5: Dataset statistics and hyperparameters. Modalities stand for T: text, V: video, A: audio, S: subtitles, D: depth. # Frames refers to both training and inference. |
| Hardware Specification | Yes | We set the batch size to 256 and a single epoch pretraining on 4 NVIDIA A100 cards. |
| Software Dependencies | No | The paper mentions encoders like BERT-B, BEATs, and EVAClip-ViT-G and optimizers like AdamW, but does not provide specific version numbers for these software components or the overall programming environment (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We pretrain the GRAM-based model on a subset of the VAST27M (Chen et al., 2023c) dataset comprising 150k random samples, with a learning rate of 1e-4, the AdamW optimizer with weight decay, and a batch size of 256. For finetuning, we reduce the batch size to 64 and change the number of epochs according to the specific dataset; the complete details are shown in Tab. 5. |
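The quoted evidence names the GRAM contrastive loss but not its construction. A minimal sketch of the Gramian idea the title refers to, assuming (as the name suggests) that alignment across k modalities is scored by the volume of the k-dimensional parallelotope spanned by the unit-normalized modality embeddings, i.e. the square root of the Gram determinant; the helper name `gram_volume` is hypothetical, not from the paper's released code:

```python
import numpy as np

def gram_volume(embeddings):
    """Volume of the parallelotope spanned by the row vectors:
    sqrt(det(A @ A.T)), where A @ A.T is the Gram matrix of
    pairwise inner products of the unit-normalized embeddings."""
    A = np.asarray(embeddings, dtype=np.float64)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)  # unit-normalize each modality vector
    G = A @ A.T                                       # Gram matrix
    return float(np.sqrt(np.linalg.det(G)))

# Orthogonal (misaligned) modality embeddings span maximal volume...
v_text  = np.array([1.0, 0.0, 0.0, 0.0])
v_video = np.array([0.0, 1.0, 0.0, 0.0])
v_audio = np.array([0.0, 0.0, 1.0, 0.0])
print(gram_volume([v_text, v_video, v_audio]))  # → 1.0

# ...while near-collinear (well-aligned) embeddings collapse it toward 0,
# which is why a contrastive loss can minimize this volume for matching tuples.
w_video = np.array([0.9, 0.1, 0.0, 0.0])
w_audio = np.array([0.9, 0.0, 0.1, 0.0])
print(gram_volume([v_text, w_video, w_audio]))  # small positive value
```

Unlike pairwise cosine similarity, this single scalar couples all modalities at once, which matches the quoted claim that "more modalities altogether provide richer semantic information."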