GMAIL: Generative Modality Alignment for generated Image Learning

Authors: Shentong Mo, Sukmin Yun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments across a wide range of vision-language tasks, we demonstrate the effectiveness of our framework by incorporating it into various vision-language models such as LLaVA (Liu et al., 2023). In this section, we provide the experimental setup, evaluation metrics, and comparative analysis conducted to validate the effectiveness of our method. Through rigorous experimentation on a diverse set of datasets, we assess our model on image captioning, zero-shot image retrieval, and zero-shot image classification, comparing it against existing benchmarks to highlight our contributions.
Researcher Affiliation | Academia | 1 Department of Machine Learning, CMU, USA; 2 Department of Machine Learning, MBZUAI, UAE; 3 Department of Artificial Intelligence, Hanyang University ERICA, South Korea. Correspondence to: Sukmin Yun <EMAIL>.
Pseudocode | Yes | B. GMAIL Algorithm. In this section, we outline the algorithm that implements the Generative Modality Alignment for generated Image Learning (GMAIL) framework, incorporating the Gen-CLIP flow for training on generated images and the CLIP flow for inference on real images. The algorithm also details the cross-modality alignment loss and how we ensure alignment with vision-language models (VLMs) such as CLIPCap (Mokady et al., 2021), LLaVA (Liu et al., 2023), and LLaMA-3 (Meta, 2024). Algorithm 1 summarizes the training and inference process: the model is trained on generated images using the Gen-CLIP flow, is subsequently applied to real images during inference, and integrates the aligned generated and real data with vision-language models such as CLIPCap, LLaVA, and LLaMA-3 for downstream tasks.
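The cross-modality alignment loss mentioned above is contrastive in the CLIP style (the paper later reports τ = 0.07). A minimal sketch, assuming the loss is a standard symmetric InfoNCE over batch-paired generated-image and text embeddings; the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def symmetric_infonce(img_emb, txt_emb, tau=0.07):
    """CLIP-style symmetric contrastive loss over a batch of paired
    image and text embeddings, each of shape [batch, dim]."""
    # L2-normalize so the logits are cosine similarities scaled by 1/tau.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau           # [batch, batch] similarity matrix
    labels = np.arange(len(logits))      # matching pairs sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With correctly paired embeddings the diagonal logits dominate and the loss is small; shuffling one modality within the batch breaks the pairing and the loss rises, which is the signal that drives alignment.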
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include a link to a code repository for the described methodology.
Open Datasets | Yes | Our experiments leverage a comprehensive collection of datasets to evaluate the versatility and effectiveness of our proposed Gen-Real alignment framework. We focus on a diverse set of tasks, including image captioning, zero-shot image retrieval, and zero-shot image classification, ensuring broad coverage across various domains. Please refer to Appendix Section A for the detailed dataset settings. Appendix A lists all datasets with citations: COCO (Lin et al., 2014), Flickr30k (Young et al., 2014), ShareGPT4V (Chen et al., 2024), CC3M (Sharma et al., 2018), CC12M (Changpinyo et al., 2021), DTD (Cimpoi et al., 2014), Stanford Cars (Krause et al., 2013), SUN397 (Xiao et al., 2010; 2014), Food-101 (Bossard et al., 2014), Aircraft (Maji et al., 2013), Oxford Pets (Parkhi et al., 2012), Caltech-101 (Fei-Fei et al., 2004), and ImageNet-1K (Deng et al., 2009).
Dataset Splits | Yes | We adopt Stable Diffusion v2 (Rombach et al., 2022) to generate synthetic images using captions from the COCO (Lin et al., 2014) train2014 set. For zero-shot evaluation on both retrieval and image classification tasks, we follow the setup detailed in the original CLIP (Radford et al., 2021) paper. The number of generated images is consistent with the number of text-image pairs in the original training set: 560k for COCO, 3.3 million for CC3M, and 12 million for CC12M.
Hardware Specification | Yes | The synthetic training data were generated using Stable Diffusion v2 on NVIDIA A100-80GB GPUs.
Software Dependencies | No | The paper mentions several models and optimizers used (e.g., Stable Diffusion v2, AdamW optimizer, CLIP model, LoRA method) but does not provide specific version numbers for software libraries or programming languages required for reproduction.
Experiment Setup | Yes | During fine-tuning, we use a rank of 4 in Low-Rank Adaptation (LoRA) to adjust the model parameters specifically for generated images... For optimization, we use the AdamW optimizer with a learning rate of 1e-4 and weight decay of 0.01. We employ a cosine annealing schedule with warm restarts to dynamically adjust the learning rate... For contrastive learning, we set the temperature parameter τ = 0.07 and optimize using the AdamW optimizer with a learning rate of 1e-4 and a batch size of 256. Fine-tuning for Proj for Real and Proj for Gen was performed for 50,000 steps.
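The rank-4 LoRA fine-tuning above keeps the pretrained weight frozen and learns a low-rank additive update. A minimal sketch of that forward pass, assuming the common LoRA formulation W + (α/r)·BA with B zero-initialized (the dimensions and the α scaling here are illustrative conventions, not values stated in the paper):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Forward pass through a LoRA-adapted linear layer.

    x: inputs [batch, d_in]
    W: frozen pretrained weight [d_out, d_in]
    A: trainable down-projection [r, d_in]  (r = 4 in the paper)
    B: trainable up-projection [d_out, r], zero-initialized so that
       training starts exactly at the pretrained model
    """
    r = A.shape[0]
    delta = B @ A                          # rank-r update to W
    return x @ (W + (alpha / r) * delta).T

# Illustrative shapes: a rank-4 adapter on a hypothetical 768->768 projection.
rng = np.random.default_rng(0)
d, r = 768, 4
W = rng.normal(size=(d, d))                # frozen pretrained weight
A = rng.normal(size=(r, d))                # trainable
B = np.zeros((d, r))                       # zero-init => no change at step 0
x = np.ones((2, d))
out = lora_forward(x, W, A, B)
```

Because B starts at zero, the adapted layer initially reproduces the pretrained output exactly; only the small A and B matrices (2·r·d parameters instead of d²) receive gradient updates during fine-tuning.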