GMAIL: Generative Modality Alignment for generated Image Learning

Authors: Shentong Mo, Sukmin Yun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments across a wide range of vision-language tasks, we demonstrate the effectiveness of our framework by incorporating it into various vision-language models such as LLaVA (Liu et al., 2023). In this section, we provide the experimental setup, evaluation metrics, and comparative analysis conducted to validate the effectiveness of our method. Through rigorous experimentation on a diverse set of datasets, we assess our model on image captioning, zero-shot image retrieval, and zero-shot image classification, comparing it against existing benchmarks to highlight our contributions.
Researcher Affiliation | Academia | 1 Department of Machine Learning, CMU, USA; 2 Department of Machine Learning, MBZUAI, UAE; 3 Department of Artificial Intelligence, Hanyang University ERICA, South Korea. Correspondence to: Sukmin Yun <EMAIL>.
Pseudocode | Yes | B. GMAIL Algorithm. In this section, we outline the algorithm that implements the Generative Modality Alignment for generated Image Learning (GMAIL) framework, incorporating the Gen-CLIP flow for training on generated images and the CLIP flow for inference on real images. The algorithm also details the cross-modality alignment loss and how we ensure alignment with vision-language models (VLMs) such as CLIPCap (Mokady et al., 2021), LLaVA (Liu et al., 2023), and LLaMA-3 (Meta, 2024). Algorithm 1 summarizes the training and inference process: the model is trained on generated images using the Gen-CLIP flow, is subsequently applied to real images during inference, and integrates the aligned generated and real data with vision-language models such as CLIPCap, LLaVA, and LLaMA-3 for downstream tasks.
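The cross-modality alignment loss mentioned above is contrastive in the CLIP style (the paper later reports τ = 0.07). A minimal sketch, assuming the loss is a standard symmetric InfoNCE over batch-paired generated-image and text embeddings; the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def symmetric_infonce(img_emb, txt_emb, tau=0.07):
    """CLIP-style symmetric contrastive loss over a batch of paired
    image and text embeddings, each of shape [batch, dim]."""
    # L2-normalize so the logits are cosine similarities scaled by 1/tau.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau           # [batch, batch] similarity matrix
    labels = np.arange(len(logits))      # matching pairs sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With correctly paired embeddings the diagonal logits dominate and the loss is small; shuffling one modality within the batch breaks the pairing and the loss rises, which is the signal that drives alignment.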
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include a link to a code repository for the described methodology.
Open Datasets | Yes | Our experiments leverage a comprehensive collection of datasets to evaluate the versatility and effectiveness of our proposed Gen-Real alignment framework. We focus on a diverse set of tasks, including image captioning, zero-shot image retrieval, and zero-shot image classification, ensuring broad coverage across various domains. Please refer to Appendix Section A for the detailed dataset settings. Appendix A lists all datasets with citations: COCO (Lin et al., 2014), Flickr30k (Young et al., 2014), ShareGPT4V (Chen et al., 2024), CC3M (Sharma et al., 2018), CC12M (Changpinyo et al., 2021), DTD (Cimpoi et al., 2014), Stanford Cars (Krause et al., 2013), SUN397 (Xiao et al., 2010; 2014), Food-101 (Bossard et al., 2014), Aircraft (Maji et al., 2013), Oxford Pets (Parkhi et al., 2012), Caltech-101 (Fei-Fei et al., 2004), and ImageNet-1K (Deng et al., 2009).
Dataset Splits | Yes | We adopt Stable Diffusion v2 (Rombach et al., 2022) to generate synthetic images using captions from the COCO (Lin et al., 2014) train2014 set. For zero-shot evaluation on both retrieval and image classification tasks, we follow the setup detailed in the original CLIP (Radford et al., 2021) paper. The number of generated images is consistent with the number of text-image pairs in the original training set: 560k for COCO, 3.3 million for CC3M, and 12 million for CC12M.
Hardware Specification | Yes | The synthetic training data were generated using Stable Diffusion v2 on NVIDIA A100-80GB GPUs.
Software Dependencies | No | The paper mentions several models and optimizers used (e.g., Stable Diffusion v2, AdamW optimizer, CLIP model, LoRA method) but does not provide specific version numbers for software libraries or programming languages required for reproduction.
Experiment Setup | Yes | During fine-tuning, we use a rank of 4 in Low-Rank Adaptation (LoRA) to adjust the model parameters specifically for generated images... For optimization, we use the AdamW optimizer with a learning rate of 1e-4 and weight decay of 0.01. We employ a cosine annealing schedule with warm restarts to dynamically adjust the learning rate... For contrastive learning, we set the temperature parameter τ = 0.07 and optimize using the AdamW optimizer with a learning rate of 1e-4 and a batch size of 256. Fine-tuning for Proj for Real and Proj for Gen was performed for 50,000 steps.
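The rank-4 LoRA fine-tuning above keeps the pretrained weight frozen and learns a low-rank additive update. A minimal sketch of that forward pass, assuming the common LoRA formulation W + (α/r)·BA with B zero-initialized (the dimensions and the α scaling here are illustrative conventions, not values stated in the paper):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Forward pass through a LoRA-adapted linear layer.

    x: inputs [batch, d_in]
    W: frozen pretrained weight [d_out, d_in]
    A: trainable down-projection [r, d_in]  (r = 4 in the paper)
    B: trainable up-projection [d_out, r], zero-initialized so that
       training starts exactly at the pretrained model
    """
    r = A.shape[0]
    delta = B @ A                          # rank-r update to W
    return x @ (W + (alpha / r) * delta).T

# Illustrative shapes: a rank-4 adapter on a hypothetical 768->768 projection.
rng = np.random.default_rng(0)
d, r = 768, 4
W = rng.normal(size=(d, d))                # frozen pretrained weight
A = rng.normal(size=(r, d))                # trainable
B = np.zeros((d, r))                       # zero-init => no change at step 0
x = np.ones((2, d))
out = lora_forward(x, W, A, B)
```

Because B starts at zero, the adapted layer initially reproduces the pretrained output exactly; only the small A and B matrices (2·r·d parameters instead of d²) receive gradient updates during fine-tuning.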