CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning

Authors: Qingqing Cao, Mahyar Najibi, Sachin Mehta

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | With extensive experiments on 31 datasets spanning different vision and vision-language tasks, we show that CtrlSynth substantially improves zero-shot classification, image-text retrieval, and compositional reasoning performance of CLIP models.
Researcher Affiliation | Industry | 1Apple. 2Work done at Apple. Correspondence to: Qingqing Cao <EMAIL>.
Pseudocode | No | The paper describes the CtrlSynth pipeline and its components (vision tagging model, large language model, text-to-image model, and controllers) in detail, including synthesis paths and example prompts for LLMs (Figure 3). However, it does not present formally structured pseudocode or an algorithm block.
Open Source Code | No | The paper mentions using existing libraries and models (e.g., CoreNet, the LIFT codebase, Florence-large, Qwen2-7B-Instruct, Mistral-NeMo-instruct, stable-diffusion-xl-base-1.0, the vLLM engine, the diffusers library) but does not provide an explicit statement or link announcing an open-source release of the CtrlSynth methodology or its implementation.
Open Datasets | Yes | For pretraining CLIP models, we use two public image-text datasets: CC3M (Sharma et al., 2018) and CC12M (Changpinyo et al., 2021), and DataComp-1B (Gadre et al., 2023). To evaluate the representation quality of pretrained CLIP models, we measure the zero-shot performance on classification, retrieval, and compositional reasoning tasks. For image classification, we use 25 common vision datasets, including five ImageNet (Deng et al., 2009; Recht et al., 2019) variants and the tasks from the VTAB benchmark (Zhai et al., 2020). We list the detailed dataset information in Appendix A.2. We use COCO (Lin et al., 2014) and Flickr30k (Plummer et al., 2015) for image-to-text and text-to-image retrieval tasks and report the metrics in recall@1. SugarCrepe (Hsieh et al., 2023) is a recent benchmark that measures the compositional understanding of vision-language models; we report the zero-shot accuracy numbers. Additionally, to study the effects of CtrlSynth on long-tail tasks, we evaluate the task accuracy on the Places-LT and ImageNet-LT datasets (Liu et al., 2019) by augmenting the tail classes with CtrlSynth synthetic data.
Dataset Splits | No | The paper notes that the 'test sets of both datasets are balanced' for ImageNet-LT and Places-LT, and it uses standard evaluation benchmarks such as SugarCrepe. It also describes the strategy for augmenting tail classes ('We generate 7 samples per tail class so that we roughly double the size of the original real datasets.'). However, it does not explicitly provide train/validation/test split percentages or methodology for the main pretraining datasets (CC3M, CC12M, DataComp-1B), nor for the long-tail datasets beyond the test set and the augmentation strategy.
Hardware Specification | Yes | Table 8: Training hyper-parameters. (a) Pretraining CLIP on CC3M and CC12M: 8× A100 GPUs (40 GB) for CC3M; 32× A100 GPUs (40 GB) for CC12M. (b) Finetuning CLIP on Places-LT and ImageNet-LT: 1× A100 GPU (40 GB).
Software Dependencies | No | The paper mentions several software components and models, such as 'AdamW', the 'CoreNet library (Mehta et al., 2024a; 2022)', the 'LIFT codebase (Shi et al., 2024)', 'Florence-large (Xiao et al., 2024)', 'Qwen2-7B-Instruct (Yang et al., 2024a)', the 'Mistral-NeMo-instruct model (AI, 2024)', 'stable-diffusion-xl-base-1.0 (Podell et al., 2024)', the 'vLLM engine (Kwon et al., 2023)', and the 'diffusers (von Platen et al., 2022) library'. While specific model versions are sometimes given (e.g., stable-diffusion-xl-base-1.0), general software dependencies such as Python, PyTorch, or pinned versions of CoreNet, LIFT, and the diffusers library are not consistently provided, so full reproducibility is not ensured.
Experiment Setup | Yes | Table 8: Training hyper-parameters.
(a) Pretraining CLIP on CC3M and CC12M:
  Hyperparameter      CC3M         CC12M
  Total iterations    56,429       55,429
  Warmup iterations   2,822        2,771
  Image size          224          224
  LR scheduler        Cosine       Cosine
  Max. LR             0.002        0.002
  Min. LR             0.00002      0.00002
  Optimizer           AdamW        AdamW
  AdamW βs            (0.9, 0.98)  (0.9, 0.98)
  Weight decay        0.2          0.2
  Batch size per GPU  256          256
(b) Finetuning CLIP on Places-LT and ImageNet-LT:
  Hyperparameter      Places-LT      ImageNet-LT
  Total iterations    56,429         55,429
  Warmup iterations   2,822          2,771
  Image size          224            224
  Loss type           Cross Entropy  Cross Entropy
  LR scheduler        Cosine         Cosine
  Learning rate       0.01           0.01
  Optimizer           SGD            SGD
  Momentum            0.9            0.9
  Weight decay        5e-4           5e-4
  Batch size per GPU  128            128
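Since the paper does not release training code, the reported schedule (cosine decay between the stated max/min learning rates, preceded by a warmup phase) can only be reconstructed approximately. A minimal sketch using the CC3M column of Table 8(a), assuming a linear warmup from zero (the warmup shape is not stated in the quoted table):

```python
import math

def lr_at(step, total_steps=56_429, warmup_steps=2_822,
          max_lr=0.002, min_lr=0.00002):
    """Cosine LR schedule with assumed linear warmup (Table 8(a), CC3M column)."""
    if step < warmup_steps:
        # Linear warmup from 0 up to max_lr.
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(2_822))    # peak after warmup: 0.002
print(lr_at(56_429))   # end of training: 2e-05
```

The endpoints match the table (Max. LR 0.002, Min. LR 0.00002); only the warmup interpolation is an assumption.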