Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
Authors: Zhengfeng Lai, Vasileios Saveris, Chen Chen, Hong-You Chen, Haotian Zhang, Bowen Zhang, Wenze Hu, Juan Tebar, Zhe Gan, Peter Grasch, Meng Cao, Yinfei Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We synthesize several formats of captions including Short Synthetic Captions (SSC) towards Dense Synthetic Captions (DSC+), then conduct extensive pre-training experiments to systematically study the role of synthetic captions and their intersection with original Alt Text across three multimodal foundation models. We use VeCap-300M (Lai et al., 2024), a web-crawled dataset with raw Alt Text as our main pre-training dataset for CLIP. Table 1: Effect of different synthetic captions on CLIP with ViT-B/16 as the backbone. Figure 2: Zero-shot retrieval and classification performance of CLIP models. |
| Researcher Affiliation | Industry | Apple |
| Pseudocode | No | The paper describes its methodology in natural language and illustrates its captioning pipeline with diagrams (e.g., Figure A2), but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | No explicit statement or link regarding the release of source code for the methodology described in this paper is provided. |
| Open Datasets | Yes | We use VeCap-300M (Lai et al., 2024), a web-crawled dataset with raw Alt Text as our main pre-training dataset for CLIP. We caption COCO-2017 images and visualize their distributions. Besides Alt Text, we generate several synthetic captions for the study. Then, we use ViT-B/16 as the vision encoder. The training details can be found in Appendix. |
| Dataset Splits | No | The paper mentions using established datasets like COCO, Flickr30k, and ImageNet, which typically have standard splits. It also references its own dataset, VeCap-300M (Lai et al., 2024), as a "main pre-training dataset." However, it does not explicitly detail the specific training, validation, or test splits used for its experiments within the paper itself for reproducibility. For SFT experiments, it states, "We follow the same datasets and configuration as in MM1 (McKinzie et al., 2024)." |
| Hardware Specification | Yes | For the pre-training stage, we pre-train models on up to 512 TPUs with JAX (Bradbury et al., 2018). |
| Software Dependencies | No | The paper mentions using JAX (Bradbury et al., 2018) for pre-training and T5 (Raffel et al., 2020) as a text tokenizer, but it does not specify version numbers for these or other critical software components like programming languages (e.g., Python) or deep learning frameworks (e.g., PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Table A1: Pre-training hyper-parameters and settings for the in-house CLIP. Batch size 32768, Image size 224x224 (ViT-B/16), Text maximum length 77, Steps 435,000, Optimizer AdamW (β1 = 0.9, β2 = 0.98), Peak learning rate (LR) 0.0005, LR schedule cosine decays with linear warm-up (first 2k steps), Weight decay 0.2, Dropout rate 0.0. Table A4: Pre-training hyper-parameters and settings for the Multimodal LLM experiments. Table A6: Pre-training hyper-parameters for our diffusion model based on Stable Diffusion 3. |
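The learning-rate schedule quoted from Table A1 (cosine decay with a 2k-step linear warm-up to a peak of 0.0005 over 435,000 steps) can be sketched as a small standalone function. This is an illustrative reconstruction from the reported settings, not the authors' code; the function name and signature are hypothetical.

```python
import math

def clip_lr_schedule(step: int,
                     peak_lr: float = 5e-4,
                     warmup_steps: int = 2_000,
                     total_steps: int = 435_000) -> float:
    """Cosine decay with linear warm-up, matching the Table A1 settings.

    Linearly ramps from 0 to peak_lr over the first `warmup_steps`,
    then follows a cosine curve from peak_lr down to 0 at `total_steps`.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, the schedule reaches the peak value exactly at the end of warm-up (`clip_lr_schedule(2_000) == 5e-4`) and decays to 0 at step 435,000.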