Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
Authors: Zhengfeng Lai, Vasileios Saveris, Chen Chen, Hong-You Chen, Haotian Zhang, Bowen Zhang, Wenze Hu, Juan Tebar, Zhe Gan, Peter Grasch, Meng Cao, Yinfei Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We synthesize several formats of captions including Short Synthetic Captions (SSC) towards Dense Synthetic Captions (DSC+), then conduct extensive pre-training experiments to systematically study the role of synthetic captions and their intersection with original Alt Text across three multimodal foundation models. We use VeCap-300M (Lai et al., 2024), a web-crawled dataset with raw Alt Text as our main pre-training dataset for CLIP. Table 1: Effect of different synthetic captions on CLIP with ViT-B/16 as the backbone. Figure 2: Zero-shot retrieval and classification performance of CLIP models. |
| Researcher Affiliation | Industry | Apple |
| Pseudocode | No | The paper describes its methodology in natural language and illustrates its captioning pipeline with diagrams (e.g., Figure A2), but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | No explicit statement or link regarding the release of source code for the methodology described in this paper is provided. |
| Open Datasets | Yes | We use VeCap-300M (Lai et al., 2024), a web-crawled dataset with raw Alt Text as our main pre-training dataset for CLIP. We caption COCO-2017 images and visualize their distributions. Besides Alt Text, we generate several synthetic captions for the study. Then, we use ViT-B/16 as the vision encoder. The training details can be found in Appendix. |
| Dataset Splits | No | The paper mentions using established datasets like COCO, Flickr30k, and ImageNet, which typically have standard splits. It also references its own dataset, VeCap-300M (Lai et al., 2024), as a "main pre-training dataset." However, it does not explicitly detail the specific training, validation, or test splits used for its experiments within the paper itself for reproducibility. For SFT experiments, it states, "We follow the same datasets and configuration as in MM1 (McKinzie et al., 2024)." |
| Hardware Specification | Yes | For the pre-training stage, we pre-train models on up to 512 TPUs with JAX (Bradbury et al., 2018). |
| Software Dependencies | No | The paper mentions using JAX (Bradbury et al., 2018) for pre-training and T5 (Raffel et al., 2020) as a text tokenizer, but it does not specify version numbers for these or other critical software components like programming languages (e.g., Python) or deep learning frameworks (e.g., PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Table A1: Pre-training hyper-parameters and settings for the in-house CLIP. Batch size 32768, Image size 224x224 (ViT-B/16), Text maximum length 77, Steps 435,000, Optimizer AdamW (β1 = 0.9, β2 = 0.98), Peak learning rate (LR) 0.0005, LR schedule cosine decays with linear warm-up (first 2k steps), Weight decay 0.2, Dropout rate 0.0. Table A4: Pre-training hyper-parameters and settings for the Multimodal LLM experiments. Table A6: Pre-training hyper-parameters for our diffusion model based on Stable Diffusion 3. |
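The learning-rate schedule quoted from Table A1 (cosine decay with a 2k-step linear warm-up to a peak of 0.0005 over 435,000 steps) can be sketched as a small standalone function. This is an illustrative reconstruction from the reported settings, not the authors' code; the function name and signature are hypothetical.

```python
import math

def clip_lr_schedule(step: int,
                     peak_lr: float = 5e-4,
                     warmup_steps: int = 2_000,
                     total_steps: int = 435_000) -> float:
    """Cosine decay with linear warm-up, matching the Table A1 settings.

    Linearly ramps from 0 to peak_lr over the first `warmup_steps`,
    then follows a cosine curve from peak_lr down to 0 at `total_steps`.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, the schedule reaches the peak value exactly at the end of warm-up (`clip_lr_schedule(2_000) == 5e-4`) and decays to 0 at step 435,000.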