Should VLMs be Pre-trained with Image Data?
Authors: Sedrick Keh, Jean Mercat, Samir Yitzhak Gadre, Kushal Arora, Igor Vasiljevic, Benjamin Burchfiel, Shuran Song, Russ Tedrake, Thomas Kollar, Ludwig Schmidt, Achal Dave
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To understand when image data should be introduced to VLM training, we train a suite of 300 models across various parameter counts, varying the amount of text-only pre-training data as well as the amount, type, and ratio of image pre-training data (Figure 1). We then fine-tune these models and evaluate their downstream performance on a suite of vision-language and text-only tasks. Our experiments suggest several key findings: first, incorporating image data during pre-training generally helps, especially after a model has seen many text tokens. |
| Researcher Affiliation | Collaboration | 1Toyota Research Institute, 2Stanford, 3MIT, EMAIL |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. The methodology is described in narrative text within sections like 'EXPERIMENTAL SETUP' and 'TRAINING PROCEDURE'. |
| Open Source Code | No | We plan to make our code and our testbed of models publicly available, and we hope that our findings will provide a strong empirical foundation for open-source VLM pre-training. |
| Open Datasets | Yes | We train with the DataCompDR-1B caption dataset (Vasu et al., 2024), which enhances DataComp-1B (Gadre et al., 2024a) by regenerating higher-quality captions. We fine-tune our pre-trained models with the LLaVA dataset (Liu et al., 2024) using the Prismatic framework (Karamcheti et al., 2024). VQA benchmarks: VQAv2 (Goyal et al., 2017): general visual reasoning. |
| Dataset Splits | Yes | We fine-tune our pre-trained models with the LLaVA dataset (Liu et al., 2024)... For each model trained in the previous step, we fine-tune for {1, 2, 3, 4} epochs. We evaluate on a subset of vision-language tasks used by Karamcheti et al. (2024). We evaluate on a suite of tasks taken from Gadre et al. (2024b) and conduct our evaluations with Eleuther's LM Harness (Gao et al., 2024). |
| Hardware Specification | Yes | A100 GPU hours are at M = 150. For the 1.4B scale, a batch size of 256 performs slightly better than 512. [...] 106k A100 hours (from Table 2). |
| Software Dependencies | No | The paper mentions several codebases, such as the 'OpenLM (Gururangan et al., 2023) codebase', the 'Prismatic codebase (Karamcheti et al., 2024)', and 'Eleuther's LM Harness (Gao et al., 2024)', but does not provide specific version numbers for any of these or for any other software libraries or programming languages used. |
| Experiment Setup | Yes | The pre-training learning rate schedule is warmup-cosine with a peak learning rate of 10^-2 and a final learning rate of 10^-5. We train all our models with a token multiplier of 20, following approximate Chinchilla-optimal scaling (Hoffmann et al., 2022). For each model trained in the previous step, we fine-tune for {1, 2, 3, 4} epochs. From Appendix D: Learning rate: 3×10^-4; Warmup ratio: 0.05; Adam optimizer: β2 = 0.95; Fine-tune epochs: 4; Batch size: 256. |
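The Experiment Setup row describes a warmup-cosine schedule (linear warmup to a peak learning rate of 10^-2, cosine decay to a final rate of 10^-5). A minimal sketch of such a schedule is below; the `warmup_ratio` argument and per-step granularity are assumptions for illustration, not details confirmed by the paper.

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr=1e-2, final_lr=1e-5, warmup_ratio=0.05):
    """Linear warmup to peak_lr, then cosine decay to final_lr.

    step: current optimizer step (0-indexed).
    total_steps: total number of training steps.
    """
    warmup_steps = max(1, int(warmup_ratio * total_steps))
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr over the warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to final_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

For example, with `total_steps=1000` the rate reaches `peak_lr` at the end of warmup (step 50) and decays to `final_lr` by the final step.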