Should VLMs be Pre-trained with Image Data?

Authors: Sedrick Keh, Jean Mercat, Samir Yitzhak Gadre, Kushal Arora, Igor Vasiljevic, Benjamin Burchfiel, Shuran Song, Russ Tedrake, Thomas Kollar, Ludwig Schmidt, Achal Dave

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | To understand when image data should be introduced to VLM training, we train a suite of 300 models over various numbers of parameters, varying the amount of text-only pre-training data as well as the amount, type, and ratio of image pre-training data (Figure 1). We then fine-tune these models and evaluate their downstream performance on a suite of vision-language and text-only tasks. Our experiments suggest several key findings: first, incorporating image data during pre-training generally helps, especially after a model has seen many text tokens.
Researcher Affiliation | Collaboration | ¹Toyota Research Institute, ²Stanford, ³MIT
Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. The methodology is described in narrative text within sections such as "Experimental Setup" and "Training Procedure".
Open Source Code | No | We plan to make our code and our testbed of models publicly available, and we hope that our findings will provide a strong empirical foundation for open-source VLM pre-training.
Open Datasets | Yes | We train with the DataCompDR-1B caption dataset (Vasu et al., 2024), an enhancement over DataComp-1B (Gadre et al., 2024a) that regenerates higher-quality captions. We fine-tune our pre-trained models with the LLaVA dataset (Liu et al., 2024) using the Prismatic framework (Karamcheti et al., 2024). VQA benchmarks: VQAv2 (Goyal et al., 2017): general visual reasoning.
Dataset Splits | Yes | We fine-tune our pre-trained models with the LLaVA dataset (Liu et al., 2024). For each model trained in the previous step, we fine-tune for {1, 2, 3, 4} epochs. We evaluate on a subset of vision-language tasks used by Karamcheti et al. (2024), and on a suite of text tasks taken from Gadre et al. (2024b), conducting our evaluations with Eleuther's LM Harness (Gao et al., 2024).
Hardware Specification | Yes | A100 GPU hours are at M = 150. [...] For the 1.4B scale, a batch size of 256 performs slightly better than 512. [...] Total: 106k A100 hours (from Table 2).
Software Dependencies | No | The paper mentions several codebases, such as the OpenLM codebase (Gururangan et al., 2023), the Prismatic codebase (Karamcheti et al., 2024), and Eleuther's LM Harness (Gao et al., 2024), but does not provide specific version numbers for these or for any other software libraries or programming languages used.
Experiment Setup | Yes | The pre-training learning rate schedule is warmup-cosine with a peak learning rate of 10⁻² and a final learning rate of 10⁻⁵. We train all our models with a token multiplier of 20, following approximate Chinchilla-optimal scaling (Hoffmann et al., 2022). For each model trained in the previous step, we fine-tune for {1, 2, 3, 4} epochs. From Appendix D: learning rate 3·10⁻⁴, warmup ratio 0.05, Adam β₂ = 0.95, fine-tune epochs 4, batch size 256.
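The hyperparameters quoted above can be sketched in code. This is a minimal illustration, not the authors' implementation: the step counts and parameter count below are placeholder assumptions; only the peak/final learning rates, the warmup-cosine shape, and the token multiplier of 20 come from the paper.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps,
                     peak_lr=1e-2, final_lr=1e-5):
    """Warmup-cosine schedule: linear warmup to peak_lr, then cosine
    decay to final_lr (peak/final values as reported in the paper)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

def token_budget(n_params, multiplier=20):
    """Chinchilla-style pre-training token budget with the paper's
    token multiplier of 20: tokens = 20 * parameters."""
    return multiplier * n_params

# Hypothetical step counts for illustration only.
total_steps, warmup_steps = 10_000, 500
lr_mid = warmup_cosine_lr(total_steps // 2, total_steps, warmup_steps)
tokens = token_budget(1.4e9)  # e.g. the 1.4B-parameter scale → 28B tokens
```

The schedule decays smoothly from `peak_lr` to `final_lr`; at the final step the cosine term reaches −1, so the returned rate equals `final_lr` exactly.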