Should VLMs be Pre-trained with Image Data?
Authors: Sedrick Keh, Jean Mercat, Samir Yitzhak Gadre, Kushal Arora, Igor Vasiljevic, Benjamin Burchfiel, Shuran Song, Russ Tedrake, Thomas Kollar, Ludwig Schmidt, Achal Dave
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To understand when image data should be introduced to VLM training, we train a suite of 300 models across various parameter counts, varying the amount of text-only pre-training data as well as the amount, type, and ratio of image pre-training data (Figure 1). We then fine-tune these models and evaluate their downstream performance on a suite of vision-language and text-only tasks. Our experiments suggest several key findings: first, incorporating image data during pre-training generally helps, especially after a model has seen many text tokens. |
| Researcher Affiliation | Collaboration | 1Toyota Research Institute, 2Stanford, 3MIT, EMAIL |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. The methodology is described in narrative text within sections like 'EXPERIMENTAL SETUP' and 'TRAINING PROCEDURE'. |
| Open Source Code | No | We plan to make our code and our testbed of models publicly available, and we hope that our findings will provide a strong empirical foundation for open-source VLM pre-training. |
| Open Datasets | Yes | We train with the DataCompDR-1B caption dataset (Vasu et al., 2024), which enhances DataComp-1B (Gadre et al., 2024a) by regenerating higher-quality captions. We fine-tune our pre-trained models with the LLaVA dataset (Liu et al., 2024) using the Prismatic framework (Karamcheti et al., 2024). VQA benchmarks: VQAv2 (Goyal et al., 2017): general visual reasoning. |
| Dataset Splits | Yes | We fine-tune our pre-trained models with the LLaVA dataset (Liu et al., 2024)... For each model trained in the previous step, we fine-tune for {1, 2, 3, 4} epochs. We evaluate on a subset of vision-language tasks used by Karamcheti et al. (2024). We evaluate on a suite of tasks taken from Gadre et al. (2024b) and conduct our evaluations with Eleuther's LM Harness (Gao et al., 2024). |
| Hardware Specification | Yes | A100 GPU hours are at M = 150. For the 1.4B scale, a batch size of 256 performs slightly better than 512. [...] 106k A100 hours (from Table 2). |
| Software Dependencies | No | The paper mentions several codebases, such as the 'OpenLM (Gururangan et al., 2023) codebase', the 'Prismatic codebase (Karamcheti et al., 2024)', and 'Eleuther's LM Harness (Gao et al., 2024)', but does not provide specific version numbers for any of these or for any other software libraries or programming languages used. |
| Experiment Setup | Yes | The pre-training learning rate schedule is warmup-cosine with a peak learning rate of 10^-2 and a final learning rate of 10^-5. We train all our models with a token multiplier of 20, following approximate Chinchilla-optimal scaling (Hoffmann et al., 2022). For each model trained in the previous step, we fine-tune for {1, 2, 3, 4} epochs. From Appendix D: Learning rate: 3×10^-4; Warmup ratio: 0.05; Adam optimizer: β2 = 0.95; Fine-tune epochs: 4; Batch size: 256. |
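The Experiment Setup row describes a warmup-cosine schedule (linear warmup to a peak learning rate of 10^-2, cosine decay to a final rate of 10^-5). A minimal sketch of such a schedule is below; the `warmup_ratio` argument and per-step granularity are assumptions for illustration, not details confirmed by the paper.

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr=1e-2, final_lr=1e-5, warmup_ratio=0.05):
    """Linear warmup to peak_lr, then cosine decay to final_lr.

    step: current optimizer step (0-indexed).
    total_steps: total number of training steps.
    """
    warmup_steps = max(1, int(warmup_ratio * total_steps))
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr over the warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to final_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

For example, with `total_steps=1000` the rate reaches `peak_lr` at the end of warmup (step 50) and decays to `final_lr` by the final step.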