What If We Recaption Billions of Web Images with LLaMA-3?
Authors: Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe an average 3.1% improvement in zero-shot performance across four cross-modal retrieval tasks using a mixed set of the original and our captions. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. |
| Researcher Affiliation | Collaboration | University of California, Santa Cruz; University of Edinburgh; Johns Hopkins University; Adobe Inc.; University of Texas at Austin. |
| Pseudocode | No | The paper describes a 'recaptioning pipeline' and 'training procedures' but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The project page is https://www.haqtu.me/Recap-Datacomp-1B/. While this project page is mentioned, the paper does not provide a direct link to a source-code repository for the described methodology, nor does it state that the authors' implementation is released in supplementary materials or via a code release statement. |
| Open Datasets | Yes | Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B-powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. Image-URL and original caption license: DataComp-1B distributes the image URL-text samples and metadata under a standard Creative Commons CC BY 4.0 license. Our improvements can be considered a derivative work of DataComp-1B. Therefore, we will continue to use the CC BY 4.0 license for our release, retain the attribution to the original authors, and clearly state that the work is based on the original DataComp-1B dataset. |
| Dataset Splits | Yes | Evaluation. The efficacy of Recap-CLIP is gauged via several metrics. We evaluate zero-shot image classification on the ImageNet-1K dataset (Russakovsky et al., 2015) and assess zero-shot cross-modal retrieval performance using the validation set of MSCOCO 2014 (Lin et al., 2014) and the test set of Flickr30K (Young et al., 2014)¹, following established practices (Radford et al., 2021a; Li et al., 2023b; Zhai et al., 2023; 2022). ¹We employ the widely used Karpathy split (Karpathy & Fei-Fei, 2015) of MSCOCO and Flickr30K. |
| Hardware Specification | Yes | We benchmark the inference speed of LLaVA-1.5-LLaMA3-8B on TPU-v4-256 hardware, achieving a throughput of 382 images per second. At this rate, generating captions for the entire Recap-DataComp-1B dataset (~1 billion images) would take approximately 29 days of continuous computation. Regarding CLIP training, training a ViT-L/16 model for two epochs (~2.56 billion total samples) on Recap-DataComp-1B requires 1 day on TPU-v3-256 infrastructure. For DiT training, training a base-sized DiT model with a batch size of 2048 for 650K steps takes approximately 7 days using TPU-v3-256 hardware. |
| Software Dependencies | No | The paper mentions models and frameworks like LLaMA-3, LLaVA, CLIP, and Diffusion Transformers, and optimizers like AdamW, but does not provide specific version numbers for any software libraries (e.g., Python, PyTorch, CUDA versions) used for implementation. |
| Experiment Setup | Yes | We set the text token length to 128 to accommodate the learning of long captions presented in Recap-DataComp-1B. We conduct experiments using three model scales: S/16, B/16, and L/16, with details listed in Appendix Table 7. The AdamW (Loshchilov & Hutter, 2017) optimizer is used for training. In the pre-training phase, the model is trained on 2.56 billion samples at a reduced image size of 112, including a warm-up phase involving 51.2 million samples. The batch size and base learning rate are set to 32,768 and 8e-6, respectively. For the subsequent fine-tuning phase, we increase the image size to 224 and train the model on 128 million samples with a 25.6 million sample warm-up. We adjust the batch size to 16,384 and the learning rate to 4e-7. |
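The two-stage CLIP schedule in the Experiment Setup row can be sanity-checked by converting the reported sample counts into optimizer steps. This is a minimal sketch; the `StageConfig` class, its field names, and the step-count helpers are illustrative assumptions, not from the paper, and only the numeric values are taken from the row above.

```python
# Illustrative sketch (not the authors' code): deriving optimizer step counts
# for the two-stage Recap-CLIP schedule from the reported hyperparameters.
from dataclasses import dataclass


@dataclass
class StageConfig:
    samples_seen: int      # total training samples consumed in this stage
    warmup_samples: int    # samples consumed during learning-rate warm-up
    batch_size: int
    base_lr: float
    image_size: int

    @property
    def total_steps(self) -> int:
        # One optimizer step per batch (integer division).
        return self.samples_seen // self.batch_size

    @property
    def warmup_steps(self) -> int:
        return self.warmup_samples // self.batch_size


# Numbers from the paper's reported setup: pre-train at 112px, fine-tune at 224px.
pretrain = StageConfig(2_560_000_000, 51_200_000, 32_768, 8e-6, 112)
finetune = StageConfig(128_000_000, 25_600_000, 16_384, 4e-7, 224)

print(pretrain.total_steps)  # 78125 optimizer steps in pre-training
print(finetune.total_steps)
```

Note that 2.56 billion samples at batch size 32,768 works out to exactly 78,125 steps, consistent with the stated "two epochs" over the ~1.3-billion-sample dataset.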