Improved baselines for vision-language pre-training

Authors: Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we first propose, implement and evaluate several baselines obtained by combining contrastive learning with recent advances in self-supervised learning. [...] With our improved training recipe for CLIP, we obtain state-of-the-art performance on four standard datasets, and consistently outperform prior work (up to +4% on the largest dataset), while being substantially simpler. The code is available at https://github.com/facebookresearch/clip-rocket
Researcher Affiliation Collaboration 1FAIR, Meta, 2University of Trento, 3Mila, Quebec AI Institute, 4McGill University, 5Canada CIFAR AI Chair
Pseudocode Yes Pseudo-code 1 CLIP training procedure [...] Pseudo-code 2 Config for image and text augmentations [...] Pseudo-code 3 Detailed implementation of CLIP.
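Pseudo-code 1 in the paper covers the CLIP training procedure; as a rough illustration of the symmetric contrastive (InfoNCE) objective that such training is built on, here is a minimal NumPy sketch. The function name and temperature value are illustrative assumptions, not the paper's implementation (which is in the released code).

```python
import numpy as np

def clip_contrastive_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                          temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays; row i of each is a matched image-text pair.
    """
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N) similarity matrix
    labels = np.arange(len(logits))           # diagonal entries are the positives

    def cross_entropy(l: np.ndarray, y: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

With perfectly matched, mutually orthogonal pairs the loss approaches zero; mismatching the pairs drives it up.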
Open Source Code Yes The code is available at https://github.com/facebookresearch/clip-rocket
Open Datasets Yes The Conceptual Captions dataset is composed of image-caption pairs [...] CC3M (Sharma et al., 2018) composed of 3.3M image-text pairs [...] CC12M (Changpinyo et al., 2021) comprising 12.4M pairs [...] The Yahoo Flickr Creative Commons dataset is composed of 100M image-text pairs (Thomee et al., 2016). [...] We test the models in the zero-shot image classification task, which is performed by computing the cosine similarity between the image representation and the representation of all the classes encoded as text and choosing the most similar class. The performance is measured in terms of accuracy on the ImageNet-1000 (Deng et al., 2009) validation set. Moreover, following Radford et al. (2021), we investigate the models' generalization using an extended set of 22 vision benchmarks of different kinds, most of which belong to the widely adopted Visual Task Adaptation Benchmark (VTAB) (Zhai et al., 2019).
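The zero-shot classification procedure quoted above can be sketched as follows; this is a minimal NumPy illustration assuming precomputed embeddings, with function and variable names of our choosing rather than from the paper or its code.

```python
import numpy as np

def zero_shot_classify(image_emb: np.ndarray, class_text_emb: np.ndarray) -> int:
    """Pick the class whose text embedding is most cosine-similar to the image.

    image_emb: (D,) embedding of one image.
    class_text_emb: (C, D) embeddings of each class name encoded as text
                    (e.g. via a prompt like "a photo of a {class}").
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    sims = txt @ img                  # cosine similarity to every class
    return int(np.argmax(sims))       # index of the most similar class
```

Accuracy is then simply the fraction of validation images for which the predicted index matches the ground-truth label.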
Dataset Splits Yes The performance is measured in terms of accuracy on the ImageNet-1000 (Deng et al., 2009) validation set. [...] Moreover, following Radford et al. (2021), we investigate the models' generalization using an extended set of 22 vision benchmarks of different kinds, most of which belong to the widely adopted Visual Task Adaptation Benchmark (VTAB) (Zhai et al., 2019).
Hardware Specification Yes the reported values were computed using eight Nvidia V100-SMX2 32GB GPUs and our recipe with ResNet-50 backbone (see Sec. 5.2).
Software Dependencies No The paper mentions using the AdamW optimizer but does not specify programming languages, deep learning frameworks, or specific library versions.
Experiment Setup Yes In most experiments, we pre-train the model for 32 epochs following Li et al. (2021), using the AdamW optimizer (Loshchilov & Hutter, 2017) (betas 0.9 and 0.98), with learning rate 0.003 (or 0.002 for experiments on the 29M dataset) regulated by linear warmup (1 epoch) plus cosine scheduler (final learning rate 10⁻⁵). Mini-batches are composed of 4096 image-text pairs. To provide regularization to the training, weight decay is applied with magnitude 0.1 on all parameters except for biases and normalization layers. For the smaller CC3M dataset, we use weight decay 0.5. The dropout probability in the text encoder varies depending on the dataset size, e.g., no dropout on YFCC15M and probability 0.2 on CC3M, while label smoothing is applied with a smoothing factor of 0.1 regardless of the dataset.
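The schedule and weight-decay exclusion quoted above can be sketched as follows. This is a plausible stdlib-only illustration, not the paper's code: the helper names are ours, and the name-matching heuristic for biases and normalization layers is an assumption.

```python
import math

def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               base_lr: float = 3e-3, final_lr: float = 1e-5) -> float:
    """Linear warmup for `warmup_steps`, then cosine decay down to `final_lr`."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))

def split_decay_params(named_params, weight_decay: float = 0.1):
    """Apply weight decay to all parameters except biases and normalization layers.

    `named_params` is an iterable of (name, parameter) pairs, as produced by
    e.g. a framework's named_parameters(); matching on names is a heuristic.
    """
    decay, no_decay = [], []
    for name, param in named_params:
        if name.endswith("bias") or "norm" in name.lower():
            no_decay.append(param)
        else:
            decay.append(param)
    return [{"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0}]
```

The two parameter groups would then be handed to an AdamW-style optimizer so that biases and normalization weights are exempt from decay, matching the recipe described in the quote.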