Improved baselines for vision-language pre-training

Authors: Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we first propose, implement and evaluate several baselines obtained by combining contrastive learning with recent advances in self-supervised learning. [...] With our improved training recipe for CLIP, we obtain state-of-the-art performance on four standard datasets, and consistently outperform prior work (up to +4% on the largest dataset), while being substantially simpler. The code is available at https://github.com/facebookresearch/clip-rocket
Researcher Affiliation Collaboration 1FAIR, Meta, 2University of Trento, 3Mila, Quebec AI Institute, 4McGill University, 5Canada CIFAR AI Chair
Pseudocode Yes Pseudo-code 1 CLIP training procedure [...] Pseudo-code 2 Config for image and text augmentations [...] Pseudo-code 3 Detailed implementation of CLIP.
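Pseudo-code 1 in the paper covers the CLIP training procedure; as a rough illustration of the symmetric contrastive (InfoNCE) objective that such training is built on, here is a minimal NumPy sketch. The function name and temperature value are illustrative assumptions, not the paper's implementation (which is in the released code).

```python
import numpy as np

def clip_contrastive_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                          temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays; row i of each is a matched image-text pair.
    """
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N) similarity matrix
    labels = np.arange(len(logits))           # diagonal entries are the positives

    def cross_entropy(l: np.ndarray, y: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

With perfectly matched, mutually orthogonal pairs the loss approaches zero; mismatching the pairs drives it up.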
Open Source Code Yes The code is available at https://github.com/facebookresearch/clip-rocket
Open Datasets Yes The Conceptual Captions dataset is composed of image-caption pairs [...] CC3M (Sharma et al., 2018) composed of 3.3M image-text pairs [...] CC12M (Changpinyo et al., 2021) comprising 12.4M pairs [...] The Yahoo Flickr Creative Commons dataset is composed of 100M image-text pairs (Thomee et al., 2016). [...] We test the models in the zero-shot image classification task, which is performed by computing the cosine similarity between the image representation and the representation of all the classes encoded as text and choosing the most similar class. The performance is measured in terms of accuracy on the ImageNet-1000 (Deng et al., 2009) validation set. Moreover, following Radford et al. (2021), we investigate the models' generalization using an extended set of 22 vision benchmarks of different kinds, most of which belong to the widely adopted Visual Task Adaptation Benchmark (VTAB) (Zhai et al., 2019).
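The zero-shot classification procedure quoted above can be sketched as follows; this is a minimal NumPy illustration assuming precomputed embeddings, with function and variable names of our choosing rather than from the paper or its code.

```python
import numpy as np

def zero_shot_classify(image_emb: np.ndarray, class_text_emb: np.ndarray) -> int:
    """Pick the class whose text embedding is most cosine-similar to the image.

    image_emb: (D,) embedding of one image.
    class_text_emb: (C, D) embeddings of each class name encoded as text
                    (e.g. via a prompt like "a photo of a {class}").
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    sims = txt @ img                  # cosine similarity to every class
    return int(np.argmax(sims))       # index of the most similar class
```

Accuracy is then simply the fraction of validation images for which the predicted index matches the ground-truth label.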
Dataset Splits Yes The performance is measured in terms of accuracy on the ImageNet-1000 (Deng et al., 2009) validation set. [...] Moreover, following Radford et al. (2021), we investigate the models' generalization using an extended set of 22 vision benchmarks of different kinds, most of which belong to the widely adopted Visual Task Adaptation Benchmark (VTAB) (Zhai et al., 2019).
Hardware Specification Yes the reported values were computed using eight Nvidia V100-SMX2 32GB GPUs and our recipe with ResNet-50 backbone (see Sec. 5.2).
Software Dependencies No The paper mentions using the AdamW optimizer but does not specify programming languages, deep learning frameworks, or specific library versions.
Experiment Setup Yes In most experiments, we pre-train the model for 32 epochs following Li et al. (2021), using the AdamW optimizer (Loshchilov & Hutter, 2017) (betas 0.9 and 0.98), with learning rate 0.003 (or 0.002 for experiments on the 29M dataset) regulated by linear warmup (1 epoch) plus cosine scheduler (final learning rate 10⁻⁵). Mini-batches are composed of 4096 image-text pairs. To provide regularization to the training, weight decay is applied with magnitude 0.1 on all parameters except for biases and normalization layers. For the smaller CC3M dataset, we use weight decay 0.5. The dropout probability in the text encoder varies depending on the dataset size, e.g., no dropout on YFCC15M and probability 0.2 on CC3M, while label smoothing is applied with a smoothing factor of 0.1 regardless of the dataset.
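The schedule and weight-decay exclusion quoted above can be sketched as follows. This is a plausible stdlib-only illustration, not the paper's code: the helper names are ours, and the name-matching heuristic for biases and normalization layers is an assumption.

```python
import math

def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               base_lr: float = 3e-3, final_lr: float = 1e-5) -> float:
    """Linear warmup for `warmup_steps`, then cosine decay down to `final_lr`."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))

def split_decay_params(named_params, weight_decay: float = 0.1):
    """Apply weight decay to all parameters except biases and normalization layers.

    `named_params` is an iterable of (name, parameter) pairs, as produced by
    e.g. a framework's named_parameters(); matching on names is a heuristic.
    """
    decay, no_decay = [], []
    for name, param in named_params:
        if name.endswith("bias") or "norm" in name.lower():
            no_decay.append(param)
        else:
            decay.append(param)
    return [{"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0}]
```

The two parameter groups would then be handed to an AdamW-style optimizer so that biases and normalization weights are exempt from decay, matching the recipe described in the quote.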