CoCa: Contrastive Captioners are Image-Text Foundation Models

Authors: Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably, on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and 91.0% with a finetuned encoder.
Researcher Affiliation | Industry | Jiahui Yu (EMAIL), Zirui Wang (EMAIL), Vijay Vasudevan, Mojtaba Seyedhosseini; Google Research. Equal contribution.
Pseudocode | No | The paper describes the methodology using natural language and architectural diagrams (Figures 1, 2, 3) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets | Yes | We use the JFT-3B dataset (Zhai et al., 2021a) with label names as the paired texts, and the ALIGN dataset (Jia et al., 2021) with noisy alt-texts... Our visual recognition experiments are conducted on ImageNet (Deng et al., 2009) as image recognition benchmark, and multiple video datasets including Kinetics-400 (Kay et al., 2017), Kinetics-600 (Carreira et al., 2018), Kinetics-700 (Carreira et al., 2019), Moments-in-Time (Monfort et al., 2019)... We evaluate CoCa on the two standard image-text retrieval benchmarks: MSCOCO (Chen et al., 2015) and Flickr30K (Plummer et al., 2015)... We consider three popular multimodal understanding benchmarks: visual question answering (VQA v2 (Goyal et al., 2017)), visual entailment (SNLI-VE (Xie et al., 2019)), and visual reasoning (NLVR2 (Suhr et al., 2018))... We evaluate video-text retrieval using CoCa on MSR-VTT (Xu et al., 2016) using the full split.
Dataset Splits | Yes | For image captioning, we apply simple cross-entropy loss (same as the captioning loss used in pretraining) and finetune the model on the training split of MSCOCO to predict for the MSCOCO test split and NoCaps online evaluation.
Hardware Specification | Yes | Pretraining CoCa takes about 5 days on 2,048 Cloud TPUv4 chips.
Software Dependencies | No | The paper mentions using the Lingvo framework and GSPMD for implementation and scaling, and a SentencePiece model for tokenization, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Following (Pham et al., 2021a), we use a batch size of 65,536 image-text pairs, where half of each batch comes from JFT and ALIGN, respectively. All models are trained on the combined contrastive and captioning objectives in Eq.(4) for 500k steps, roughly corresponding to 5 epochs on JFT and 10 epochs on ALIGN. As shown later in our studies, we find a larger captioning loss weight is better and thus λCap = 2.0 and λCon = 1.0. Following Jia et al. (2021), we apply a contrastive loss with a trainable temperature τ with an initial value of 0.07.
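The combined objective described in this row (a symmetric contrastive loss with temperature τ = 0.07 plus a token-level captioning cross-entropy, weighted λCon = 1.0 and λCap = 2.0) can be sketched as follows. This is a minimal NumPy illustration, not the paper's Lingvo implementation; all function names and shapes here are assumptions for the sketch.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch: matched image-text pairs sit on the
    # diagonal of the similarity matrix. `temperature` is a trainable scalar
    # in the paper, initialized to 0.07; here it is a fixed argument.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N)
    idx = np.arange(logits.shape[0])

    def xent(l):
        # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

def captioning_loss(token_logits, token_targets):
    # Teacher-forced cross-entropy over caption tokens (shape: (T, vocab)).
    l = token_logits - token_logits.max(axis=-1, keepdims=True)
    logp = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(token_targets)), token_targets].mean()

def coca_loss(image_emb, text_emb, token_logits, token_targets,
              lambda_con=1.0, lambda_cap=2.0):
    # Weighted sum of the two objectives, with the weights reported above.
    return (lambda_con * contrastive_loss(image_emb, text_emb)
            + lambda_cap * captioning_loss(token_logits, token_targets))
```

With identical (perfectly aligned) image and text embeddings the contrastive term is near zero, and it grows when pairs are mismatched, which is the behavior the weighted sum relies on.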