Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline that has been validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Training Vision-Language Transformers from Captions
Authors: Liangke Gui, Yingshan Chang, Qiuyuan Huang, Subhojit Som, Alexander G Hauptmann, Jianfeng Gao, Yonatan Bisk
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive and carefully controlled studies suggest that none of the above factors is absolutely important in achieving versatile vision-language representations. We conclude our analysis with suggestions on the choices of initialization, architectural components, and annotation formats targeting a better balance between data efficiency and representation quality. ... We evaluate the effectiveness of our pre-training strategy under both zero-shot and fine-tuning settings. VLC is competitive across a diverse set of standard V+L benchmarks, broadly categorized into Image-Text Retrieval, Image-Text Understanding, and Image-Text Grounding. ... We ablate different combinations of objectives and train VLCBase with 4M image-text pairs. Figure 4 (left) shows that, as the training steps increase, there is a consistent improvement for VLC with MIM. |
| Researcher Affiliation | Collaboration | Liangke Gui^1, Yingshan Chang^1, Qiuyuan Huang^2, Subhojit Som^2, Alex Hauptmann^1, Jianfeng Gao^2, Yonatan Bisk^1; ^1 Carnegie Mellon University, ^2 Microsoft Research |
| Pseudocode | Yes | Algorithm 1: PUSH. Input: heatmap Â^r_actv, image size (W, H), step t, measure M: (Â^r_actv, B) → ℝ, order O ← Permutation([T, L, B, R]). Output: box coordinates B = (x, y, w, h) s.t. 0 ≤ x ≤ x + w ≤ W and 0 ≤ y ≤ y + h ≤ H. Initialize (x, y, w, h) ← (0, 0, W, H) and B ← (x, y, w, h). While w > 0 and h > 0 and move = false, for each edge e ∈ O: B′ ← (x, y + t, w, h − t) if e = T; (x, y, w, h − t) if e = B; (x, y, w − t, h) if e = R; (x + t, y, w − t, h) otherwise. If M(Â^r_actv, B′) > M(Â^r_actv, B), then B ← B′, (x, y, w, h) ← B, move ← true. Return B. |
| Open Source Code | Yes | github.com/guilk/VLC |
| Open Datasets | Yes | Following previous work (Chen et al., 2020b; Kim et al., 2021; Li et al., 2021a; Dou et al., 2022a), our pre-training corpus comprises four commonly used vision-language datasets including COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), Google Conceptual Captions (Sharma et al., 2018) and SBU Captions (Ordonez et al., 2011), totalling 4.0M unique images and 5.1M image-text pairs. To show the benefits of data-scaling, we also use the VinVL (Zhang et al., 2021) pretraining data which includes Flickr30k (Young et al., 2014), GQA (Hudson & Manning, 2019), VQA (Goyal et al., 2017), VG-QAs (Krishna et al., 2017) and a subset of Open Images (Krasin et al., 2016). |
| Dataset Splits | Yes | We fine-tune our model on image-text retrieval: Flickr30K (Plummer et al., 2015) and MSCOCO (Lin et al., 2014), image-text understanding: VQAv2 (Goyal et al., 2017) and NLVR2 (Suhr et al., 2018), and image-text grounding: Refcoco (Yu et al., 2016; Mao et al., 2016) tasks. ... For retrieval tasks, we follow the standard splits and evaluate our models in the finetuning settings. For VQAv2, we follow the standard practice (Chen et al., 2020b; Li et al., 2021a) to train the models with both training, validation and additional question-answer pairs from Visual Genome while reserving 1,000 validation samples for internal validation. For Refcoco, we propose a new algorithm to obtain bounding box outputs based on affinities between patch and text encodings. ... MSCOCO contains 123K images, and each image has five corresponding human-written captions. We split the data into 82K/5K/5K training/validation/test images. ... Flickr30K contains 31K images with five captions for each image. We split the data into 30K/1K/1K as the training/validation/test set. |
| Hardware Specification | Yes | Inference time is reported on 1 RTX2080Ti with batch_size=1, averaged across the same 1K randomly selected samples from Refcocog_val(umd) over 5 launches. |
| Software Dependencies | No | For text inputs, we tokenize text with the bert-base-uncased and bert-large-uncased tokenizer, respectively. ... We use AdamW (Loshchilov & Hutter, 2018) with a weight decay of 0.01. ... We use the special token [CLS] as the fused representation of both modalities, and feed h_[CLS] to the ITM head. The learning target L_ITM can be formulated as |
| Experiment Setup | Yes | We pretrain two variants of the multi-modal encoder which use an 86M-parameter ViT-B/16, denoted VLC_Base, and a 307M-parameter ViT-L/16, denoted VLC_Large. Both variants are initialized with MAE pre-trained on ImageNet-1K without labels. For text inputs, we tokenize text with the bert-base-uncased and bert-large-uncased tokenizer, respectively. The text embedding parameters are learned from scratch, in lieu of loading pre-trained BERT weights. We randomly mask image patches with a probability of 0.6 and text tokens with a probability of 0.15. To accelerate training, we follow MAE (He et al., 2022) and skip the mask token [MASK] in the encoder and only apply it in the lightweight decoder. We use AdamW (Loshchilov & Hutter, 2018) with a weight decay of 0.01. The learning rate is warmed up to 1e-4 in the first 10% of total training steps and is decayed to zero for the rest of the training following a linear schedule. During pre-training, we resize the shorter edge of input images to 384, take random image crops of resolution 384×384, and apply RandAugment (Cubuk et al., 2020). We pre-train for 200k steps with a batch size of 4,096. ... For downstream understanding and retrieval tasks, we fine-tune our model with a learning rate of 5e-4 for 10 epochs. We use 480×480 as the input image resolution for the VQA task and 384×384 for NLVR2 and image-text retrieval tasks. ... We finetune VLC on the combined Refcoco/+/g training sets for 50 epochs with the AdamW (Loshchilov & Hutter, 2018) optimizer and a 5e-4 learning rate. |
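The PUSH pseudocode quoted in the table can be sketched in plain Python. This is a hedged reconstruction, not the authors' implementation: the extracted loop-and-flag structure is ambiguous, so the sketch adopts one plausible reading (push each edge inward by step t while the measure improves), and `heatmap_score` is an illustrative stand-in for the abstract measure M, here the mean activation inside the box.

```python
import random

def heatmap_score(heatmap, box):
    """Toy measure M: mean activation inside the box.
    An assumption for illustration; the paper leaves M abstract."""
    x, y, w, h = box
    cells = [v for row in heatmap[y:y + h] for v in row[x:x + w]]
    return sum(cells) / len(cells) if cells else 0.0

def push(heatmap, W, H, t):
    """Shrink the full-image box (0, 0, W, H) edge by edge.
    Edges are visited in a random order, mirroring
    O <- Permutation([T, L, B, R]) in the quoted pseudocode."""
    order = random.sample(["T", "L", "B", "R"], 4)
    box = (0, 0, W, H)
    for e in order:
        while True:
            x, y, w, h = box
            if w <= t or h <= t:
                break
            # Candidate box with edge e pushed inward by step t.
            if e == "T":
                cand = (x, y + t, w, h - t)
            elif e == "B":
                cand = (x, y, w, h - t)
            elif e == "R":
                cand = (x, y, w - t, h)
            else:  # "L"
                cand = (x + t, y, w - t, h)
            # Keep pushing this edge only while the measure improves.
            if heatmap_score(heatmap, cand) > heatmap_score(heatmap, box):
                box = cand
            else:
                break
    return box
```

On a toy heatmap with a bright 2×2 region, the greedy edge pushes converge onto that region regardless of the edge order, since each shrink is accepted only when the mean activation strictly increases.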
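The pre-training learning-rate schedule described in the setup (linear warmup to 1e-4 over the first 10% of steps, then linear decay to zero) can be written as a small function. `lr_at` and its defaults are illustrative names, with `total_steps` mirroring the reported 200k steps; this is a sketch of the stated schedule, not the authors' training code.

```python
def lr_at(step, total_steps=200_000, peak_lr=1e-4, warmup_frac=0.1):
    """Linear warmup to peak_lr over the first warmup_frac of
    training, then linear decay to zero at total_steps."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

For example, with the defaults the rate peaks at step 20,000 and falls to half the peak (5e-5) at step 110,000, the midpoint of the decay phase.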