Vision Learners Meet Web Image-Text Pairs

Authors: Bingchen Zhao, Quan Cui, Hao Wu, Osamu Yoshie, Cheng Yang, Oisin Mac Aodha

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental First, we conduct a benchmark study of representative self-supervised pre-training methods on large-scale web data in a like-for-like setting. We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training. We present empirical results across a variety of downstream vision tasks to further validate the advantages of MUG. For example, MUG outperforms the previous best performing methods by 2.0% mIoU when transferred to the ADE20K benchmark and by 0.5% on ImageNet-1k classification. Extensive ablations are conducted to help better illuminate the impact of critical model and design choices.
Researcher Affiliation Collaboration Bingchen Zhao1, Quan Cui2, Hao Wu2, Osamu Yoshie3, Cheng Yang2, Oisin Mac Aodha1 1 University of Edinburgh 2 Bytedance 3 Waseda University
Pseudocode Yes Algorithm 1 Pseudocode for MUG.

# img, txt: image-text paired data
# txt_mask: causal mask for captioning
def text_decoder(q, kv, mask):
    q = single_modal_attn(q, mask)
    cap_res = multimodal_attn(q, kv)
    return cap_res

patch_img = patchify(img)
masked_token = masking(patch_img)   # [N, L, D]
latent = vit_encoder(masked_token)  # [N, L, D]

# Generative objective for image
recon_img = mae_decoder(latent)     # [N, L, D]
recon_loss = mse_loss(img, recon_img)

# Generative objective for text
label, txt = txt[:, 1:, :], txt[:, :-1, :]
txt_feat = tokenizer(txt)           # [N, L, D]
cap_res = text_decoder(q=txt_feat, kv=latent, mask=txt_mask)
cap_loss = ce_loss(label, cap_res)

loss = recon_weight * recon_loss + cap_weight * cap_loss
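The pseudocode above combines a masked-image reconstruction (MSE) objective with a captioning (cross-entropy) objective. The following is a minimal, self-contained numpy sketch of that weighted loss combination only; the shapes, the random toy tensors, and the weights `recon_weight`/`cap_weight` are illustrative assumptions, not the paper's actual model or configuration.

```python
import numpy as np

def mse_loss(target, pred):
    # Generative objective for image: mean squared error over all elements.
    return float(np.mean((target - pred) ** 2))

def ce_loss(labels, logits):
    # Generative objective for text: token-level cross-entropy.
    # labels: [N, L] integer token ids; logits: [N, L, V] unnormalized scores.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    n, l = labels.shape
    picked = log_probs[np.arange(n)[:, None], np.arange(l)[None, :], labels]
    return float(-picked.mean())

rng = np.random.default_rng(0)
img = rng.normal(size=(2, 16))            # toy "image" targets
recon_img = rng.normal(size=(2, 16))      # toy decoder output
labels = rng.integers(0, 10, size=(2, 5))  # [N, L] next-token targets
cap_logits = rng.normal(size=(2, 5, 10))   # [N, L, V] caption logits

recon_weight, cap_weight = 1.0, 0.1       # illustrative loss weights
loss = (recon_weight * mse_loss(img, recon_img)
        + cap_weight * ce_loss(labels, cap_logits))
```

Both terms are non-negative, so the combined scalar `loss` is what a training loop would backpropagate through in a real implementation.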
Open Source Code Yes The code is available at https://huggingface.co/spaces/tennant/MUG_caption.
Open Datasets Yes Many self-supervised learning methods are pre-trained on the well-curated ImageNet-1k dataset. ... on a web dataset CC3M (Sharma et al., 2018) for 400 epochs. When transferring the representation to ImageNet-1k (Deng et al., 2009)... Semantic segmentation on ADE20K. We transfer MUG to a semantic segmentation task using the ADE20K dataset (Zhou et al., 2017b). ... We train models on the publicly available CC3M (Sharma et al., 2018) and LAION400M (Schuhmann et al., 2021) datasets. ... In Fig. 4, we provide generated images and captions produced by MUG on the MS-COCO (Lin et al., 2014) and PASCAL-VOC datasets (Everingham et al., 2015).
Dataset Splits Yes When transferring the representation to ImageNet-1k (Deng et al., 2009), we follow the widely used fine-tuning recipe introduced by Bao et al. (2022); He et al. (2022). ... Semantic segmentation on ADE20K. ... training recipes follow the default setting provided in mmsegmentation (Contributors, 2020).
Hardware Specification No The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies No The paper mentions optimizers like AdamW and LARS, and refers to "mmsegmentation (Contributors, 2020)" for ADE20K settings, but it does not provide specific version numbers for software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch), or CUDA.
Experiment Setup Yes A.1 Implementation Details Pre-training. The default setting is in Table 10, and hyper-parameters mainly follow He et al. (2022) for fair comparisons. ... Table 10: Pre-training settings. ... Table 11: End-to-end fine-tuning settings. ... Table 12: Linear probing settings.