Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

Authors: Shraman Pramanick, Li Jing, Sayan Nag, Jiachen Zhu, Hardik J Shah, Yann LeCun, Rama Chellappa

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising the coarse-grained downstream performance, often outperforming methods using significantly more caption and box annotations. Code and pre-trained model are available at https://github.com/ShramanPramanick/VoLTA.
Researcher Affiliation Collaboration 1Johns Hopkins University 2Meta 3University of Toronto 4New York University
Pseudocode Yes The pseudo-code for VoLTA is presented in Appendix A. Algorithm 1 PyTorch-style pseudocode for VoLTA.
Open Source Code Yes Code and pre-trained model are available at https://github.com/ShramanPramanick/VoLTA.
Open Datasets Yes Following Chen et al. (2020d) and Huang et al. (2021), we perform pre-training by appending the VG dataset (Krishna et al., 2017) with COCO2017 (Lin et al., 2014), together consisting of 231k images. We divide our downstream tasks into three categories: (i) Uni-modal tasks such as image classification on ImageNet (Deng et al., 2009), VOC07 (Everingham et al., 2010), COCO; object detection on VOC07+12, COCO, and instance segmentation on COCO. (ii) Multi-modal fine-grained tasks such as region-level VL tasks referring expression comprehension (REC) on RefCOCO, RefCOCO+, RefCOCOg (Kazemzadeh et al., 2014; Yu et al., 2016), and language-conditioned object detection on COCO and LVIS (Gupta et al., 2019). (iii) Multi-modal coarse-grained tasks such as image-level VL tasks visual question answering on VQAv2 (Antol et al., 2015), visual reasoning on NLVR2 (Suhr et al., 2019), image and text retrieval on Flickr30k (Plummer et al., 2015), and captioning on COCO.
Dataset Splits Yes We exclude any overlap between our pre-training and downstream validation/test splits. Several multi-modal downstream tasks are built based on the COCO dataset, where the validation and test splits of these downstream tasks are scattered across the raw COCO splits. Therefore, during pre-training, we carefully selected the portion of the COCO dataset which does not overlap with the validation/test splits of these multi-modal downstream tasks. For VOC07+12, we used the trainval set comprising 16K images for training a Faster R-CNN (Ren et al., 2015) C-4 backbone for 24K iterations.
Hardware Specification Yes We perform pre-training for 20 epochs with 256 batch-size on 64 V100 GPUs.
Software Dependencies No We use ResNet50/Swin-T/Swin-B (He et al., 2016; Liu et al., 2021) as image encoder and RoBERTa (Liu et al., 2019) as text encoder. The training pseudo code for VoLTA is as follows: Algorithm 1 PyTorch-style pseudocode for VoLTA. For training the detection model, the detectron2 library (Wu et al., 2019) has been used.
Experiment Setup Yes We perform pre-training for 20 epochs with a batch size of 256 on 64 V100 GPUs. Following Zbontar et al. (2021), we use the LARS optimizer (You et al., 2017) with a learning rate of 0.2 for the weights and 0.0048 for the biases and batch normalization parameters. We use a learning rate warm-up period of 2 epochs, after which we reduce the learning rate by a factor of 1000 using a cosine decay schedule (Loshchilov & Hutter, 2016). We use 1e-6 weight decay, excluding the biases and batch normalization parameters. We conduct a grid search for the GOT loss hyperparameter (w_GOT), and we empirically found the best value to be 100. Appendix D explains other necessary pre-training and downstream hyper-parameter details.
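The learning-rate schedule quoted above (2-epoch warm-up, then cosine decay reducing the rate by a factor of 1000) can be sketched as a small standalone function. This is a minimal sketch of that schedule only; the function name, linear-warmup choice, and exact interpolation are assumptions, and the LARS optimizer itself is not reproduced here.

```python
import math

def volta_lr(epoch, base_lr=0.2, warmup_epochs=2, total_epochs=20,
             decay_factor=1000.0):
    """Hypothetical sketch of the schedule described in the excerpt:
    linear warm-up for `warmup_epochs`, then cosine decay from `base_lr`
    down to `base_lr / decay_factor` by `total_epochs`."""
    final_lr = base_lr / decay_factor
    if epoch < warmup_epochs:
        # Linear warm-up (interpolation style is an assumption).
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (base_lr - final_lr) * cosine
```

With the paper's settings (base_lr=0.2, 20 epochs), the rate peaks at 0.2 after warm-up and decays to 2e-4 by the end of training; the bias/batch-norm rate of 0.0048 would use the same shape with a different `base_lr`.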