GIT: A Generative Image-to-text Transformer for Vision and Language
Authors: Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. ... Without bells and whistles, our GIT establishes new state of the arts on numerous challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. ... We demonstrate new state-of-the-art performance over numerous tasks on image/video captioning and QA (Table 1), without the dependency on object detectors, object tags, and OCR. On TextCaps, we surpass the human performance for the first time. This implies that a simple network architecture can also achieve strong performance with scaling. ... We comprehensively evaluate the captioning performance on the widely-used Karpathy split (Karpathy & Li, 2015) of COCO (Lin et al., 2014) and Flickr30K (Young et al., 2014), the COCO test set, nocaps (Agrawal et al., 2019) which focuses on novel objects, TextCaps (Sidorov et al., 2020) which focuses on scene-text understanding, and VizWiz-Captions (Gurari et al., 2020). |
| Researcher Affiliation | Industry | Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang (Microsoft Cloud and AI) |
| Pseudocode | No | The paper describes the network architecture and training process using textual descriptions and diagrams (e.g., Figure 2) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like formatting. |
| Open Source Code | No | The paper does not contain an unambiguous statement that the authors are releasing their code for the methodology described, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | We collect 0.8B image-text pairs for pre-training, which include COCO (Lin et al., 2014), Conceptual Captions (CC3M) (Sharma et al., 2018), SBU (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2016), Conceptual Captions (CC12M) (Changpinyo et al., 2021), ALT200M (Hu et al., 2021a), and an extra 0.6B data following a similar collection procedure in Hu et al. (2021a). ... Following the established setup, we evaluate on six standard benchmarks, including ICDAR 2013 (IC13) (Karatzas et al., 2013), ICDAR 2015 (IC15) (Karatzas et al., 2015), IIIT 5K-Words (IIIT) (Mishra et al., 2012), Street View Text (SVT) (Wang et al., 2011), Street View Text-Perspective (SVTP) (Phan et al., 2013), and CUTE80 (CUTE) (Risnumawan et al., 2014). |
| Dataset Splits | Yes | We comprehensively evaluate the captioning performance on the widely-used Karpathy split (Karpathy & Li, 2015) of COCO (Lin et al., 2014) and Flickr30K (Young et al., 2014), the COCO test set, nocaps (Agrawal et al., 2019) which focuses on novel objects, TextCaps (Sidorov et al., 2020) which focuses on scene-text understanding, and VizWiz-Captions (Gurari et al., 2020) which focuses on the real use case by the vision-impaired people. |
| Hardware Specification | No | The paper discusses model size and data scale, and computational resource limitations ('due to computational resource limitation'), but does not provide specific details on the hardware (e.g., GPU models, CPU types) used for experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation or experimentation. |
| Experiment Setup | Yes | The hidden dimension (D) is 768. The text decoder consists of 6 randomly-initialized transformer blocks. The total number of model parameters is 0.7 billion. The learning rates of the image encoder and the decoder are 1e-5 and 5e-5, respectively, and follow the cosine decay to 0. The total number of epochs is 2. During inference, the beam size is 4 and the length penalty (Wu et al., 2016) is 0.6 by default. |
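The Experiment Setup row reports the concrete hyperparameters the paper states: separate base learning rates of 1e-5 (image encoder) and 5e-5 (text decoder), each following cosine decay to 0. A minimal sketch of that schedule, assuming a hypothetical total step count (the paper gives epochs, not steps):

```python
import math

# Hyperparameters quoted from the paper's Experiment Setup row.
ENCODER_LR = 1e-5   # image encoder base learning rate
DECODER_LR = 5e-5   # text decoder base learning rate

def cosine_decay(base_lr: float, step: int, total_steps: int) -> float:
    """Cosine decay from base_lr at step 0 down to 0 at total_steps,
    matching the paper's statement that both rates 'follow the cosine
    decay to 0'."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

# Example with a hypothetical 1000-step training run: at the halfway
# point each rate has decayed to roughly half its base value.
total_steps = 1000
lr_enc_mid = cosine_decay(ENCODER_LR, 500, total_steps)
lr_dec_mid = cosine_decay(DECODER_LR, 500, total_steps)
```

In practice one would attach such a schedule to two optimizer parameter groups (encoder vs. decoder) so each group decays from its own base rate; the paper does not specify warmup, so none is sketched here.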