GIT: A Generative Image-to-text Transformer for Vision and Language
Authors: Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. ... Without bells and whistles, our GIT establishes new state of the arts on numerous challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. ... We demonstrate new state-of-the-art performance over numerous tasks on image/video captioning and QA (Table 1), without the dependency on object detectors, object tags, and OCR. On TextCaps, we surpass the human performance for the first time. This implies that a simple network architecture can also achieve strong performance with scaling. ... We comprehensively evaluate the captioning performance on the widely-used Karpathy split (Karpathy & Li, 2015) of COCO (Lin et al., 2014) and Flickr30K (Young et al., 2014), the COCO test set, nocaps (Agrawal et al., 2019) which focuses on novel objects, TextCaps (Sidorov et al., 2020) which focuses on scene-text understanding, and VizWiz-Captions (Gurari et al., 2020). |
| Researcher Affiliation | Industry | Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang (Microsoft Cloud and AI) |
| Pseudocode | No | The paper describes the network architecture and training process using textual descriptions and diagrams (e.g., Figure 2) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like formatting. |
| Open Source Code | No | The paper does not contain an unambiguous statement that the authors are releasing their code for the methodology described, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | We collect 0.8B image-text pairs for pre-training, which include COCO (Lin et al., 2014), Conceptual Captions (CC3M) (Sharma et al., 2018), SBU (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2016), Conceptual Captions (CC12M) (Changpinyo et al., 2021), ALT200M (Hu et al., 2021a), and an extra 0.6B data following a similar collection procedure in Hu et al. (2021a). ... Following the established setup, we evaluate on six standard benchmarks, including ICDAR 2013 (IC13) (Karatzas et al., 2013), ICDAR 2015 (IC15) (Karatzas et al., 2015), IIIT 5K-Words (IIIT) (Mishra et al., 2012), Street View Text (SVT) (Wang et al., 2011), Street View Text-Perspective (SVTP) (Phan et al., 2013), and CUTE80 (CUTE) (Risnumawan et al., 2014). |
| Dataset Splits | Yes | We comprehensively evaluate the captioning performance on the widely-used Karpathy split (Karpathy & Li, 2015) of COCO (Lin et al., 2014) and Flickr30K (Young et al., 2014), the COCO test set, nocaps (Agrawal et al., 2019) which focuses on novel objects, TextCaps (Sidorov et al., 2020) which focuses on scene-text understanding, and VizWiz-Captions (Gurari et al., 2020) which focuses on the real use case by the vision-impaired people. |
| Hardware Specification | No | The paper discusses model size and data scale, and computational resource limitations ('due to computational resource limitation'), but does not provide specific details on the hardware (e.g., GPU models, CPU types) used for experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation or experimentation. |
| Experiment Setup | Yes | The hidden dimension (D) is 768. The text decoder consists of 6 randomly-initialized transformer blocks. The total number of model parameters is 0.7 billion. The learning rates of the image encoder and the decoder are 1e-5 and 5e-5, respectively, and follow the cosine decay to 0. The total number of epochs is 2. During inference, the beam size is 4 and the length penalty (Wu et al., 2016) is 0.6 by default. |
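The Experiment Setup row reports the concrete hyperparameters the paper states: separate base learning rates of 1e-5 (image encoder) and 5e-5 (text decoder), each following cosine decay to 0. A minimal sketch of that schedule, assuming a hypothetical total step count (the paper gives epochs, not steps):

```python
import math

# Hyperparameters quoted from the paper's Experiment Setup row.
ENCODER_LR = 1e-5   # image encoder base learning rate
DECODER_LR = 5e-5   # text decoder base learning rate

def cosine_decay(base_lr: float, step: int, total_steps: int) -> float:
    """Cosine decay from base_lr at step 0 down to 0 at total_steps,
    matching the paper's statement that both rates 'follow the cosine
    decay to 0'."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

# Example with a hypothetical 1000-step training run: at the halfway
# point each rate has decayed to roughly half its base value.
total_steps = 1000
lr_enc_mid = cosine_decay(ENCODER_LR, 500, total_steps)
lr_dec_mid = cosine_decay(DECODER_LR, 500, total_steps)
```

In practice one would attach such a schedule to two optimizer parameter groups (encoder vs. decoder) so each group decays from its own base rate; the paper does not specify warmup, so none is sketched here.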