Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge

Authors: Pierre Dognin, Igor Melnyk, Youssef Mroueh, Inkit Padhi, Mattia Rigotti, Jarret Ross, Yair Schiff, Richard A. Young, Brian Belgodere

JAIR 2022

Reproducibility Variables: Result and Supporting LLM Response
Research Type: Experimental. Evidence: "This work details the theory and engineering from our winning submission to the 2020 captioning competition. Our work provides a step towards improved assistive image captioning systems. ... Finally, we give experimental details and extensive ablation studies on the 2020 VizWiz Grand Challenge and the competition results in Section 5."
Researcher Affiliation: Industry. Evidence: "Pierre Dognin, Igor Melnyk, Youssef Mroueh, Inkit Padhi, Mattia Rigotti, Jarret Ross, Yair Schiff: IBM Research AI, T.J. Watson Research Center, Yorktown Heights, NY, USA. Richard A. Young: IBM Research South Africa, Johannesburg, South Africa. Brian Belgodere: IBM Research, T.J. Watson Research Center, Yorktown Heights, NY, USA."
Pseudocode: No. The paper includes architectural diagrams (Figure 2 and Figure 3) and describes the methodology in prose, but does not contain any structured pseudocode or algorithm blocks.
Open Source Code: Yes. Evidence: "We also visualize the image, caption, objects and the words detected by the OCR to the screen (see the video of the real time demo on https://github.com/IBM/IBMVizWiz)."
Open Datasets: Yes. Evidence: "Image captioning has recently demonstrated impressive progress largely owing to the introduction of neural network algorithms trained on curated datasets like MS-COCO. ... This gap motivated the introduction of the novel VizWiz dataset... We use the VizWiz-Captions dataset for all our experiments."
Dataset Splits: Yes. Evidence: Table 1 (VizWiz-Captions dataset information):
  Training: 23,431 images, 117,155 captions
  Validation: 7,750 images, 38,750 captions
  Testing: 8,000 images, 40,000 captions
Hardware Specification: No. The paper mentions GPUs in the context of the real-time demo pipeline ("sent to the first GPU in the pipeline", "sent to the second GPU") and also "cloud machines", but does not provide specific models or specifications for this hardware.
Software Dependencies: No. The paper mentions using a BERT tokenizer, fastText (Bojanowski et al., 2016), the ADAM optimizer, and flask (Grinberg, 2018), but it does not specify version numbers for these software components or libraries.
Experiment Setup: Yes. Evidence: "CE Training: The CE training is run for 10 epochs, using a batch size of 80. We employ the ADAM optimizer with (β1, β2) = (0.9, 0.98). We warm up the learning rate with a factor of 1 for 2000 iterations (minibatches) and then decay it proportionally to 1/i, where i is the iteration step. ... We used a batch size of 80 images/captions as well as the states learned in the ADAM optimizer during CE training."
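The warmup-then-1/i decay quoted above can be sketched as a small schedule function. This is a minimal sketch only: the paper states a warmup factor of 1 for 2000 iterations followed by 1/i decay, but the exact warmup shape (assumed linear here) and the base learning rate are assumptions, not details given in the quoted setup.

```python
def lr_multiplier(step, warmup=2000, factor=1.0):
    """Learning-rate multiplier for iteration `step` (1-indexed).

    Assumed linear warmup to `factor` over the first `warmup`
    iterations, then decay proportional to 1/step, matching the
    1/i decay described in the quoted CE-training setup.
    """
    step = max(step, 1)  # guard against division by zero at step 0
    return factor * min(step / warmup, warmup / step)
```

In a typical training loop this multiplier would scale a base learning rate each minibatch (e.g. via a per-step scheduler such as PyTorch's LambdaLR), with the ADAM betas set to (0.9, 0.98) as quoted.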