Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge
Authors: Pierre Dognin, Igor Melnyk, Youssef Mroueh, Inkit Padhi, Mattia Rigotti, Jarret Ross, Yair Schiff, Richard A. Young, Brian Belgodere
JAIR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work details the theory and engineering from our winning submission to the 2020 captioning competition. Our work provides a step towards improved assistive image captioning systems. ... Finally, we give experimental details and extensive ablation studies on the 2020 VizWiz Grand Challenge and the competition results in Section 5. |
| Researcher Affiliation | Industry | Pierre Dognin EMAIL Igor Melnyk EMAIL Youssef Mroueh EMAIL Inkit Padhi EMAIL Mattia Rigotti EMAIL Jarret Ross EMAIL Yair Schiff EMAIL IBM Research AI, T.J. Watson Research Center, Yorktown Heights, NY, USA. Richard A. Young EMAIL IBM Research South Africa, Johannesburg, South Africa. Brian Belgodere EMAIL IBM Research, T.J. Watson Research Center, Yorktown Heights, NY, USA. |
| Pseudocode | No | The paper includes architectural diagrams (Figure 2 and Figure 3) and describes the methodology in prose, but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We also visualize the image, caption, objects and the words detected by the OCR to the screen (see the video of the real time demo on https://github.com/IBM/IBMVizWiz ). |
| Open Datasets | Yes | Image captioning has recently demonstrated impressive progress largely owing to the introduction of neural network algorithms trained on curated datasets like MS-COCO. ... This gap motivated the introduction of the novel VizWiz dataset... We use the VizWiz Captions dataset for all our experiments. |
| Dataset Splits | Yes | Table 1: VizWiz Captions dataset information. Training: 23,431 images / 117,155 captions; Validation: 7,750 images / 38,750 captions; Testing: 8,000 images / 40,000 captions (five captions per image). |
| Hardware Specification | No | The paper mentions GPUs in the context of the real-time demo pipeline ("sent to the first GPU in the pipeline", "sent to the second GPU") and also "cloud machines", but does not provide specific models or specifications for these hardware components. |
| Software Dependencies | No | The paper mentions using a BERT tokenizer, fastText (Bojanowski et al., 2016), the ADAM optimizer, and flask (Grinberg, 2018), but it does not specify version numbers for these software components or libraries. |
| Experiment Setup | Yes | CE Training The CE training is run for 10 epochs, using a batch size of 80. We employ SGD with the ADAM optimizer with (β1, β2) = (0.9, 0.98). We warm up the learning rate with a factor of 1 for 2000 iterations (minibatches) and then decay it proportionally to 1/i, where i is the iteration step. ... We used a batch size of 80 images/captions as well as the states learned in the ADAM optimizer during CE training. |
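The CE-training schedule quoted above (2000-iteration warmup with factor 1, then decay proportional to 1/i) can be sketched as a learning-rate multiplier. This is a minimal sketch, not the authors' code: the base factor, the linear warmup shape, and the function name `lr_factor` are assumptions; the paper specifies only the warmup length, the 1/i decay, and the ADAM betas (0.9, 0.98).

```python
def lr_factor(step, warmup=2000, base=1.0):
    """Assumed learning-rate multiplier at iteration `step` (1-indexed).

    Linear warmup to `base` over `warmup` iterations, then decay
    proportional to 1/step, matching the schedule described in the
    Experiment Setup row.
    """
    if step <= warmup:
        return base * step / warmup   # warmup phase
    return base * warmup / step       # 1/i decay phase


# Illustrative values: halfway through warmup, at the peak, and after decay.
print(lr_factor(1000))  # 0.5
print(lr_factor(2000))  # 1.0
print(lr_factor(4000))  # 0.5
```

In a PyTorch setup this multiplier would typically be wired in via `torch.optim.lr_scheduler.LambdaLR` on top of `torch.optim.Adam(params, betas=(0.9, 0.98))`, but the paper does not name the framework used.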