Multi-Modal Answer Validation for Knowledge-Based VQA

Authors: Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi

AAAI 2022, pp. 2712-2721

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments with OK-VQA, a challenging knowledge-based VQA dataset, demonstrate that MAVEx achieves new state-of-the-art results.
Researcher Affiliation | Collaboration | Jialin Wu (1), Jiasen Lu (2), Ashish Sabharwal (2), Roozbeh Mottaghi (2); (1) The University of Texas at Austin, (2) Allen Institute for AI
Pseudocode | No | The paper describes the framework steps in prose but does not include an explicitly labeled 'Pseudocode' or 'Algorithm' block. (A hedged sketch of those steps follows the table.)
Open Source Code | Yes | Our code is available at https://github.com/jialinwu17/MAVEX
Open Datasets | Yes | We evaluate MAVEx on OK-VQA (Marino et al. 2019), the largest knowledge-based VQA dataset to date.
Dataset Splits | Yes | The dataset contains 14,031 images and 14,055 questions... We use the finetuned model to extract the top 5 answers for each question in the training and test set.
Hardware Specification | Yes | We use PyTorch 1.4 on a single TITAN V GPU with 12 GB memory for each run, and it generally costs 22 hours to train a single model. (See the environment check after the table.)
Software Dependencies | No | The paper mentions 'PyTorch 1.4' but does not provide version numbers for other significant software dependencies such as AllenNLP, the T5 model, Mask R-CNN, or the specific BERT/TinyBERT implementations used.
Experiment Setup | Yes | We finetune the ViLBERT-multi-task model on OK-VQA using the default configuration for 150 epochs for answer candidate generation... We train the system for 75 epochs using a learning rate of 2e-5 for the ViLBERT parameters and 5e-5 for the additional parameters introduced in the validation module... The number of hidden units in the multi-head attention modules is set to 512. (See the optimizer sketch after the table.)
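
Because the paper presents the framework only in prose (see the Pseudocode row above), here is a minimal Python sketch of the described pipeline: a finetuned ViLBERT model proposes top-5 answer candidates, and a validation module scores each one. Every function name and the toy vocabulary below are placeholders invented for illustration, not the authors' API.

```python
import random

def generate_candidates(question, image, top_k=5):
    """Placeholder for the finetuned ViLBERT-multi-task model, which the
    paper uses to extract the top-5 answer candidates per question."""
    toy_vocabulary = ["surfing", "skiing", "kayaking", "rowing", "diving"]
    return toy_vocabulary[:top_k]

def validation_score(question, image, answer):
    """Placeholder for MAVEx's validation module, which scores each
    candidate against the knowledge retrieved for that answer."""
    return random.random()

def mavex_answer(question, image):
    # Step 1: generate answer candidates with the finetuned VQA model.
    candidates = generate_candidates(question, image, top_k=5)
    # Step 2: score every candidate with the answer-validation module.
    scores = {a: validation_score(question, image, a) for a in candidates}
    # Step 3: return the candidate with the strongest support.
    return max(scores, key=scores.get)

print(mavex_answer("What sport can you do at this beach?", image=None))
```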
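
For the Hardware Specification row, a quick sanity check of the environment is possible using only standard `torch.cuda` calls; the "TITAN V" and "12 GB" values in the comments are what the paper reports, not anything this snippet enforces.

```python
import torch

# Verify the environment against the reported setup:
# PyTorch 1.4 on a single TITAN V with 12 GB of memory.
print("PyTorch:", torch.__version__)  # paper reports 1.4
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("GPU:", props.name)                                # expected: TITAN V
    print(f"Memory: {props.total_memory / 1024**3:.1f} GB")  # expected: ~12 GB
```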
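
The two-learning-rate schedule in the Experiment Setup row maps naturally onto PyTorch parameter groups. The sketch below is an assumption-laden illustration: the modules are tiny stand-ins for the real ViLBERT backbone and validation module, and the optimizer choice (AdamW) and head count are guesses, since the paper only fixes the learning rates, the epoch count, and the 512 hidden units.

```python
import torch
import torch.nn as nn

# Stand-ins for the real networks: ViLBERT is a large two-stream
# transformer, and the validation module uses multi-head attention
# with 512 hidden units (num_heads=8 is an assumption).
vilbert_backbone = nn.Linear(768, 512)
validation_module = nn.MultiheadAttention(embed_dim=512, num_heads=8)

# One optimizer, two parameter groups with the learning rates the
# paper reports: 2e-5 for ViLBERT, 5e-5 for the added parameters.
optimizer = torch.optim.AdamW([
    {"params": vilbert_backbone.parameters(), "lr": 2e-5},
    {"params": validation_module.parameters(), "lr": 5e-5},
])

for epoch in range(75):  # the paper trains the full system for 75 epochs
    optimizer.zero_grad()
    # ... forward pass and loss.backward() would go here ...
    optimizer.step()
```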