Measuring CLEVRness: Black-box Testing of Visual Reasoning Models

Authors: Spyridon Mouselinos, Henryk Michalewski, Mateusz Malinowski

ICLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We show that CLEVR models, which otherwise could perform at a human level, can easily be fooled by our agent. Our results put in doubt whether data-driven approaches can do reasoning without exploiting the numerous biases that are often present in those datasets. Finally, we also propose a controlled experiment measuring the efficiency of such models to learn and perform reasoning.
Researcher Affiliation Collaboration Spyridon Mouselinos University of Warsaw Warsaw, Poland EMAIL Henryk Michalewski University of Warsaw, Google Oxford, U.K. EMAIL Mateusz Malinowski DeepMind London, U.K. EMAIL
Pseudocode Yes A.5 ALGORITHMS We show pseudo-algorithms that we use to (Algorithm 1) calculate rewards, (Algorithm 2) train Adversarial Player, and (Algorithm 3) play a game.
Open Source Code No Table 3 shows the URLs to models used in our investigations (also Table 1 in the main paper). We also report if we re-trained a model from scratch (type Architecture) or used already trained models (type Model). Please note that the latter type proves that our testing procedure is fully black-box.
Open Datasets Yes CLEVR is a synthetic visual question answering dataset introduced by Johnson et al. (2017a), which consists of about 700k training and 150k validation image-question-answer triplets.
Dataset Splits Yes CLEVR is a synthetic visual question answering dataset introduced by Johnson et al. (2017a), which consists of about 700k training and 150k validation image-question-answer triplets.
Hardware Specification No All experiments were performed using the Entropy cluster funded by NVIDIA, Intel, the Polish National Science Center grant UMO-2017/26/E/ST6/00622 and ERC Starting Grant TOTAL.
Software Dependencies Yes For the generation of new images/scenes we use the open-source Blender Graphics Engine (v2.79b), and the original 3D models of the CLEVR dataset.
Experiment Setup Yes We discretize the scene where each axis has values in [-3, 3] onto N = 7 bins per axis. [...] We use the following values: dr = 1, cr = 0.1, fr = 0.1, isr = 0.8. [...] To train Adversarial Player we use the A2C algorithm with the episode length set to one... We experiment with the following Mini-game sizes: 10, 100, 1000.
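
The discretization quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It assumes the N = 7 bins are evenly spaced and include both endpoints of [-3, 3]; the excerpt does not say how the bins are placed, and the function names `bin_values` and `to_bin` are hypothetical, not taken from the paper's code.

```python
def bin_values(low=-3.0, high=3.0, n_bins=7):
    """Return the coordinate value of each bin along one axis.

    Assumption: bins are evenly spaced and include both endpoints.
    """
    step = (high - low) / (n_bins - 1)
    return [low + i * step for i in range(n_bins)]


def to_bin(x, low=-3.0, high=3.0, n_bins=7):
    """Map a continuous coordinate to the index of the nearest bin."""
    x = min(max(x, low), high)  # clamp coordinates outside the scene
    step = (high - low) / (n_bins - 1)
    return int(round((x - low) / step))


if __name__ == "__main__":
    print(bin_values())
    print(to_bin(0.4), to_bin(-3.0), to_bin(5.0))
```

Under these assumptions each axis offers seven candidate positions one unit apart, so a discrete action index maps directly to a scene coordinate.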