reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Measuring CLEVRness: Black-box Testing of Visual Reasoning Models

Authors: Spyridon Mouselinos, Henryk Michalewski, Mateusz Malinowski

ICLR 2022

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show that CLEVR models, which otherwise could perform at a human level, can easily be fooled by our agent. Our results put in doubt whether data-driven approaches can do reasoning without exploiting the numerous biases that are often present in those datasets. Finally, we also propose a controlled experiment measuring the efficiency of such models to learn and perform reasoning.
Researcher Affiliation	Collaboration	Spyridon Mouselinos, University of Warsaw, Warsaw, Poland; Henryk Michalewski, University of Warsaw, Google, Oxford, U.K.; Mateusz Malinowski, DeepMind, London, U.K.
Pseudocode	Yes	A.5 ALGORITHMS We show pseudo-algorithms that we use to (Algorithm 1) calculate rewards, (Algorithm 2) train Adversarial Player, and (Algorithm 3) play a game.
Open Source Code	No	Table 3 shows the URLs to models used in our investigations (also Table 1 in the main paper). We also report if we re-trained a model from scratch (type Architecture) or used already trained models (type Model). Please note that the latter type proves that our testing procedure is fully black-box.
Open Datasets	Yes	CLEVR is a synthetic visual question answering dataset introduced by Johnson et al. (2017a), which consists of about 700k training and 150k validation image-question-answer triplets.
Dataset Splits	Yes	CLEVR is a synthetic visual question answering dataset introduced by Johnson et al. (2017a), which consists of about 700k training and 150k validation image-question-answer triplets.
Hardware Specification	No	All experiments were performed using the Entropy cluster funded by NVIDIA, Intel, the Polish National Science Center grant UMO-2017/26/E/ST6/00622 and ERC Starting Grant TOTAL.
Software Dependencies	Yes	For the generation of new images/scenes we use the open-source Blender Graphics Engine (v2.79b), and the original 3D models of the CLEVR dataset.
Experiment Setup	Yes	We discretize the scene where each axis has values in [-3, 3] onto N = 7 bins per axis. [...] We use the following values: dr = 1, cr = 0.1, fr = 0.1, isr = 0.8. [...] To train Adversarial Player we use the A2C algorithm with the episode length set to one... We experiment with the following Mini-game sizes 10, 100, 1000.
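
The discretization quoted in the Experiment Setup row can be illustrated with a minimal sketch. It assumes equal-width bins over the closed interval [-3, 3] with N = 7 bins per axis; the paper excerpt does not state how bin edges are placed, so the edge placement and the helper names `to_bin` and `bin_center` are hypothetical and are not taken from the authors' code.

```python
# Minimal sketch, assuming equal-width bins; not the authors' implementation.
LOW, HIGH, N_BINS = -3.0, 3.0, 7  # axis range and bin count quoted in the table


def to_bin(x, low=LOW, high=HIGH, n_bins=N_BINS):
    """Map a coordinate in [low, high] to a bin index in 0..n_bins-1."""
    if not low <= x <= high:
        raise ValueError(f"coordinate {x} lies outside [{low}, {high}]")
    width = (high - low) / n_bins
    # The right endpoint would land in bin n_bins, so clamp it to the last bin.
    return min(int((x - low) / width), n_bins - 1)


def bin_center(index, low=LOW, high=HIGH, n_bins=N_BINS):
    """Return the midpoint of a bin (hypothetical helper for placing objects)."""
    if not 0 <= index < n_bins:
        raise ValueError(f"bin index {index} lies outside 0..{n_bins - 1}")
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


def discretize_position(x, y):
    """Discretize a 2D scene position into a pair of bin indices."""
    return to_bin(x), to_bin(y)
```

Under these assumptions, a scene position such as (0.0, 3.0) maps to the bin pair (3, 6), and the centre of bin 3 is the origin of the axis.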