Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

Authors: Aleksandar Stanić, Sergi Caelles, Michael Tschannen

TMLR 2024

Below, each reproducibility variable is listed with its result and the supporting LLM response.
Research Type: Experimental. LLM response: "In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes the LLMs-as-controllers setup more robust, and removes the need for human engineering of in-context examples. [...] Tasks. We evaluate our method on four datasets: RefCOCO, RefCOCO+ (Yu et al., 2016), GQA (Hudson & Manning, 2019) and NExT-QA (Xiao et al., 2021) used in previous work (Surís et al., 2023)."
Researcher Affiliation: Industry. LLM response: "Aleksandar Stanić (EMAIL), Google DeepMind; Sergi Caelles (EMAIL), Google Research; Michael Tschannen (EMAIL), Google DeepMind."
Pseudocode: Yes. LLM response: "A.4 Prompt listings; A.4.1 RefCOCO and GQA ViperGPT API: `import math ... class ImagePatch: [...]`"
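The ViperGPT-style `ImagePatch` abstraction named in the prompt listing can be sketched as follows. This is a minimal illustrative stub, not the paper's actual API: the field names and the `find` method signature are assumptions based on the fragment above, and a real implementation would back `find` with an open-vocabulary detector such as OWLv2.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class ImagePatch:
    """Minimal sketch of a ViperGPT-style image patch.

    Wraps an image crop given by pixel coordinates; the real API exposes
    vision-model calls (detection, depth, VQA) as methods on this class.
    """
    image: Any       # underlying image or crop
    left: int = 0
    lower: int = 0
    right: int = 0
    upper: int = 0

    def find(self, object_name: str) -> List["ImagePatch"]:
        """Return patches for detected instances of `object_name`.

        Stub: a real implementation would query a detector (e.g. OWLv2)
        and return one ImagePatch per detected bounding box.
        """
        return []
```

A generated program would then compose such calls, e.g. `patch.find("cat")` followed by spatial comparisons between the returned patches.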
Open Source Code: No. LLM response: "The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a link to a code repository for their work. It only mentions using the official GitHub repository of a baseline model (ViperGPT)."
Open Datasets: Yes. LLM response: "Tasks. We evaluate our method on four datasets: RefCOCO, RefCOCO+ (Yu et al., 2016), GQA (Hudson & Manning, 2019) and NExT-QA (Xiao et al., 2021) used in previous work (Surís et al., 2023)."
Dataset Splits: Yes. LLM response: "We use the test-dev split of the GQA dataset, as in ViperGPT. In NExT-QA, the task is to answer a temporally compositional question by selecting one of the given multiple-choice options. As in ViperGPT, we use the NExT-QA hard split (Buch et al., 2022). On RefCOCO, we sample 100 few-shot random samples from the training set, run the zero-shot framework on them, sort the resulting programs by their IoU, and select the top 16 programs."
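The in-context example selection recipe quoted above (sample 100 training examples, run the zero-shot framework, score the generated programs by IoU, keep the top 16) can be sketched in Python. Here `run_zero_shot` and `compute_iou` are hypothetical stand-ins for the framework's program generator and its IoU scorer; only the sample/score/sort/truncate logic follows the quoted description.

```python
import random

def select_incontext_programs(train_set, run_zero_shot, compute_iou,
                              n_samples=100, top_k=16, seed=0):
    """Select in-context example programs by IoU on few-shot samples.

    train_set     -- list of examples, each with a ground-truth box
    run_zero_shot -- callable: example -> (generated_program, predicted_box)
    compute_iou   -- callable: (predicted_box, gt_box) -> float in [0, 1]
    """
    rng = random.Random(seed)
    samples = rng.sample(train_set, n_samples)

    # Run the zero-shot framework on each sample and score its program.
    scored = []
    for sample in samples:
        program, prediction = run_zero_shot(sample)
        scored.append((compute_iou(prediction, sample["gt_box"]), program))

    # Sort programs by IoU (descending) and keep the best top_k.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [program for _, program in scored[:top_k]]
```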
Hardware Specification: No. LLM response: "The paper states: 'For code generation, we use a code instruction-tuned version of PaLM 2 (Anil et al., 2023), code-bison, accessible via the Google Cloud API (Google, 2023).' This indicates usage of a cloud API but does not specify the underlying hardware (e.g., GPU/TPU models) used for running the experiments."
Software Dependencies: No. LLM response: "The paper mentions several models and APIs used (e.g., OWLv2, MiDaS from the PyTorch hub, the Google Cloud Vertex AI API for code-bison, the OpenAI Python API for GPT-3.5-turbo), but it does not specify concrete version numbers for the programming language (Python), libraries (e.g., PyTorch version), or client APIs required for replication."
Experiment Setup: Yes. LLM response: "In Table 3, we report the scores for different code-bison LLM temperatures: 0, 0.4, 0.8 and 1.0. We found the deterministic case to underperform compared to the cases with a temperature higher than zero. [...] Early in our work, we settled on a code-bison LLM temperature of 0.4 and did not tune it further. Table 4 shows the effect of using different thresholds for the OWLv2 open-vocabulary detector. [...] On both datasets, the threshold of 0.1 achieves the best results, so by default we use this threshold in all our experiments."
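The tuning described above (sweeping the code-bison sampling temperature and the OWLv2 detection threshold, then fixing the best values) amounts to a small grid search over a validation metric. A minimal sketch, where `evaluate` is a hypothetical callable that runs the full pipeline at the given settings and returns a score; the specific grids shown are the values reported in the quote:

```python
from itertools import product

def grid_search(evaluate,
                temperatures=(0.0, 0.4, 0.8, 1.0),
                thresholds=(0.1, 0.2, 0.3)):
    """Return (best_score, best_temperature, best_threshold).

    evaluate -- callable(temperature=..., detection_threshold=...) -> float
    """
    best = None
    for temp, thr in product(temperatures, thresholds):
        score = evaluate(temperature=temp, detection_threshold=thr)
        # Keep the configuration with the highest validation score.
        if best is None or score > best[0]:
            best = (score, temp, thr)
    return best
```

In the paper's setting the chosen values (temperature 0.4, threshold 0.1) were fixed early rather than exhaustively re-tuned per dataset.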