Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

Authors: Aleksandar Stanić, Sergi Caelles, Michael Tschannen

TMLR 2024

Below, each reproducibility variable is listed with its result and the supporting LLM response.
Research Type: Experimental. LLM response: "In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes the LLMs-as-controllers setup more robust, and removes the need for human engineering of in-context examples. [...] Tasks. We evaluate our method on four datasets: RefCOCO, RefCOCO+ (Yu et al., 2016), GQA (Hudson & Manning, 2019) and NExT-QA (Xiao et al., 2021) used in previous work (Surís et al., 2023)."
Researcher Affiliation: Industry. LLM response: "Aleksandar Stanić (EMAIL), Google DeepMind; Sergi Caelles (EMAIL), Google Research; Michael Tschannen (EMAIL), Google DeepMind."
Pseudocode: Yes. LLM response: "A.4 Prompt listings; A.4.1 RefCOCO and GQA ViperGPT API: `import math ... class ImagePatch: [...]`"
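The ViperGPT-style `ImagePatch` abstraction named in the prompt listing can be sketched as follows. This is a minimal illustrative stub, not the paper's actual API: the field names and the `find` method signature are assumptions based on the fragment above, and a real implementation would back `find` with an open-vocabulary detector such as OWLv2.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class ImagePatch:
    """Minimal sketch of a ViperGPT-style image patch.

    Wraps an image crop given by pixel coordinates; the real API exposes
    vision-model calls (detection, depth, VQA) as methods on this class.
    """
    image: Any       # underlying image or crop
    left: int = 0
    lower: int = 0
    right: int = 0
    upper: int = 0

    def find(self, object_name: str) -> List["ImagePatch"]:
        """Return patches for detected instances of `object_name`.

        Stub: a real implementation would query a detector (e.g. OWLv2)
        and return one ImagePatch per detected bounding box.
        """
        return []
```

A generated program would then compose such calls, e.g. `patch.find("cat")` followed by spatial comparisons between the returned patches.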
Open Source Code: No. LLM response: "The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a link to a code repository for their work. It only mentions using the official GitHub repository of a baseline model (ViperGPT)."
Open Datasets: Yes. LLM response: "Tasks. We evaluate our method on four datasets: RefCOCO, RefCOCO+ (Yu et al., 2016), GQA (Hudson & Manning, 2019) and NExT-QA (Xiao et al., 2021) used in previous work (Surís et al., 2023)."
Dataset Splits: Yes. LLM response: "We use the test-dev split of the GQA dataset, as in ViperGPT. In NExT-QA, the task is to answer a temporally compositional question by selecting one of the given multiple-choice options. As in ViperGPT, we use the NExT-QA hard split (Buch et al., 2022). On RefCOCO, we sample 100 few-shot random samples from the training set, run the zero-shot framework on them, sort the resulting programs by their IoU, and select the top 16 programs."
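The in-context example selection recipe quoted above (sample 100 training examples, run the zero-shot framework, score the generated programs by IoU, keep the top 16) can be sketched in Python. Here `run_zero_shot` and `compute_iou` are hypothetical stand-ins for the framework's program generator and its IoU scorer; only the sample/score/sort/truncate logic follows the quoted description.

```python
import random

def select_incontext_programs(train_set, run_zero_shot, compute_iou,
                              n_samples=100, top_k=16, seed=0):
    """Select in-context example programs by IoU on few-shot samples.

    train_set     -- list of examples, each with a ground-truth box
    run_zero_shot -- callable: example -> (generated_program, predicted_box)
    compute_iou   -- callable: (predicted_box, gt_box) -> float in [0, 1]
    """
    rng = random.Random(seed)
    samples = rng.sample(train_set, n_samples)

    # Run the zero-shot framework on each sample and score its program.
    scored = []
    for sample in samples:
        program, prediction = run_zero_shot(sample)
        scored.append((compute_iou(prediction, sample["gt_box"]), program))

    # Sort programs by IoU (descending) and keep the best top_k.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [program for _, program in scored[:top_k]]
```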
Hardware Specification: No. LLM response: "The paper states: 'For code generation, we use a code instruction-tuned version of PaLM 2 (Anil et al., 2023), code-bison, accessible via the Google Cloud API (Google, 2023).' This indicates usage of a cloud API but does not specify the underlying hardware (e.g., GPU/TPU models) used for running the experiments."
Software Dependencies: No. LLM response: "The paper mentions several models and APIs used (e.g., OWLv2, MiDaS from the PyTorch hub, the Google Cloud Vertex AI API for code-bison, the OpenAI Python API for GPT-3.5-turbo), but it does not specify concrete version numbers for the programming language (Python), libraries (e.g., PyTorch version), or client APIs required for replication."
Experiment Setup: Yes. LLM response: "In Table 3, we report the scores for different code-bison LLM temperatures: 0, 0.4, 0.8 and 1.0. We found the deterministic case to underperform compared to the cases with a temperature higher than zero. [...] Early in our work, we settled on a code-bison LLM temperature of 0.4 and did not tune it further. Table 4 shows the effect of using different thresholds for the OWLv2 open-vocabulary detector. [...] On both datasets, the threshold of 0.1 achieves the best results, so by default we use this threshold in all our experiments."
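The tuning described above (sweeping the code-bison sampling temperature and the OWLv2 detection threshold, then fixing the best values) amounts to a small grid search over a validation metric. A minimal sketch, where `evaluate` is a hypothetical callable that runs the full pipeline at the given settings and returns a score; the specific grids shown are the values reported in the quote:

```python
from itertools import product

def grid_search(evaluate,
                temperatures=(0.0, 0.4, 0.8, 1.0),
                thresholds=(0.1, 0.2, 0.3)):
    """Return (best_score, best_temperature, best_threshold).

    evaluate -- callable(temperature=..., detection_threshold=...) -> float
    """
    best = None
    for temp, thr in product(temperatures, thresholds):
        score = evaluate(temperature=temp, detection_threshold=thr)
        # Keep the configuration with the highest validation score.
        if best is None or score > best[0]:
            best = (score, temp, thr)
    return best
```

In the paper's setting the chosen values (temperature 0.4, threshold 0.1) were fixed early rather than exhaustively re-tuned per dataset.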