What Makes a Maze Look Like a Maze?

Authors: Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Noah D. Goodman, Jiajun Wu

ICLR 2025

Reproducibility assessment (each variable below lists the result and the supporting LLM response):
Research Type: Experimental. "We systematically evaluate DSG and different methods in reasoning on our new Visual Abstractions Benchmark, which consists of diverse, real-world images of abstract concepts and corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions. We evaluate Deep Schema Grounding on the Visual Abstractions Benchmark, and show that DSG consistently improves performance of vision-language models across question types, abstract concept categories, and base models."
Researcher Affiliation: Academia. Joy Hsu (Stanford University, EMAIL), Jiayuan Mao (MIT, EMAIL), Joshua B. Tenenbaum (MIT, EMAIL), Noah D. Goodman (Stanford University, EMAIL), Jiajun Wu (Stanford University, EMAIL).
Pseudocode: Yes. "A visual abstraction schema is a concise program that defines a directed acyclic graph (DAG) representation of a particular concept. As illustrated in Figure 2, each node in the schema corresponds to a subcomponent concept of the higher-level abstract concept. For example, the formation of a maze can be decomposed into three components: the layout, the construction of the walls, and the positioning of the entry and exit of the maze. The dependencies among individual components yield a DAG configuration; in this case, the placement of the entry and the exit of the maze depends on the layout of the maze. (Figure 2: gen(concept=maze) = gen(layout | concept=maze), gen(walls | concept=maze), gen(entry-exit | concept=maze, layout))"
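The schema-as-DAG idea quoted above can be sketched in a few lines of Python. The `SchemaNode` structure and `grounding_order` helper below are illustrative assumptions, not the authors' implementation; only the maze decomposition (entry-exit depends on layout) comes from the paper's Figure 2.

```python
# Minimal sketch of a visual abstraction schema as a DAG of component
# concepts. All names here are illustrative, not the paper's actual code.
from dataclasses import dataclass, field


@dataclass
class SchemaNode:
    name: str
    depends_on: list = field(default_factory=list)  # parent components


# Maze schema from Figure 2: entry-exit depends on layout.
maze_schema = {
    "layout": SchemaNode("layout"),
    "walls": SchemaNode("walls"),
    "entry-exit": SchemaNode("entry-exit", depends_on=["layout"]),
}


def grounding_order(schema):
    """Topologically sort components so each comes after its dependencies."""
    order, visited = [], set()

    def visit(name):
        if name in visited:
            return
        visited.add(name)
        for dep in schema[name].depends_on:
            visit(dep)
        order.append(name)

    for name in schema:
        visit(name)
    return order


print(grounding_order(maze_schema))  # "layout" precedes "entry-exit"
```

The topological order is what makes hierarchical grounding well-defined: a dependent component is only resolved once its parents are.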
Open Source Code: No. The paper explicitly states the release of the Visual Abstractions Benchmark dataset, but there is no explicit statement or link for open-sourcing the code of the Deep Schema Grounding (DSG) method itself.
Open Datasets: Yes. "To investigate the capabilities of models in understanding visual abstractions, we introduce the Visual Abstractions Benchmark (VAB). VAB is a visual question-answering benchmark that consists of diverse, real-world images representing abstract concepts. ... We present examples for each type of question along with corresponding images and answers in Appendix B, and release our benchmark here."
Dataset Splits: Yes. "VAB is a visual question-answering benchmark that consists of diverse, real-world images representing abstract concepts. ... The Visual Abstractions Benchmark comprises 540 of such examples, with answers labeled by 5 human annotators from Prolific. ... It consists of 180 images and 3 questions per image, with a total of 540 test examples."
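The quoted counts compose as 180 images times 3 questions per image. A minimal sketch of how one might represent a benchmark entry follows; the field names are assumptions for illustration, not the benchmark's released schema.

```python
# Illustrative container for one VAB test example; field names are
# assumptions, not the benchmark's actual released format.
from dataclasses import dataclass, field


@dataclass
class VABExample:
    image_id: str
    question: str
    annotator_answers: list = field(default_factory=list)  # 5 Prolific labels


N_IMAGES, QUESTIONS_PER_IMAGE = 180, 3
TOTAL_EXAMPLES = N_IMAGES * QUESTIONS_PER_IMAGE  # 540 test examples
```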
Hardware Specification: Yes. "Open-sourced models, LLaVA (Liu et al., 2024) and InstructBLIP (Dai et al., 2024), and the API calls of the aforementioned integrated LLMs with APIs, were run inference-only with 1 A40 on an internal cluster."
Software Dependencies: No. The paper mentions specific models and APIs used, along with their publication years (e.g., "OpenAI's API for GPT-4o (OpenAI, 2024)", "LLaVA (Liu et al., 2024)"). However, it does not provide specific version numbers for underlying software components or libraries (e.g., Python, PyTorch, CUDA versions), which are required for full reproducibility.
Experiment Setup: Yes. "Illustrated in Figure 2, the DSG framework consists of three main steps: (1) extracting a schema of the concept, (2) hierarchically grounding the schema to the visual input, and (3) leveraging the resolved schema as input to the base VLM. ... All results are averaged over 5 runs. ... In Appendix C, we provide our prompt to the LLM and all extracted schemas, as well as results from a human study evaluating the quality of the generated schemas."
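The three steps quoted above can be sketched as a single pipeline function. The `extract_schema`, `ground`, and `answer` callables are placeholders standing in for the LLM and VLM calls; none of these names come from the paper's code, and the control flow is a hedged reading of the description, not the authors' implementation.

```python
# Hedged sketch of the three-step DSG pipeline: (1) extract schema,
# (2) hierarchically ground it, (3) answer with the resolved schema.
# All function names here are illustrative placeholders.


def topological_order(schema):
    """Order components so each is grounded after its dependencies."""
    order, seen = [], set()

    def visit(comp):
        if comp in seen:
            return
        seen.add(comp)
        for dep in schema[comp]:
            visit(dep)
        order.append(comp)

    for comp in schema:
        visit(comp)
    return order


def dsg_answer(image, concept, question, extract_schema, ground, answer):
    # (1) Extract a schema (a DAG of component concepts) with an LLM.
    schema = extract_schema(concept)  # e.g. {"layout": [], "entry-exit": ["layout"]}
    # (2) Hierarchically ground each component to the image, passing the
    #     already-resolved parent components as context.
    resolved = {}
    for comp in topological_order(schema):
        context = {dep: resolved[dep] for dep in schema[comp]}
        resolved[comp] = ground(image, comp, context)
    # (3) Pass the resolved schema alongside the question to the base VLM.
    return answer(image, question, resolved)
```

A usage sketch: calling `dsg_answer("maze.png", "maze", "Is this a maze?", ...)` with stub callables resolves `layout` and `walls` before `entry-exit`, then hands the fully resolved schema to the answering model.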