What Makes a Maze Look Like a Maze?

Authors: Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Noah D. Goodman, Jiajun Wu

ICLR 2025

Reproducibility assessment (each variable below lists the result and the supporting LLM response):
Research Type: Experimental. "We systematically evaluate DSG and different methods in reasoning on our new Visual Abstractions Benchmark, which consists of diverse, real-world images of abstract concepts and corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions. We evaluate Deep Schema Grounding on the Visual Abstractions Benchmark, and show that DSG consistently improves performance of vision-language models across question types, abstract concept categories, and base models."
Researcher Affiliation: Academia. Joy Hsu (Stanford University, EMAIL), Jiayuan Mao (MIT, EMAIL), Joshua B. Tenenbaum (MIT, EMAIL), Noah D. Goodman (Stanford University, EMAIL), Jiajun Wu (Stanford University, EMAIL).
Pseudocode: Yes. "A visual abstraction schema is a concise program that defines a directed acyclic graph (DAG) representation of a particular concept. As illustrated in Figure 2, each node in the schema corresponds to a subcomponent concept of the higher-level abstract concept. For example, the formation of a maze can be decomposed into three components: the layout, the construction of the walls, and the positioning of the entry and exit of the maze. The dependencies among individual components yield a DAG configuration; in this case, the placement of the entry and the exit of the maze depends on the layout of the maze. (Figure 2: gen(concept=maze) = gen(layout | concept=maze), gen(walls | concept=maze), gen(entry-exit | concept=maze, layout))"
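The schema-as-DAG idea quoted above can be sketched in a few lines of Python. The `SchemaNode` structure and `grounding_order` helper below are illustrative assumptions, not the authors' implementation; only the maze decomposition (entry-exit depends on layout) comes from the paper's Figure 2.

```python
# Minimal sketch of a visual abstraction schema as a DAG of component
# concepts. All names here are illustrative, not the paper's actual code.
from dataclasses import dataclass, field


@dataclass
class SchemaNode:
    name: str
    depends_on: list = field(default_factory=list)  # parent components


# Maze schema from Figure 2: entry-exit depends on layout.
maze_schema = {
    "layout": SchemaNode("layout"),
    "walls": SchemaNode("walls"),
    "entry-exit": SchemaNode("entry-exit", depends_on=["layout"]),
}


def grounding_order(schema):
    """Topologically sort components so each comes after its dependencies."""
    order, visited = [], set()

    def visit(name):
        if name in visited:
            return
        visited.add(name)
        for dep in schema[name].depends_on:
            visit(dep)
        order.append(name)

    for name in schema:
        visit(name)
    return order


print(grounding_order(maze_schema))  # "layout" precedes "entry-exit"
```

The topological order is what makes hierarchical grounding well-defined: a dependent component is only resolved once its parents are.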
Open Source Code: No. The paper explicitly states the release of the Visual Abstractions Benchmark dataset, but there is no explicit statement or link for open-sourcing the code of the Deep Schema Grounding (DSG) method itself.
Open Datasets: Yes. "To investigate the capabilities of models in understanding visual abstractions, we introduce the Visual Abstractions Benchmark (VAB). VAB is a visual question-answering benchmark that consists of diverse, real-world images representing abstract concepts. ... We present examples for each type of question along with corresponding images and answers in Appendix B, and release our benchmark here."
Dataset Splits: Yes. "VAB is a visual question-answering benchmark that consists of diverse, real-world images representing abstract concepts. ... The Visual Abstractions Benchmark comprises 540 of such examples, with answers labeled by 5 human annotators from Prolific. ... It consists of 180 images and 3 questions per image, with a total of 540 test examples."
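The quoted counts compose as 180 images times 3 questions per image. A minimal sketch of how one might represent a benchmark entry follows; the field names are assumptions for illustration, not the benchmark's released schema.

```python
# Illustrative container for one VAB test example; field names are
# assumptions, not the benchmark's actual released format.
from dataclasses import dataclass, field


@dataclass
class VABExample:
    image_id: str
    question: str
    annotator_answers: list = field(default_factory=list)  # 5 Prolific labels


N_IMAGES, QUESTIONS_PER_IMAGE = 180, 3
TOTAL_EXAMPLES = N_IMAGES * QUESTIONS_PER_IMAGE  # 540 test examples
```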
Hardware Specification: Yes. "Open-sourced models, LLaVA (Liu et al., 2024) and InstructBLIP (Dai et al., 2024), and the API calls of the aforementioned integrated LLMs with APIs, were run inference-only with 1 A40 on an internal cluster."
Software Dependencies: No. The paper mentions specific models and APIs used, along with their publication years (e.g., "OpenAI's API for GPT-4o (OpenAI, 2024)", "LLaVA (Liu et al., 2024)"). However, it does not provide specific version numbers for underlying software components or libraries (e.g., Python, PyTorch, CUDA versions), which are required for full reproducibility.
Experiment Setup: Yes. "Illustrated in Figure 2, the DSG framework consists of three main steps: (1) extracting a schema of the concept, (2) hierarchically grounding the schema to the visual input, and (3) leveraging the resolved schema as input to the base VLM. ... All results are averaged over 5 runs. ... In Appendix C, we provide our prompt to the LLM and all extracted schemas, as well as results from a human study evaluating the quality of the generated schemas."
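The three steps quoted above can be sketched as a single pipeline function. The `extract_schema`, `ground`, and `answer` callables are placeholders standing in for the LLM and VLM calls; none of these names come from the paper's code, and the control flow is a hedged reading of the description, not the authors' implementation.

```python
# Hedged sketch of the three-step DSG pipeline: (1) extract schema,
# (2) hierarchically ground it, (3) answer with the resolved schema.
# All function names here are illustrative placeholders.


def topological_order(schema):
    """Order components so each is grounded after its dependencies."""
    order, seen = [], set()

    def visit(comp):
        if comp in seen:
            return
        seen.add(comp)
        for dep in schema[comp]:
            visit(dep)
        order.append(comp)

    for comp in schema:
        visit(comp)
    return order


def dsg_answer(image, concept, question, extract_schema, ground, answer):
    # (1) Extract a schema (a DAG of component concepts) with an LLM.
    schema = extract_schema(concept)  # e.g. {"layout": [], "entry-exit": ["layout"]}
    # (2) Hierarchically ground each component to the image, passing the
    #     already-resolved parent components as context.
    resolved = {}
    for comp in topological_order(schema):
        context = {dep: resolved[dep] for dep in schema[comp]}
        resolved[comp] = ground(image, comp, context)
    # (3) Pass the resolved schema alongside the question to the base VLM.
    return answer(image, question, resolved)
```

A usage sketch: calling `dsg_answer("maze.png", "maze", "Is this a maze?", ...)` with stub callables resolves `layout` and `walls` before `entry-exit`, then hands the fully resolved schema to the answering model.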