Does Spatial Cognition Emerge in Frontier Models?
Authors: Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Krähenbühl, Vladlen Koltun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition. |
| Researcher Affiliation | Industry | Corresponding author: EMAIL |
| Pseudocode | No | The paper describes various tasks, experimental procedures, and prompting strategies for evaluating models, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps for the methods used by the models. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for their methodology or a link to a code repository. It mentions using third-party tools like Mazelib, Trimesh, Habitat simulator, and vLLM inference engine, but not their own implementation code for the SPACE benchmark or evaluations. |
| Open Datasets | Yes | We populate each environment with visual landmarks in the form of paintings hanging on the walls, where the painting frames are 3D meshes and the paintings are images from ImageNet (Deng et al., 2009). |
| Dataset Splits | No | The paper evaluates pre-trained frontier models on a new benchmark (SPACE) and describes how tasks and trials are generated for evaluation (e.g., randomizing multiple-choice options, running multiple independent trials for interactive tasks). However, it does not specify train/validation/test splits: no models are trained within the scope of the paper, and the SPACE benchmark itself is defined only through trial generation for evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running its experiments or evaluations. It mentions that "Some multimodal models ran out of memory on MCT and CSWM tasks;" which implies hardware limitations but does not specify the hardware itself. |
| Software Dependencies | No | The paper mentions several software components and libraries, including the Trimesh library, Habitat simulator, Mazelib, and vLLM inference engine, but it does not specify their version numbers. It also refers to various large language models and multimodal models by name and publication year, but these are not software dependencies with specific version numbers in the context of the paper's implementation. |
| Experiment Setup | Yes | We evaluate frontier models on each of the SPACE tasks using zero-shot prompting. For each task, we design a prompt that provides a detailed description of the task and the expected response format (see the appendix). Image preprocessing: For most of our experiments, we use square images. We provide the images to models as-is without preprocessing. The exact image resolution and aspect ratios are task-dependent and listed in Table 5. For egocentric video inputs in the large-scale spatial cognition tasks, the number of frames varies from 61 to 240. Since the GPT-4o, GPT-4V, and Claude 3.5 Sonnet APIs did not permit 240+ frames as inputs, we subsample the video frames by a factor of 2 before providing them to the model. |
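The frame-subsampling step described in the Experiment Setup row could be sketched as follows. This is a hypothetical illustration, not the authors' code: the names `MAX_FRAMES` and `subsample_frames` are assumptions, and the paper does not specify the exact threshold logic beyond "240+ frames" triggering a factor-of-2 subsample.

```python
# Hypothetical sketch of the frame-subsampling step: some model APIs
# (e.g., GPT-4o, GPT-4V, Claude 3.5 Sonnet) rejected 240+ image inputs,
# so longer clips are thinned by taking every 2nd frame.

MAX_FRAMES = 240  # assumed API limit on image inputs per request


def subsample_frames(frames, limit=MAX_FRAMES, factor=2):
    """Return every `factor`-th frame when the clip reaches the API limit;
    shorter clips pass through unchanged."""
    if len(frames) >= limit:
        return frames[::factor]
    return frames


# A 240-frame clip is reduced to 120 frames; a 61-frame clip is untouched.
long_clip = list(range(240))
short_clip = list(range(61))
print(len(subsample_frames(long_clip)))   # 120
print(len(subsample_frames(short_clip)))  # 61
```

Note that per-task frame counts in the paper range from 61 to 240, so under this sketch only the longest clips would be subsampled.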