Does Spatial Cognition Emerge in Frontier Models?
Authors: Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Krähenbühl, Vladlen Koltun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition. |
| Researcher Affiliation | Industry | Corresponding author: EMAIL |
| Pseudocode | No | The paper describes various tasks, experimental procedures, and prompting strategies for evaluating models, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps for the methods used by the models. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for their methodology or a link to a code repository. It mentions using third-party tools like Mazelib, Trimesh, Habitat simulator, and vLLM inference engine, but not their own implementation code for the SPACE benchmark or evaluations. |
| Open Datasets | Yes | We populate each environment with visual landmarks in the form of paintings hanging on the walls, where the painting frames are 3D meshes and the paintings are images from ImageNet (Deng et al., 2009). |
| Dataset Splits | No | The paper evaluates pre-trained frontier models on a new benchmark (SPACE) and describes how tasks and trials are generated for evaluation (e.g., randomizing multiple-choice options, running multiple independent trials for interactive tasks). However, it does not specify train/validation/test splits: no models are trained within the scope of the paper, and the SPACE benchmark itself is defined only through trial generation for evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running its experiments or evaluations. It mentions that "Some multimodal models ran out of memory on MCT and CSWM tasks;" which implies hardware limitations but does not specify the hardware itself. |
| Software Dependencies | No | The paper mentions several software components and libraries, including the Trimesh library, Habitat simulator, Mazelib, and vLLM inference engine, but it does not specify their version numbers. It also refers to various large language models and multimodal models by name and publication year, but these are not software dependencies with specific version numbers in the context of the paper's implementation. |
| Experiment Setup | Yes | We evaluate frontier models on each of the SPACE tasks using zero-shot prompting. For each task, we design a prompt that provides a detailed description of the task and the expected response format (see the appendix). Image preprocessing: For most of our experiments, we use square images. We provide the images to models as-is without preprocessing. The exact image resolution and aspect ratios are task-dependent and listed in Table 5. For egocentric video inputs in the large-scale spatial cognition tasks, the number of frames varies from 61 to 240. Since the GPT-4o, GPT-4V, and Claude 3.5 Sonnet APIs did not permit 240+ frames as inputs, we subsample the video frames by a factor of 2 before providing them to the model. |
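The frame-subsampling step described in the Experiment Setup row could be sketched as follows. This is a hypothetical illustration, not the authors' code: the names `MAX_FRAMES` and `subsample_frames` are assumptions, and the paper does not specify the exact threshold logic beyond "240+ frames" triggering a factor-of-2 subsample.

```python
# Hypothetical sketch of the frame-subsampling step: some model APIs
# (e.g., GPT-4o, GPT-4V, Claude 3.5 Sonnet) rejected 240+ image inputs,
# so longer clips are thinned by taking every 2nd frame.

MAX_FRAMES = 240  # assumed API limit on image inputs per request


def subsample_frames(frames, limit=MAX_FRAMES, factor=2):
    """Return every `factor`-th frame when the clip reaches the API limit;
    shorter clips pass through unchanged."""
    if len(frames) >= limit:
        return frames[::factor]
    return frames


# A 240-frame clip is reduced to 120 frames; a 61-frame clip is untouched.
long_clip = list(range(240))
short_clip = list(range(61))
print(len(subsample_frames(long_clip)))   # 120
print(len(subsample_frames(short_clip)))  # 61
```

Note that per-task frame counts in the paper range from 61 to 240, so under this sketch only the longest clips would be subsampled.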