Evaluating Spatial Understanding of Large Language Models

Authors: Yutaro Yamada, Yihan Bao, Andrew Kyle Lampinen, Jungo Kasai, Ilker Yildirim

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We design natural-language navigation tasks and evaluate the ability of LLMs (in particular GPT-3.5-turbo, GPT-4, and Llama2-series models) to represent and reason about spatial structures. These tasks reveal substantial variability in LLM performance across different spatial structures, including square, hexagonal, and triangular grids, rings, and trees. In extensive error analysis, we find that LLMs' mistakes reflect both spatial and non-spatial factors.
Researcher Affiliation | Academia | Yale University; Toyota Technological Institute at Chicago
Pseudocode | No | The paper describes the methodology in prose and figures (e.g., Figure 2 for example questions and answers) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our code and data are available at https://github.com/runopti/SpatialEvalLLM and https://huggingface.co/datasets/yyamada/SpatialEvalLLM.
Open Datasets | Yes | Our code and data are available at https://github.com/runopti/SpatialEvalLLM and https://huggingface.co/datasets/yyamada/SpatialEvalLLM. For each question, we randomly select object names from the ImageNet-1k labels to fill every location of the spatial grid, creating the underlying map.
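The map-generation procedure quoted above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the toy vocabulary stands in for the ImageNet-1k labels, and the question template is a hypothetical simplification of the paper's prompts.

```python
import random

def make_square_grid_map(labels, size=3, seed=0):
    """Fill every cell of a size x size grid with a distinct randomly chosen object name."""
    rng = random.Random(seed)
    chosen = rng.sample(labels, size * size)
    return [chosen[r * size:(r + 1) * size] for r in range(size)]

def navigation_question(grid, moves):
    """Describe the map and a walk over it, and ask which object the walk ends on.

    `moves` is a list of (dr, dc) unit steps starting from the top-left cell.
    Returns the prompt text and the gold answer.
    """
    rows = ["Row {}: {}".format(i + 1, ", ".join(row)) for i, row in enumerate(grid)]
    step_name = {(0, 1): "right", (0, -1): "left", (1, 0): "down", (-1, 0): "up"}
    r, c = 0, 0
    walk = []
    for dr, dc in moves:
        r, c = r + dr, c + dc
        walk.append(step_name[(dr, dc)])
    prompt = ("The map is a grid of objects.\n" + "\n".join(rows) +
              "\nYou start at the top-left object and move " +
              ", then ".join(walk) + ". What object are you on now?")
    return prompt, grid[r][c]

# Toy vocabulary standing in for the ImageNet-1k label set.
vocab = ["goldfish", "hammer", "teapot", "violin", "umbrella",
         "lighthouse", "backpack", "canoe", "snail"]
grid = make_square_grid_map(vocab, size=3, seed=0)
prompt, answer = navigation_question(grid, [(0, 1), (1, 0)])
```

The same scheme extends to the paper's other structures (rings, trees, hexagonal and triangular grids) by swapping the adjacency used to interpret each move.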
Dataset Splits | No | The paper reports how many evaluation samples were generated (e.g., "We collect a total of 6,100 prediction results...", "We prepare 200 samples for each area") but does not specify training, validation, or test splits, since it evaluates pre-trained LLMs rather than training a model.
Hardware Specification | No | The paper does not provide hardware details such as GPU models, CPU types, or memory used for the experiments; it only names the LLMs themselves (e.g., GPT-4, Llama2).
Software Dependencies | No | The paper lists the LLMs used (e.g., GPT-3.5-turbo-0301, GPT-4-0314, Llama2-7B) and their decoding parameters, but does not specify any programming languages, libraries, or other software dependencies with version numbers.
Experiment Setup | Yes | The decoding parameters we use are: frequency penalty = 0.0, presence penalty = 0.0, temperature = 1.0, top_p = 1.0. All Llama models are Llama-Chat models and the Code Llama model is the Instruct variant. The context window sizes for these models are 4,096 tokens, except for GPT-4, which has 8,192 tokens. We focus on zero-shot experiments, where we used the following system prompt: "You are given a task to solve. Make sure to output an answer after 'Answer:' without any explanation."
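The zero-shot setup above could be assembled into an API request as sketched below. This is a minimal sketch, not the paper's harness: the `build_request` helper and the model pinning are our assumptions, and the call convention follows the OpenAI Python client (v1-style `client.chat.completions.create`).

```python
# System prompt and decoding parameters as reported in the setup above.
SYSTEM_PROMPT = ("You are given a task to solve. Make sure to output an answer "
                 "after 'Answer:' without any explanation.")

def build_request(question, model="gpt-4-0314"):
    """Assemble keyword arguments for a zero-shot chat-completion query.

    Hypothetical helper: the paper does not show its request code; only the
    parameter values and system prompt are taken from the reported setup.
    """
    return dict(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        frequency_penalty=0.0,
        presence_penalty=0.0,
        temperature=1.0,
        top_p=1.0,
    )

# Example call (commented out; requires the `openai` package and an API key):
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(**build_request("What object are you on now?"))

req = build_request("What object are you on now?")
```

Note that with temperature = 1.0 and top_p = 1.0 the sampling is unconstrained, so repeated queries with the same prompt can yield different answers.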