Evaluating Spatial Understanding of Large Language Models

Authors: Yutaro Yamada, Yihan Bao, Andrew Kyle Lampinen, Jungo Kasai, Ilker Yildirim

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We design natural-language navigation tasks and evaluate the ability of LLMs (in particular GPT-3.5-turbo, GPT-4, and Llama2-series models) to represent and reason about spatial structures. These tasks reveal substantial variability in LLM performance across different spatial structures, including square, hexagonal, and triangular grids, rings, and trees. In extensive error analysis, we find that LLMs' mistakes reflect both spatial and non-spatial factors.
Researcher Affiliation | Academia | Yale University; Toyota Technological Institute at Chicago
Pseudocode | No | The paper describes the methodology in prose and figures (e.g., Figure 2 for example questions and answers) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our code and data are available at https://github.com/runopti/SpatialEvalLLM and https://huggingface.co/datasets/yyamada/SpatialEvalLLM.
Open Datasets | Yes | Our code and data are available at https://github.com/runopti/SpatialEvalLLM and https://huggingface.co/datasets/yyamada/SpatialEvalLLM. For each question, we randomly select object names from the ImageNet-1k labels to fill every location of the spatial grid, creating the underlying map.
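The map-generation procedure quoted above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the toy vocabulary stands in for the ImageNet-1k labels, and the question template is a hypothetical simplification of the paper's prompts.

```python
import random

def make_square_grid_map(labels, size=3, seed=0):
    """Fill every cell of a size x size grid with a distinct randomly chosen object name."""
    rng = random.Random(seed)
    chosen = rng.sample(labels, size * size)
    return [chosen[r * size:(r + 1) * size] for r in range(size)]

def navigation_question(grid, moves):
    """Describe the map and a walk over it, and ask which object the walk ends on.

    `moves` is a list of (dr, dc) unit steps starting from the top-left cell.
    Returns the prompt text and the gold answer.
    """
    rows = ["Row {}: {}".format(i + 1, ", ".join(row)) for i, row in enumerate(grid)]
    step_name = {(0, 1): "right", (0, -1): "left", (1, 0): "down", (-1, 0): "up"}
    r, c = 0, 0
    walk = []
    for dr, dc in moves:
        r, c = r + dr, c + dc
        walk.append(step_name[(dr, dc)])
    prompt = ("The map is a grid of objects.\n" + "\n".join(rows) +
              "\nYou start at the top-left object and move " +
              ", then ".join(walk) + ". What object are you on now?")
    return prompt, grid[r][c]

# Toy vocabulary standing in for the ImageNet-1k label set.
vocab = ["goldfish", "hammer", "teapot", "violin", "umbrella",
         "lighthouse", "backpack", "canoe", "snail"]
grid = make_square_grid_map(vocab, size=3, seed=0)
prompt, answer = navigation_question(grid, [(0, 1), (1, 0)])
```

The same scheme extends to the paper's other structures (rings, trees, hexagonal and triangular grids) by swapping the adjacency used to interpret each move.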
Dataset Splits | No | The paper reports how many evaluation samples were generated (e.g., "We collect a total of 6,100 prediction results...", "We prepare 200 samples for each area") but does not specify training, validation, or test splits, since it evaluates pre-trained LLMs rather than training a model.
Hardware Specification | No | The paper does not provide hardware details such as GPU models, CPU types, or memory used for the experiments; it only names the LLMs themselves (e.g., GPT-4, Llama2).
Software Dependencies | No | The paper lists the LLMs used (e.g., GPT-3.5-turbo-0301, GPT-4-0314, Llama2-7B) and their decoding parameters, but does not specify any programming languages, libraries, or other software dependencies with version numbers.
Experiment Setup | Yes | The decoding parameters we use are: frequency penalty = 0.0, presence penalty = 0.0, temperature = 1.0, top_p = 1.0. All Llama models are Llama-Chat models and the Code Llama model is the Instruct variant. The context window sizes for these models are 4,096 tokens, except for GPT-4, which has 8,192 tokens. We focus on zero-shot experiments, where we used the following system prompt: "You are given a task to solve. Make sure to output an answer after 'Answer:' without any explanation."
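The zero-shot setup above could be assembled into an API request as sketched below. This is a minimal sketch, not the paper's harness: the `build_request` helper and the model pinning are our assumptions, and the call convention follows the OpenAI Python client (v1-style `client.chat.completions.create`).

```python
# System prompt and decoding parameters as reported in the setup above.
SYSTEM_PROMPT = ("You are given a task to solve. Make sure to output an answer "
                 "after 'Answer:' without any explanation.")

def build_request(question, model="gpt-4-0314"):
    """Assemble keyword arguments for a zero-shot chat-completion query.

    Hypothetical helper: the paper does not show its request code; only the
    parameter values and system prompt are taken from the reported setup.
    """
    return dict(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        frequency_penalty=0.0,
        presence_penalty=0.0,
        temperature=1.0,
        top_p=1.0,
    )

# Example call (commented out; requires the `openai` package and an API key):
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(**build_request("What object are you on now?"))

req = build_request("What object are you on now?")
```

Note that with temperature = 1.0 and top_p = 1.0 the sampling is unconstrained, so repeated queries with the same prompt can yield different answers.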