MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

Authors: Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, Md Rizwan Parvez

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In an evaluation of 30 foundation models, including Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro, none surpasses 67% accuracy; open-source models perform significantly worse, and all models lag more than 20% behind human performance.
Researcher Affiliation | Academia | 1Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET); 2Statistics, Islamic University, Bangladesh; 3Bangladesh Computer Council (BCC); 4Faculty of Information Technology, Monash University, Melbourne, Australia; 5Qatar Computing Research Institute (QCRI). Correspondence to: Mahir Labib Dihan <EMAIL>, Mohammed Eunus Ali <EMAIL>, Md Rizwan Parvez <EMAIL>.
Pseudocode | No | The paper describes various methodologies and evaluation processes, including how ReAct agents interact with tools. However, it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like formatting.
Open Source Code | Yes | All the resources are available on the project website. We ensure the reproducibility of our results by providing the complete dataset and evaluation codes used in our experiments. These resources include the inference process for LLMs, detailing parameters such as temperature, top-k, and top-p. Any updates or bug fixes will be made available in the repository to maintain long-term usability. Repositories: https://github.com/MapEval/MapEval-Textual/, https://github.com/MapEval/MapEval-API/, https://github.com/MapEval/MapEval-Visual/
Open Datasets | Yes | We introduce MapEval, a novel benchmark designed to evaluate the geo-spatial reasoning capabilities of foundation models and AI agents in complex map-based scenarios. MapEval addresses a critical gap in existing benchmarks by evaluating models' ability to process heterogeneous geospatial contexts, perform compositional reasoning, and interact with real-world map tools. It features three task types: API, Visual, and Textual... Comprising 700 unique multiple-choice questions across 180 cities and 54 countries... All the resources are available on the project website. We ensure the reproducibility of our results by providing the complete dataset and evaluation codes used in our experiments.
Dataset Splits | Yes | We randomly split the 300 MCQ questions into a training set of 97 questions and a test set of 203 questions.
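The reported 97/203 split can be reproduced with a simple shuffle-and-slice; the seed value here is a hypothetical choice (the paper does not state one), so this is a sketch of the procedure rather than the authors' exact code.

```python
import random

def split_mcq(questions, train_size=97, seed=0):
    """Randomly split MCQ questions into train/test sets.

    The 97/203 split matches the paper; the seed is a
    hypothetical choice added here for reproducibility.
    """
    items = list(questions)
    rng = random.Random(seed)
    rng.shuffle(items)
    return items[:train_size], items[train_size:]

train, test = split_mcq(range(300))
print(len(train), len(test))  # 97 203
```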
Hardware Specification | No | We gratefully acknowledge the support of Qatar Computing Research Institute (QCRI) for providing access to APIs and computational resources, including GPU support, which were instrumental in conducting this research.
Software Dependencies | No | The paper mentions using Large Language Models (LLMs), Vision-Language Models (VLMs), and Google Maps APIs. However, it does not specify concrete version numbers for any software libraries or dependencies used in the experimental setup.
Experiment Setup | Yes | We prompt models with the respective context, question, tool usage documentation (only for MapEval-API), answer format guidelines, and choices. We assess LLMs for MapEval-Textual, VLMs for MapEval-Visual, and ReAct agents (Yao et al., 2023) (known for effective tool interaction (Zhuang et al., 2023)) built on various LLMs for MapEval-API, aligning each task with appropriate model types. Appendix I presents example prompts for all tasks. ... These resources include the inference process for LLMs, detailing parameters such as temperature, top-k, and top-p.
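The prompting scheme described above (context, question, tool documentation for the API track only, answer-format guidelines, then the choices) can be sketched as a prompt-assembly helper. Field names and layout below are illustrative assumptions, not the paper's exact template (which is in its Appendix I).

```python
def build_prompt(context, question, choices, answer_format,
                 tool_docs=None):
    """Assemble an MCQ prompt in the order the paper describes:
    context, question, tool documentation (MapEval-API only),
    answer-format guidelines, then the lettered choices.
    """
    parts = [f"Context:\n{context}", f"Question: {question}"]
    if tool_docs:  # supplied only when prompting MapEval-API agents
        parts.append(f"Tool documentation:\n{tool_docs}")
    parts.append(f"Answer format: {answer_format}")
    parts.append("Choices:\n" + "\n".join(
        f"{chr(65 + i)}. {c}" for i, c in enumerate(choices)))
    return "\n\n".join(parts)

prompt = build_prompt(
    context="Museum X is 1.2 km north of Station Y.",
    question="Which place is farther north?",
    choices=["Museum X", "Station Y"],
    answer_format="Reply with the option letter only.",
)
```

Omitting `tool_docs` yields the Textual-style prompt; passing it yields the API-style prompt, matching the paper's per-task alignment of model types.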