MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
Authors: Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, Md Rizwan Parvez
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In an evaluation of 30 foundation models, including Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro, none surpasses 67% accuracy; open-source models perform significantly worse, and all models lag more than 20% behind human performance. |
| Researcher Affiliation | Academia | ¹Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET); ²Statistics, Islamic University, Bangladesh; ³Bangladesh Computer Council (BCC); ⁴Faculty of Information Technology, Monash University, Melbourne, Australia; ⁵Qatar Computing Research Institute (QCRI). Correspondence to: Mahir Labib Dihan <EMAIL>, Mohammed Eunus Ali <EMAIL>, Md Rizwan Parvez <EMAIL>. |
| Pseudocode | No | The paper describes various methodologies and evaluation processes, including how ReAct agents interact with tools. However, it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like formatting. |
| Open Source Code | Yes | All the resources are available on the project website. We ensure the reproducibility of our results by providing the complete dataset and evaluation codes used in our experiments²·³·⁴. These resources include the inference process for LLMs, detailing parameters such as temperature, top-k, and top-p. Any updates or bug fixes will be made available in the repository to maintain long-term usability. ²https://github.com/MapEval/MapEval-Textual/ ³https://github.com/MapEval/MapEval-API/ ⁴https://github.com/MapEval/MapEval-Visual/ |
| Open Datasets | Yes | We introduce MapEval, a novel benchmark designed to evaluate the geo-spatial reasoning capabilities of foundation models and AI agents in complex map-based scenarios. MapEval addresses a critical gap in existing benchmarks by evaluating models' ability to process heterogeneous geospatial contexts, perform compositional reasoning, and interact with real-world map tools. It features three task types: API, Visual, and Textual... Comprising 700 unique multiple-choice questions across 180 cities and 54 countries... All the resources are available on the project website. We ensure the reproducibility of our results by providing the complete dataset and evaluation codes used in our experiments. |
| Dataset Splits | Yes | We randomly split the 300 MCQ questions into a training set of 97 questions and a test set of 203 questions. |
| Hardware Specification | No | We gratefully acknowledge the support of Qatar Computing Research Institute (QCRI) for providing access to APIs and computational resources, including GPU support, which were instrumental in conducting this research. |
| Software Dependencies | No | The paper mentions using Large Language Models (LLMs), Vision-Language Models (VLMs), and Google Maps APIs. However, it does not specify concrete version numbers for any software libraries or dependencies used in the experimental setup. |
| Experiment Setup | Yes | We prompt models with the respective context, question, tool usage documentation (only for MapEval-API), answer format guidelines, and choices. We assess LLMs for MapEval-Textual, VLMs for MapEval-Visual, and ReAct agents (Yao et al., 2023) (known for effective tool interaction (Zhuang et al., 2023)) built on various LLMs for MapEval-API, aligning each task with appropriate model types. Appendix I presents example prompts for all tasks. ... These resources include the inference process for LLMs, detailing parameters such as temperature, top-k, and top-p. |
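The 97/203 dataset split reported in the table can be sketched as a seeded random partition of the 300 MCQ question IDs. This is a minimal illustration, not the paper's released code; the seed value and the use of integer IDs are assumptions.

```python
import random

def split_mcq(question_ids, train_size=97, seed=42):
    """Randomly partition MCQ question IDs into train/test lists.

    Mirrors the 97/203 split of 300 questions described in the paper;
    the fixed seed is an assumption made here for reproducibility.
    """
    ids = list(question_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    return ids[:train_size], ids[train_size:]

train, test = split_mcq(range(300))
print(len(train), len(test))  # 97 203
```

A fixed seed makes the partition reproducible across runs, which is the property a reproducibility audit would check for.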
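The Experiment Setup row describes prompts built from a context, the question, tool-usage documentation (MapEval-API only), answer-format guidelines, and the choices, with sampling parameters (temperature, top-k, top-p) reported separately. A rough sketch of such prompt assembly follows; every function name, field, and parameter value here is hypothetical, not taken from the released repositories.

```python
def build_mcq_prompt(context, question, choices, tool_docs=None):
    """Assemble an MCQ prompt: context, optional tool documentation
    (API task only), the question, lettered choices, and an
    answer-format instruction. All names are illustrative."""
    parts = [f"Context:\n{context}"]
    if tool_docs:  # per the paper, tool docs appear only for MapEval-API
        parts.append(f"Available tools:\n{tool_docs}")
    parts.append(f"Question: {question}")
    letters = "ABCDEFGH"
    parts.append("\n".join(f"{letters[i]}. {c}"
                           for i, c in enumerate(choices)))
    parts.append("Answer with the letter of the correct option only.")
    return "\n\n".join(parts)

# Hypothetical sampling parameters of the kind the paper says it reports
SAMPLING = {"temperature": 0.0, "top_k": 1, "top_p": 1.0}

prompt = build_mcq_prompt(
    "Museum X is 2 km from Station Y.",
    "How far is Museum X from Station Y?",
    ["1 km", "2 km", "5 km"],
)
```

Greedy-style settings (temperature 0, top-k 1) are a common choice for benchmark evaluation because they make model outputs deterministic, though the paper's actual values would be found in its repositories.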