MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

Authors: Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, Md Rizwan Parvez

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In an evaluation of 30 foundation models, including Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro, none surpasses 67% accuracy; open-source models perform significantly worse, and all models lag more than 20% behind human performance.
Researcher Affiliation | Academia | 1Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET); 2Statistics, Islamic University, Bangladesh; 3Bangladesh Computer Council (BCC); 4Faculty of Information Technology, Monash University, Melbourne, Australia; 5Qatar Computing Research Institute (QCRI). Correspondence to: Mahir Labib Dihan <EMAIL>, Mohammed Eunus Ali <EMAIL>, Md Rizwan Parvez <EMAIL>.
Pseudocode | No | The paper describes various methodologies and evaluation processes, including how ReAct agents interact with tools. However, it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like formatting.
Open Source Code | Yes | All the resources are available on the project website. We ensure the reproducibility of our results by providing the complete dataset and evaluation codes used in our experiments. These resources include the inference process for LLMs, detailing parameters such as temperature, top-k, and top-p. Any updates or bug fixes will be made available in the repository to maintain long-term usability. Repositories: https://github.com/MapEval/MapEval-Textual/, https://github.com/MapEval/MapEval-API/, https://github.com/MapEval/MapEval-Visual/
Open Datasets | Yes | We introduce MapEval, a novel benchmark designed to evaluate the geo-spatial reasoning capabilities of foundation models and AI agents in complex map-based scenarios. MapEval addresses a critical gap in existing benchmarks by evaluating models' ability to process heterogeneous geospatial contexts, perform compositional reasoning, and interact with real-world map tools. It features three task types: API, Visual, and Textual... Comprising 700 unique multiple-choice questions across 180 cities and 54 countries... All the resources are available on the project website. We ensure the reproducibility of our results by providing the complete dataset and evaluation codes used in our experiments.
Dataset Splits | Yes | We randomly split the 300 MCQ questions into a training set of 97 questions and a test set of 203 questions.
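The reported 97/203 split can be reproduced with a simple shuffle-and-slice; the seed value here is a hypothetical choice (the paper does not state one), so this is a sketch of the procedure rather than the authors' exact code.

```python
import random

def split_mcq(questions, train_size=97, seed=0):
    """Randomly split MCQ questions into train/test sets.

    The 97/203 split matches the paper; the seed is a
    hypothetical choice added here for reproducibility.
    """
    items = list(questions)
    rng = random.Random(seed)
    rng.shuffle(items)
    return items[:train_size], items[train_size:]

train, test = split_mcq(range(300))
print(len(train), len(test))  # 97 203
```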
Hardware Specification | No | We gratefully acknowledge the support of Qatar Computing Research Institute (QCRI) for providing access to APIs and computational resources, including GPU support, which were instrumental in conducting this research.
Software Dependencies | No | The paper mentions using Large Language Models (LLMs), Vision-Language Models (VLMs), and Google Maps APIs. However, it does not specify concrete version numbers for any software libraries or dependencies used in the experimental setup.
Experiment Setup | Yes | We prompt models with the respective context, question, tool usage documentation (only for MapEval-API), answer format guidelines, and choices. We assess LLMs for MapEval-Textual, VLMs for MapEval-Visual, and ReAct agents (Yao et al., 2023) (known for effective tool interaction (Zhuang et al., 2023)) built on various LLMs for MapEval-API, aligning each task with appropriate model types. Appendix I presents example prompts for all tasks. ... These resources include the inference process for LLMs, detailing parameters such as temperature, top-k, and top-p.
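The prompting scheme described above (context, question, tool documentation for the API track only, answer-format guidelines, then the choices) can be sketched as a prompt-assembly helper. Field names and layout below are illustrative assumptions, not the paper's exact template (which is in its Appendix I).

```python
def build_prompt(context, question, choices, answer_format,
                 tool_docs=None):
    """Assemble an MCQ prompt in the order the paper describes:
    context, question, tool documentation (MapEval-API only),
    answer-format guidelines, then the lettered choices.
    """
    parts = [f"Context:\n{context}", f"Question: {question}"]
    if tool_docs:  # supplied only when prompting MapEval-API agents
        parts.append(f"Tool documentation:\n{tool_docs}")
    parts.append(f"Answer format: {answer_format}")
    parts.append("Choices:\n" + "\n".join(
        f"{chr(65 + i)}. {c}" for i, c in enumerate(choices)))
    return "\n\n".join(parts)

prompt = build_prompt(
    context="Museum X is 1.2 km north of Station Y.",
    question="Which place is farther north?",
    choices=["Museum X", "Station Y"],
    answer_format="Reply with the option letter only.",
)
```

Omitting `tool_docs` yields the Textual-style prompt; passing it yields the API-style prompt, matching the paper's per-task alignment of model types.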