GNS: Solving Plane Geometry Problems by Neural-Symbolic Reasoning with Multi-Modal LLMs

Authors: Maizhen Ning, Zihao Zhou, Qiufeng Wang, Xiaowei Huang, Kaizhu Huang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, our Phi3-Vision-based MLLM wins first place on the PGPs solving task of MathVista benchmark, outperforming GPT-4o, Gemini Ultra and other much larger MLLMs. While LLaVA-13B-based MLLM markedly exceeded other close-source and open-source MLLMs on the MathVerse benchmark and also achieved the new SOTA on GeoQA dataset.
Researcher Affiliation | Academia | Maizhen Ning* (1, 2), Zihao Zhou* (1, 2), Qiufeng Wang (1), Xiaowei Huang (2), Kaizhu Huang (3). (1) School of Advanced Technology, Xi'an Jiaotong-Liverpool University; (2) University of Liverpool; (3) Duke Kunshan University.
Pseudocode | No | The paper describes the methodology in prose, detailing the components of GNS (Knowledge Prediction, Symbolic Parsing, Problem Reasoning, and Symbolic Computation), often using mathematical formulations. However, it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks with structured steps.
Open Source Code | Yes | Project: https://github.com/ning-mz/GNS
Open Datasets | Yes | To handle the data issue, we construct a multi-task plane geometry problem related dataset GNS-260K, which is the largest PGPs dataset so far. [...] We construct GNS-260K based on two existing PGP datasets: PGPS9K (Zhang, Yin, and Liu 2023) and GeoQA+ (Cao and Xiao 2022).
Dataset Splits | Yes | To obtain a standard performance measurement with different MLLMs rather than simply test on a single base plane geometry problem dataset, we selected two benchmarks including MathVista (Lu et al. 2024b) and MathVerse (Zhang et al. 2024b). Specifically, we evaluate the Geometry Problem Solving task from the testmini set of MathVista (GPS) and the entire testmini set of MathVerse. [...] The testmini set in MathVerse has 3,940 samples in total and 64.7% of them are plane geometry problems. [...] All diagrams and base problems are from the existing PGP datasets including the training set of both PGPS9K (Zhang, Yin, and Liu 2023) and GeoQA+ (Cao and Xiao 2022).
Hardware Specification | Yes | trained on 4 NVIDIA A800 80GB GPUs.
Software Dependencies | No | we use a Python library SymPy (Meurer et al. 2017) as the symbolic computation tool. While the paper mentions SymPy as a Python library, it does not provide specific version numbers for either Python or SymPy.
Experiment Setup | Yes | We fully finetune the MLLMs with learning rate 5e-5 for DeepSeek-VL-1.3B and 3e-5 for the others, 2 epochs training, batch size 8 per GPU and trained on 4 NVIDIA A800 80GB GPUs.
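
The Software Dependencies entry names SymPy as the symbolic computation tool but gives no detail on how equations reach it. The sketch below is only an illustration of what a SymPy-based computation step might look like, not the GNS implementation: the helper `solve_unknown`, the two toy problems, and the equation format are all assumptions made for this example.

```python
from sympy import Eq, solve, symbols


def solve_unknown(equations, unknowns, target):
    """Solve a system of equations and return every value found for `target`.

    Hypothetical helper for illustration; the summary above does not
    describe the interface GNS uses to call SymPy.
    """
    solutions = solve(equations, unknowns, dict=True)
    return [sol[target] for sol in solutions if target in sol]


# Toy problem 1 (assumed): the angles of a triangle are x, 2x and 3x degrees.
x = symbols("x", positive=True)
angle_sum = [Eq(x + 2 * x + 3 * x, 180)]
print(solve_unknown(angle_sum, [x], x))  # [30]

# Toy problem 2 (assumed): two complementary angles that differ by 20 degrees.
a, b = symbols("a b", positive=True)
complementary = [Eq(a + b, 90), Eq(a - b, 20)]
print(solve_unknown(complementary, [a, b], a))  # [55]
```

Keeping the arithmetic in a symbolic solver means the numeric answer comes from exact computation rather than from text generated by the model, which appears to be the motivation for a separate Symbolic Computation component.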
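
The hyperparameters quoted in the Experiment Setup entry can be collected into a small configuration object, as in the sketch below. The class `FinetuneConfig` and the function `config_for` are hypothetical names introduced here. The global batch size is derived under the assumption of plain data parallelism with no gradient accumulation, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hypothetical container for the fine-tuning settings quoted above."""

    model_name: str
    learning_rate: float
    num_epochs: int = 2          # "2 epochs training"
    per_gpu_batch_size: int = 8  # "batch size 8 per GPU"
    num_gpus: int = 4            # "4 NVIDIA A800 80GB GPUs"

    @property
    def global_batch_size(self) -> int:
        # Assumption: data parallelism without gradient accumulation.
        return self.per_gpu_batch_size * self.num_gpus


def config_for(model_name: str) -> FinetuneConfig:
    """Pick the learning rate: 5e-5 for DeepSeek-VL-1.3B, 3e-5 for the others."""
    learning_rate = 5e-5 if model_name == "DeepSeek-VL-1.3B" else 3e-5
    return FinetuneConfig(model_name=model_name, learning_rate=learning_rate)


for name in ("DeepSeek-VL-1.3B", "Phi3-Vision", "LLaVA-13B"):
    cfg = config_for(name)
    print(cfg.model_name, cfg.learning_rate, cfg.global_batch_size)
```

Under the stated assumption, each optimizer step would see 8 × 4 = 32 samples for every model.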