reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

GNS: Solving Plane Geometry Problems by Neural-Symbolic Reasoning with Multi-Modal LLMs

Authors: Maizhen Ning, Zihao Zhou, Qiufeng Wang, Xiaowei Huang, Kaizhu Huang

AAAI 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In experiments, our Phi3-Vision-based MLLM wins first place on the PGPs solving task of MathVista benchmark, outperforming GPT-4o, Gemini Ultra and other much larger MLLMs. While LLaVA-13B-based MLLM markedly exceeded other close-source and open-source MLLMs on the MathVerse benchmark and also achieved the new SOTA on GeoQA dataset.
Researcher Affiliation	Academia	Maizhen Ning1,2, Zihao Zhou1,2, Qiufeng Wang1, Xiaowei Huang2, Kaizhu Huang3 1School of Advanced Technology, Xi'an Jiaotong-Liverpool University 2University of Liverpool 3Duke Kunshan University EMAIL, EMAIL
Pseudocode	No	The paper describes the methodology in prose, detailing the components of GNS: Knowledge Prediction, Symbolic Parsing, Problem Reasoning, and Symbolic Computation, often using mathematical formulations. However, it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks with structured steps.
Open Source Code	Yes	Project https://github.com/ning-mz/GNS
Open Datasets	Yes	To handle the data issue, we construct a multi-task plane geometry problem related dataset GNS-260K, which is the largest PGPs dataset so far. [...] We construct GNS-260K based on two existing PGP datasets: PGPS9K (Zhang, Yin, and Liu 2023) and GeoQA+ (Cao and Xiao 2022).
Dataset Splits	Yes	To obtain a standard performance measurement with different MLLMs rather than simply test on a single base plane geometry problem dataset, we selected two benchmarks including MathVista (Lu et al. 2024b) and MathVerse (Zhang et al. 2024b). Specifically, we evaluate the Geometry Problem Solving task from the testmini set of MathVista (GPS) and the entire testmini set of MathVerse. [...] The testmini set in MathVerse has 3,940 samples in total and 64.7% of them are plane geometry problems. [...] All diagrams and base problems are from the existing PGP datasets including the training set of both PGPS9K (Zhang, Yin, and Liu 2023) and GeoQA+ (Cao and Xiao 2022)
Hardware Specification	Yes	trained on 4 NVIDIA A800 80GB GPUs.
Software Dependencies	No	we use a Python library SymPy (Meurer et al. 2017) as the symbolic computation tool. While the paper mentions SymPy as a Python library, it does not provide specific version numbers for either Python or SymPy.
Experiment Setup	Yes	We fully finetune the MLLMs with learning rate 5e-5 for DeepSeek-VL-1.3B and 3e-5 for the others, 2 epochs training, batch size 8 per GPU and trained on 4 NVIDIA A800 80GB GPUs.
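
The hyperparameters quoted in the Experiment Setup row can be written down as a small configuration object. This is a minimal sketch, not code from the GNS repository: the `TrainConfig` class, its field names, and the `config_for` helper are hypothetical. The global batch size assumes plain data parallelism with no gradient accumulation, which the quoted text does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""
    model_name: str
    learning_rate: float
    epochs: int = 2                # "2 epochs training"
    per_gpu_batch_size: int = 8    # "batch size 8 per GPU"
    num_gpus: int = 4              # "4 NVIDIA A800 80GB GPUs"

    @property
    def global_batch_size(self) -> int:
        # Assumption: data parallelism only, no gradient accumulation.
        return self.per_gpu_batch_size * self.num_gpus


def config_for(model_name: str) -> TrainConfig:
    # Learning rate is 5e-5 for DeepSeek-VL-1.3B and 3e-5 for the other models.
    lr = 5e-5 if model_name == "DeepSeek-VL-1.3B" else 3e-5
    return TrainConfig(model_name=model_name, learning_rate=lr)
```

Under these assumptions, `config_for("LLaVA-13B").global_batch_size` evaluates to 32 (8 samples on each of 4 GPUs).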
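
The Software Dependencies row is marked "No" because no version numbers are reported. One way to close that kind of gap when reproducing a run is to record the interpreter and package versions next to the results. The sketch below uses only the Python standard library; the `dependency_snapshot` helper is hypothetical and is not part of the paper or its repository.

```python
import platform
from importlib import metadata


def dependency_snapshot(packages=("sympy",)):
    """Return the Python version plus the installed version of each package."""
    snapshot = {"python": platform.python_version()}
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly instead of failing.
            snapshot[name] = "not installed"
    return snapshot


if __name__ == "__main__":
    for name, version in dependency_snapshot().items():
        print(f"{name}=={version}")
```

Saving this output alongside experiment logs gives later readers the exact versions that a reproducibility check like the one above looks for.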