Do Large Language Models Truly Understand Geometric Structures?

Authors: Xiaofeng Wang, Yiming Wang, Wenhong Zhu, Rui Wang

ICLR 2025

Reproducibility variables, results, and supporting excerpts from the LLM's responses:
Research Type: Experimental
Evidence: "Using this benchmark, we conduct thorough evaluations of diverse LLMs and identify key limitations in understanding geometric structures. We further propose the Geometry Chain-of-Thought (GeoCoT) method, which enhances LLMs' ability to identify geometric relationships, resulting in significant performance improvements. Our work is accessible at https://github.com/banyedy/GeomRel. ... Extensive experiments on the benchmark demonstrate that: Current LLMs perform well in identifying simple geometric relationships but perform poorly in identifying complex structures, especially for Angle-based relationships."
Researcher Affiliation: Academia
Evidence: "Xiaofeng Wang, Yiming Wang, Wenhong Zhu, Rui Wang, Shanghai Jiao Tong University, EMAIL"
Pseudocode: Yes
Evidence:
  Algorithm 1: Merging Geometric Conditions
  1:  Input: list of conditions Chain
  2:  Output: merged condition c_merged
  3:  c_merged ← Chain[0]
  4:  for c in Chain[1:] do
  5:    consist(c, c_merged)
  6:    /* Modify the representation of elements in c to be compatible with c_merged */
  7:    if c[input] = c_merged[output] then
  8:      Add c[condition] to c_merged[condition]
  9:      c_merged[output] ← c[output]
  10:   end if
  11: end for
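The condition-merging step in Algorithm 1 can be sketched in Python. This is a minimal illustration, not the authors' implementation: the dict-based condition representation (keys "input", "output", "condition") and the assumption that element names are already consistent (so the `consist` step is a no-op) are our own simplifications.

```python
def merge_conditions(chain):
    """Merge a chain of geometric conditions into a single condition.

    Each condition is assumed to be a dict with keys "input", "output",
    and "condition" (a list of statements); this representation is an
    assumption based on the pseudocode in the paper.
    """
    merged = dict(chain[0])
    merged["condition"] = list(merged["condition"])
    for c in chain[1:]:
        # The pseudocode's consist(c, c_merged) step rewrites element
        # names in c to be compatible with the merged condition; here we
        # assume names are already consistent and skip it.
        if c["input"] == merged["output"]:
            merged["condition"].extend(c["condition"])
            merged["output"] = c["output"]
    return merged

# Hypothetical example chain: two conditions that link up end to end.
chain = [
    {"input": "A", "output": "B", "condition": ["AB = CD"]},
    {"input": "B", "output": "C", "condition": ["angle ABC = 90"]},
]
merged = merge_conditions(chain)
```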
Open Source Code: Yes
Evidence: "Our work is accessible at https://github.com/banyedy/GeomRel."
Open Datasets: Yes
Evidence: "To this end, we extract the sub-step of geometric relationship identification (GRI) from mainstream geometric problems and construct a dataset called GeomRel. It can serve as a minimal module for evaluating a model's ability to understand geometric structures. ... Our work is accessible at https://github.com/banyedy/GeomRel."
Dataset Splits: Yes
Evidence: "To further explore how LLMs can possess stronger GRI abilities, we try to fine-tune LLMs with our GeomRel. We split it into training, validation, and test sets in a ratio of 6:2:2, and fine-tune the Llama3-8B-Instruct model."
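A 6:2:2 train/validation/test split of the kind described can be sketched as follows. The shuffling, seed, and rounding behavior are assumptions for illustration; the paper does not specify how the split was performed.

```python
import random

def split_622(examples, seed=42):
    """Split a list of examples into train/val/test at a 6:2:2 ratio.

    The deterministic shuffle (fixed seed) and truncation-based sizing
    are assumptions; only the 6:2:2 ratio comes from the paper.
    """
    rng = random.Random(seed)
    items = list(examples)
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * 0.6)
    n_val = int(n * 0.2)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```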
Hardware Specification: No
Evidence: The paper does not provide specific hardware details such as GPU/CPU models, processor types, or detailed computer specifications used for running its experiments. It lists the LLMs evaluated but no associated hardware.
Software Dependencies: No
Evidence: The paper lists the language models used (e.g., GPT-4o, LLaMA3-8B-Instruct) and some training parameters, but it does not provide specific versions for general software dependencies such as Python, PyTorch, or other libraries used in the experimental setup.
Experiment Setup: Yes
Evidence: "Table 13 presents the complete list of hyperparameters applied to the models (including the hyperparameters of the fine-tuning operations) throughout the evaluation phase. For all baselines, we set temperature τ = 0. ... The hyperparameters include: 'temperature': 0, 'max_tokens': 1024, 'train_batch_size': 4, 'finetuning_type': lora, 'learning_rate': 1.0e-4, 'num_train_epochs': 10.0, 'bf16': true."
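Collected into one place, the reported hyperparameters look like the sketch below. The values come from the excerpt above; representing them as a Python dict, and any use with a specific fine-tuning framework, is our assumption.

```python
# Hyperparameters as reported in Table 13 of the paper (LoRA fine-tuning
# of Llama3-8B-Instruct; temperature 0 for all baselines). The dict form
# is illustrative only, not the authors' actual config file.
HYPERPARAMS = {
    "temperature": 0,
    "max_tokens": 1024,
    "train_batch_size": 4,
    "finetuning_type": "lora",
    "learning_rate": 1.0e-4,
    "num_train_epochs": 10.0,
    "bf16": True,
}
```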