Do Large Language Models Truly Understand Geometric Structures?
Authors: Xiaofeng Wang, Yiming Wang, Wenhong Zhu, Rui Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using this benchmark, we conduct thorough evaluations of diverse LLMs and identify key limitations in understanding geometric structures. We further propose the Geometry Chain-of-Thought (GeoCoT) method, which enhances LLMs' ability to identify geometric relationships, resulting in significant performance improvements. Our work is accessible at https://github.com/banyedy/GeomRel. ... Extensive experiments on the benchmark demonstrate that: current LLMs perform well in identifying simple geometric relationships but poorly in identifying complex structures, especially Angle-based relationships. |
| Researcher Affiliation | Academia | Xiaofeng Wang, Yiming Wang, Wenhong Zhu, Rui Wang (Shanghai Jiao Tong University) |
| Pseudocode | Yes | Algorithm 1 (Merging Geometric Conditions). Input: list of conditions `Chain`; Output: merged condition `c_merged`. 1: `c_merged ← Chain[0]`; 2: for `c` in `Chain[1:]` do; 3: `consist(c, c_merged)` /* modify the representation of elements in `c` to be compatible with `c_merged` */; 4: if `c[input] = c_merged[output]` then; 5: add `c[condition]` to `c_merged[condition]`; 6: `c_merged[output] ← c[output]`; 7: end if; 8: end for |
| Open Source Code | Yes | Our work is accessible at https://github.com/banyedy/GeomRel. |
| Open Datasets | Yes | To this end, we extract the sub-step of geometric relationship identification (GRI) from mainstream geometric problems and construct a dataset called GeomRel. It can serve as a minimal module for evaluating a model's ability to understand geometric structures. ... Our work is accessible at https://github.com/banyedy/GeomRel. |
| Dataset Splits | Yes | To further explore how LLMs can acquire stronger GRI abilities, we fine-tune LLMs with our GeomRel. We split it into training, validation, and test sets in a ratio of 6:2:2, and fine-tune the Llama3-8B-Instruct model. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or detailed computer specifications used for running its experiments. It lists various LLMs evaluated but without associated hardware. |
| Software Dependencies | No | The paper lists various language models used (e.g., GPT-4o, LLaMA3-8B-Instruct) and some training parameters, but it does not provide specific versions for general software dependencies like Python, PyTorch, or other libraries used in the experimental setup. |
| Experiment Setup | Yes | Table 13 presents the complete list of hyperparameters applied to the models (including the hyperparameters of the fine-tuning operations) throughout the evaluation phase. For all baselines, we set temperature τ = 0. ... The hyperparameters include: 'temperature': 0, 'max_tokens': 1024, 'train_batch_size': 4, 'finetuning_type': lora, 'learning_rate': 1.0e-4, 'num_train_epochs': 10.0, 'bf16': true. |
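The quoted fine-tuning hyperparameters resemble a LoRA fine-tuning configuration file. A hedged sketch, using only the keys and values quoted from the paper (the file layout itself is an assumption, not taken from the paper):

```yaml
# Hypothetical fine-tuning config assembled from the quoted
# hyperparameters (Table 13); only these key/value pairs are
# stated in the paper.
finetuning_type: lora
learning_rate: 1.0e-4
num_train_epochs: 10.0
train_batch_size: 4
bf16: true
# Evaluation-time decoding, applied to all baselines:
# temperature: 0, max_tokens: 1024
```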
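The pseudocode quoted above (Algorithm 1, Merging Geometric Conditions) can be sketched as runnable Python. This is a minimal interpretation, not the authors' implementation: conditions are modeled as dicts with `input`, `output`, and `condition` keys, and `consist` (the paper's representation-alignment step) is stubbed as a pass-through.

```python
def consist(c, c_merged):
    """Align the element representation of c with c_merged.

    Stubbed here; the paper describes this as modifying c so its
    elements are compatible with c_merged.
    """
    return c


def merge_conditions(chain):
    """Merge a list of geometric conditions following Algorithm 1."""
    c_merged = dict(chain[0])
    c_merged["condition"] = list(c_merged["condition"])
    for c in chain[1:]:
        c = consist(c, c_merged)
        # Chain only when this condition consumes the current merged output.
        if c["input"] == c_merged["output"]:
            c_merged["condition"].extend(c["condition"])
            c_merged["output"] = c["output"]
    return c_merged
```

For example, two conditions linking A→B and B→C merge into a single A→C condition whose `condition` list concatenates both premises.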
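The 6:2:2 train/validation/test split described for the fine-tuning experiments can be reproduced with a few lines of Python. The shuffle and seed are assumptions; the paper specifies only the ratio.

```python
import random


def split_622(examples, seed=42):
    """Split a dataset into train/val/test at a 6:2:2 ratio.

    The shuffle and the seed value are assumptions; only the
    ratio is stated in the paper.
    """
    rng = random.Random(seed)
    data = list(examples)
    rng.shuffle(data)
    n = len(data)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]
```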