Do Large Language Models Truly Understand Geometric Structures?

Authors: Xiaofeng Wang, Yiming Wang, Wenhong Zhu, Rui Wang

ICLR 2025

Reproducibility variables, results, and supporting excerpts from the LLM's responses:
Research Type: Experimental
Evidence: "Using this benchmark, we conduct thorough evaluations of diverse LLMs and identify key limitations in understanding geometric structures. We further propose the Geometry Chain-of-Thought (GeoCoT) method, which enhances LLMs' ability to identify geometric relationships, resulting in significant performance improvements. Our work is accessible at https://github.com/banyedy/GeomRel. ... Extensive experiments on the benchmark demonstrate that: Current LLMs perform well in identifying simple geometric relationships but perform poorly in identifying complex structures, especially for Angle-based relationships."
Researcher Affiliation: Academia
Evidence: "Xiaofeng Wang, Yiming Wang, Wenhong Zhu, Rui Wang, Shanghai Jiao Tong University, EMAIL"
Pseudocode: Yes
Evidence:
  Algorithm 1: Merging Geometric Conditions
  1:  Input: list of conditions Chain
  2:  Output: merged condition c_merged
  3:  c_merged ← Chain[0]
  4:  for c in Chain[1:] do
  5:    consist(c, c_merged)
  6:    /* Modify the representation of elements in c to be compatible with c_merged */
  7:    if c[input] = c_merged[output] then
  8:      Add c[condition] to c_merged[condition]
  9:      c_merged[output] ← c[output]
  10:   end if
  11: end for
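The condition-merging step in Algorithm 1 can be sketched in Python. This is a minimal illustration, not the authors' implementation: the dict-based condition representation (keys "input", "output", "condition") and the assumption that element names are already consistent (so the `consist` step is a no-op) are our own simplifications.

```python
def merge_conditions(chain):
    """Merge a chain of geometric conditions into a single condition.

    Each condition is assumed to be a dict with keys "input", "output",
    and "condition" (a list of statements); this representation is an
    assumption based on the pseudocode in the paper.
    """
    merged = dict(chain[0])
    merged["condition"] = list(merged["condition"])
    for c in chain[1:]:
        # The pseudocode's consist(c, c_merged) step rewrites element
        # names in c to be compatible with the merged condition; here we
        # assume names are already consistent and skip it.
        if c["input"] == merged["output"]:
            merged["condition"].extend(c["condition"])
            merged["output"] = c["output"]
    return merged

# Hypothetical example chain: two conditions that link up end to end.
chain = [
    {"input": "A", "output": "B", "condition": ["AB = CD"]},
    {"input": "B", "output": "C", "condition": ["angle ABC = 90"]},
]
merged = merge_conditions(chain)
```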
Open Source Code: Yes
Evidence: "Our work is accessible at https://github.com/banyedy/GeomRel."
Open Datasets: Yes
Evidence: "To this end, we extract the sub-step of geometric relationship identification (GRI) from mainstream geometric problems and construct a dataset called GeomRel. It can serve as a minimal module for evaluating a model's ability to understand geometric structures. ... Our work is accessible at https://github.com/banyedy/GeomRel."
Dataset Splits: Yes
Evidence: "To further explore how LLMs can possess stronger GRI abilities, we try to fine-tune LLMs with our GeomRel. We split it into training, validation, and test sets in a ratio of 6:2:2, and fine-tune the Llama3-8B-Instruct model."
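A 6:2:2 train/validation/test split of the kind described can be sketched as follows. The shuffling, seed, and rounding behavior are assumptions for illustration; the paper does not specify how the split was performed.

```python
import random

def split_622(examples, seed=42):
    """Split a list of examples into train/val/test at a 6:2:2 ratio.

    The deterministic shuffle (fixed seed) and truncation-based sizing
    are assumptions; only the 6:2:2 ratio comes from the paper.
    """
    rng = random.Random(seed)
    items = list(examples)
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * 0.6)
    n_val = int(n * 0.2)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```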
Hardware Specification: No
Evidence: The paper does not provide specific hardware details such as GPU/CPU models, processor types, or detailed computer specifications used for running its experiments. It lists the LLMs evaluated but no associated hardware.
Software Dependencies: No
Evidence: The paper lists the language models used (e.g., GPT-4o, LLaMA3-8B-Instruct) and some training parameters, but it does not provide specific versions for general software dependencies such as Python, PyTorch, or other libraries used in the experimental setup.
Experiment Setup: Yes
Evidence: "Table 13 presents the complete list of hyperparameters applied to the models (including the hyperparameters of the fine-tuning operations) throughout the evaluation phase. For all baselines, we set temperature τ = 0. ... The hyperparameters include: 'temperature': 0, 'max_tokens': 1024, 'train_batch_size': 4, 'finetuning_type': lora, 'learning_rate': 1.0e-4, 'num_train_epochs': 10.0, 'bf16': true."
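Collected into one place, the reported hyperparameters look like the sketch below. The values come from the excerpt above; representing them as a Python dict, and any use with a specific fine-tuning framework, is our assumption.

```python
# Hyperparameters as reported in Table 13 of the paper (LoRA fine-tuning
# of Llama3-8B-Instruct; temperature 0 for all baselines). The dict form
# is illustrative only, not the authors' actual config file.
HYPERPARAMS = {
    "temperature": 0,
    "max_tokens": 1024,
    "train_batch_size": 4,
    "finetuning_type": "lora",
    "learning_rate": 1.0e-4,
    "num_train_epochs": 10.0,
    "bf16": True,
}
```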