GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
Authors: Renqiu Xia, Mingsheng Li, Hancheng Ye, Wenjie Wu, Hongbin Zhou, Jiakang Yuan, Tianshuo Peng, Xinyu Cai, Xiangchao Yan, Bin Wang, Conghui He, Botian Shi, Tao Chen, Junchi Yan, Bo Zhang
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that GeoX outperforms both generalists and geometric specialists on publicly recognized benchmarks, such as GeoQA, UniGeo, Geometry3K, and PGPS9K. In Sec. 4, we conduct extensive experiments on four widely recognized benchmarks to evaluate GeoX's ability in reasoning over complex and diverse geometric problems, where our approach achieves state-of-the-art results. Insightful analyses and ablation experiments are performed to further validate the effectiveness of our method. |
| Researcher Affiliation | Academia | 1 School of Computer Science & Artificial Intelligence, Shanghai Jiao Tong University, 2 Shanghai Artificial Intelligence Laboratory, 3 School of Information Science and Technology, Fudan University, 4 MMLab, The Chinese University of Hong Kong |
| Pseudocode | No | The paper describes the methodology in prose and figures (e.g., Figure 2 for overview) but does not include any explicitly labeled pseudocode or algorithm blocks for its internal mechanisms. The code-like sequences are the outputs generated by the model (e.g., 'Sum x+14 x-20 x x-10 42 21 29 360,' in Figure 4), which represent the solution steps rather than the model's algorithm in pseudocode format. |
| Open Source Code | Yes | Our code is available at https://github.com/Alpha-Innovator/GeoX |
| Open Datasets | Yes | To mitigate the deficiencies of the existing visual encoders in comprehending geometric images, we collect more than 120K diagrams from the web and electronic textbooks to equip ViT with prior knowledge of geometry, abbreviated as Geo-ViT. (footnote 1: https://huggingface.co/datasets/U4R/GeoX-data). We conduct experiments on four widely recognized geometry benchmarks: GeoQA (Chen et al., 2021), UniGeo (Chen et al., 2022), Geometry3K (Lu et al., 2021), and PGPS9K (Zhang et al., 2023c). |
| Dataset Splits | Yes | GeoQA comprises 4,998 geometry problems... divided into training, validation, and test sets at a ratio of 7.0:1.5:1.5. Geometry3K ... divided into training, validation, and test sets in a 0.7:0.1:0.2 ratio. PGDP5K ... divided into training, validation, and test sets with a 0.7:0.1:0.2 split. PGPS9K ... split into training and test sets, with 8,433 samples for training and 589 for testing. |
| Hardware Specification | Yes | We implement GeoX using PyTorch and conduct experiments on more than eight A100 (80GB) GPUs. We finetune these models on 4 A100 (80GB) GPUs, respectively. |
| Software Dependencies | No | The paper states 'We implement GeoX using PyTorch' but does not specify a version number for PyTorch or any other software library, which is required for reproducibility. |
| Experiment Setup | Yes | We optimize the diagram encoder using MAE ViT-B (He et al., 2022) checkpoints, training it for 800 epochs with a batch size of 256 and an initial learning rate of 6.4e-5. We initialize the symbol decoder with LLEMMA-7B (Azerbayev et al., 2023) weights and train it for 5 epochs with a batch size of 32 and an initial learning rate of 1e-6. For geometry-language alignment, we train the GS-Former for 360 epochs with a batch size of 256 and an initial learning rate of 1e-4. The number of queries in GS-Former is set to 8. During inference, we employ a beam search size of 10. Table 11 provides hyperparameters for end-to-end visual instruction tuning, including Training Batch Size 64, Scheduler Cosine Annealing, Optimizer AdamW, Warmup Ratio (0.05 or 0.03 depending on dataset), Epochs (100, 80, 45, or 30), Learning Rate (3e-5, 6e-5, or 2e-5), and Evaluation Steps (200 or 400). |
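The instruction-tuning hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is illustrative only: the field names and helper function are assumptions, not taken from the GeoX codebase, and since Table 11 does not pair each alternative value with a specific dataset, the alternatives are kept as lists rather than per-dataset mappings.

```python
# Hedged sketch of the end-to-end visual instruction tuning setup
# reported in the paper's Table 11. Only the numeric values come from
# the paper; the dict layout and helper below are illustrative.
INSTRUCTION_TUNING_CONFIG = {
    "train_batch_size": 64,
    "scheduler": "cosine_annealing",
    "optimizer": "AdamW",
    "warmup_ratio": [0.05, 0.03],        # one of these, depending on dataset
    "epochs": [100, 80, 45, 30],         # one of these, depending on dataset
    "learning_rate": [3e-5, 6e-5, 2e-5], # one of these, depending on dataset
    "eval_steps": [200, 400],            # one of these, depending on dataset
    "beam_search_size": 10,              # used at inference time
}

def warmup_steps(total_train_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps implied by a warmup ratio (illustrative helper)."""
    return int(total_train_steps * warmup_ratio)
```

For example, with 10,000 total training steps and the 0.05 warmup ratio, the scheduler would warm up over the first 500 steps before cosine annealing begins.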