GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
Authors: Renqiu Xia, Mingsheng Li, Hancheng Ye, Wenjie Wu, Hongbin Zhou, Jiakang Yuan, Tianshuo Peng, Xinyu Cai, Xiangchao Yan, Bin Wang, Conghui He, Botian Shi, Tao Chen, Junchi Yan, Bo Zhang
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that GeoX outperforms both generalists and geometric specialists on publicly recognized benchmarks, such as GeoQA, UniGeo, Geometry3K, and PGPS9K. In Sec. 4, we conduct extensive experiments on four widely recognized benchmarks to evaluate GeoX's ability in reasoning over complex and diverse geometric problems, where our approach achieves state-of-the-art results. Insightful analyses and ablation experiments are performed to further validate the effectiveness of our method. |
| Researcher Affiliation | Academia | 1 School of Computer Science & Artificial Intelligence, Shanghai Jiao Tong University, 2 Shanghai Artificial Intelligence Laboratory, 3 School of Information Science and Technology, Fudan University, 4 MMLab, The Chinese University of Hong Kong |
| Pseudocode | No | The paper describes the methodology in prose and figures (e.g., Figure 2 for overview) but does not include any explicitly labeled pseudocode or algorithm blocks for its internal mechanisms. The code-like sequences are the outputs generated by the model (e.g., 'Sum x+14 x-20 x x-10 42 21 29 360,' in Figure 4), which represent the solution steps rather than the model's algorithm in pseudocode format. |
| Open Source Code | Yes | Our code is available at https://github.com/Alpha-Innovator/GeoX |
| Open Datasets | Yes | To mitigate the deficiencies of the existing visual encoders in comprehending geometric images, we collect more than 120K diagrams from the web and electronic textbooks to equip ViT with prior knowledge of geometry, abbreviated as Geo-ViT. (footnote 1: https://huggingface.co/datasets/U4R/GeoX-data). We conduct experiments on four widely recognized geometry benchmarks: GeoQA (Chen et al., 2021), UniGeo (Chen et al., 2022), Geometry3K (Lu et al., 2021), and PGPS9K (Zhang et al., 2023c). |
| Dataset Splits | Yes | GeoQA comprises 4,998 geometry problems... divided into training, validation, and test sets at a ratio of 7.0:1.5:1.5. Geometry3K ... divided into training, validation, and test sets in a 0.7:0.1:0.2 ratio. PGDP5K ... divided into training, validation, and test sets with a 0.7:0.1:0.2 split. PGPS9K ... split into training and test sets, with 8,433 samples for training and 589 for testing. |
| Hardware Specification | Yes | We implement GeoX using PyTorch and conduct experiments on more than eight A100 (80GB) GPUs. We finetune these models on 4 A100 (80GB) GPUs, respectively. |
| Software Dependencies | No | The paper states 'We implement GeoX using PyTorch' but does not specify a version number for PyTorch or any other software library, which is required for reproducibility. |
| Experiment Setup | Yes | We optimize the diagram encoder using MAE ViT-B (He et al., 2022) checkpoints, training it for 800 epochs with a batch size of 256 and an initial learning rate of 6.4e-5. We initialize the symbol decoder with LLEMMA-7B (Azerbayev et al., 2023) weights and train it for 5 epochs with a batch size of 32 and an initial learning rate of 1e-6. For geometry-language alignment, we train the GS-Former for 360 epochs with a batch size of 256 and an initial learning rate of 1e-4. The number of queries in GS-Former is set to 8. During inference, we employ a beam search size of 10. Table 11 provides hyperparameters for end-to-end visual instruction tuning, including Training Batch Size 64, Scheduler Cosine Annealing, Optimizer AdamW, Warmup Ratio (0.05 or 0.03 depending on dataset), Epochs (100, 80, 45, or 30), Learning Rate (3e-5, 6e-5, or 2e-5), and Evaluation Steps (200 or 400). |
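The instruction-tuning hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is illustrative only: the field names and helper function are assumptions, not taken from the GeoX codebase, and since Table 11 does not pair each alternative value with a specific dataset, the alternatives are kept as lists rather than per-dataset mappings.

```python
# Hedged sketch of the end-to-end visual instruction tuning setup
# reported in the paper's Table 11. Only the numeric values come from
# the paper; the dict layout and helper below are illustrative.
INSTRUCTION_TUNING_CONFIG = {
    "train_batch_size": 64,
    "scheduler": "cosine_annealing",
    "optimizer": "AdamW",
    "warmup_ratio": [0.05, 0.03],        # one of these, depending on dataset
    "epochs": [100, 80, 45, 30],         # one of these, depending on dataset
    "learning_rate": [3e-5, 6e-5, 2e-5], # one of these, depending on dataset
    "eval_steps": [200, 400],            # one of these, depending on dataset
    "beam_search_size": 10,              # used at inference time
}

def warmup_steps(total_train_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps implied by a warmup ratio (illustrative helper)."""
    return int(total_train_steps * warmup_ratio)
```

For example, with 10,000 total training steps and the 0.05 warmup ratio, the scheduler would warm up over the first 500 steps before cosine annealing begins.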