G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
Authors: Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing HONG, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Utilizing the Geo170K dataset, we introduce G-LLaVA, a model that demonstrates exceptional performance in solving geometric problems. It significantly outperforms GPT4-V on the geometry task of the MathVista benchmark with only 7B parameters. ...We evaluate G-LLaVA on the geometry problem solving (GPS) task (testmini split) of MathVista (Lu et al., 2023) and the test set of GeoQA. ...Main Experiments. We compare MLLMs on the testmini split of the MathVista (Lu et al., 2023) benchmark in Table 8. |
| Researcher Affiliation | Collaboration | Jiahui Gao1,2, Renjie Pi3, Jipeng Zhang3, Jiacheng Ye2, Wanjun Zhong1, Yufei Wang1, Lanqing Hong1, Jianhua Han1, Hang Xu1, Zhenguo Li1, Lingpeng Kong2 1Noah's Ark Lab 2The University of Hong Kong 3The Hong Kong University of Science and Technology EMAIL, EMAIL |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It describes methodologies in narrative text and uses figures to illustrate concepts, but no formal algorithm listings. |
| Open Source Code | Yes | Our code, data, and models are publicly accessible at https://github.com/pipilurj/G-LLaVA. |
| Open Datasets | Yes | This dataset, named Geo170K, contains more than 170K geometric image-caption and question-answer pairs. ...Our code, data, and models are publicly accessible at https://github.com/pipilurj/G-LLaVA. |
| Dataset Splits | Yes | We evaluate G-LLaVA on the geometry problem solving (GPS) task (testmini split) of MathVista (Lu et al., 2023) and the test set of GeoQA. ...More details of the data split on GeoQA and GeoQA+ are listed in Table 16. Table 16 (train / validation / test): GeoQA+ (Cao and Xiao, 2022): 6027 / 745 / 754; GeoQA (Chen et al., 2021): 3499 / 745 / 754. |
| Hardware Specification | Yes | For training G-LLa VA-7B, each run requires 10 hours on 8 A40 GPUs (48G of memory). |
| Software Dependencies | Yes | We employ ChatGPT (gpt-3.5-turbo-0613) for data generation. ...The LLM part of G-LLaVA utilizes LLaMA-2 (Touvron et al., 2023) as the language model and employs the pretrained vision transformer of Radford et al. (2021) as the vision encoder. We conduct experiments with both 7B and 13B LLMs. |
| Experiment Setup | Yes | During training, the learning rate is set to 3e-5. We expand the images into squares during training, where the extended background color is set to white. For image augmentation, we set the maximum translation distance to 0.25 of the length of the longer side. If not otherwise specified, the models are trained for 1 epoch for cross-modal alignment and 2 epochs for instruction tuning, respectively. The batch sizes are set to 6 and 32 per GPU, respectively. |
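The image preprocessing quoted above (padding images into white-background squares, then random translation by up to 0.25 of the longer side) can be sketched as follows. This is a minimal illustration, not the authors' released code; the function names and the numpy-array image representation are assumptions.

```python
import numpy as np

def pad_to_square(img: np.ndarray, fill: int = 255) -> np.ndarray:
    """Expand an HxWxC image into a square, centered on a white canvas."""
    h, w = img.shape[:2]
    side = max(h, w)
    canvas = np.full((side, side, img.shape[2]), fill, dtype=img.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = img
    return canvas

def random_translate(img: np.ndarray, rng: np.random.Generator,
                     max_frac: float = 0.25, fill: int = 255) -> np.ndarray:
    """Shift the image by up to max_frac of its longer side, filling with white."""
    h, w = img.shape[:2]
    max_shift = int(max_frac * max(h, w))
    dy = int(rng.integers(-max_shift, max_shift + 1))
    dx = int(rng.integers(-max_shift, max_shift + 1))
    canvas = np.full_like(img, fill)
    # Copy only the region that remains inside the frame after the shift.
    ys, ye = max(dy, 0), min(h + dy, h)
    xs, xe = max(dx, 0), min(w + dx, w)
    canvas[ys:ye, xs:xe] = img[ys - dy:ye - dy, xs - dx:xe - dx]
    return canvas
```

A training pipeline would apply `pad_to_square` first, then `random_translate` per sample; the quoted setup does not specify the order or interpolation details, so those remain assumptions here.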