Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning
Authors: Wenwen Zhuang, Xin Huang, Xiantao Zhang, Jin Zeng
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on multiple mathematical reasoning benchmarks demonstrate that the MLLMs trained with Math-PUMA surpass most open-source MLLMs. Our approach effectively narrows the performance gap for problems presented in different modalities. |
| Researcher Affiliation | Academia | 1. University of Chinese Academy of Sciences, 2. Beijing Institute of Technology, 3. Beijing University of Aeronautics and Astronautics |
| Pseudocode | No | The paper describes the methodology and training stages in text and provides mathematical equations for loss calculation, but it does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/wwzhuang01/Math-PUMA |
| Open Datasets | Yes | We curate a large-scale dataset, Math-PUMA-1M, which comprises 692K data pairs and 996K multimodal mathematical data. This dataset serves as a valuable resource for model training. ... Specifically, we focus on enriching the geometric problem subset within MathV360K, expanding it from 40K to 120K in order to address the scarcity of geometric data. Furthermore, as referenced in (Lu et al. 2024a), we incorporate a balanced amount of textual data to mitigate potential modality imbalances and enhance the model's overall performance. |
| Dataset Splits | Yes | We conduct extensive experiments on three popular multimodal mathematical problem-solving benchmarks: MATHVERSE (Zhang et al. 2024a), MATHVISTA (Lu et al. 2024b), and WE-MATH (Qiao et al. 2024). ... For MATHVERSE and MATHVISTA, initially, we use GPT-4o-mini (OpenAI 2024a) to extract answers from the responses generated by MLLMs. Subsequently, we employ GPT-4o-mini once more to verify the correctness of the extracted answers. |
| Hardware Specification | Yes | Our experiments are conducted using PyTorch version 2.1.0 and CUDA 12.1, utilizing 32 NVIDIA A100 GPUs with 80GB memory each. |
| Software Dependencies | Yes | Our experiments are conducted using PyTorch version 2.1.0 and CUDA 12.1, utilizing 32 NVIDIA A100 GPUs with 80GB memory each. We employ the AdamW optimizer (Kingma and Ba 2014), configured with β1 = 0.9 and β2 = 0.999. |
| Experiment Setup | Yes | The learning rate is adjusted across three stages: 3e-5 for stage 1, 5e-5 for stage 2, and 3e-5 for stage 3. A cosine learning rate schedule is implemented with a warm-up phase covering 2% of the total training steps. Additionally, a decay rate of 0.1 is applied. The KL divergence is controlled using specific hyperparameters: αKL is set to 0.2, τ to 1.0, and λKL to 0.1. The training is conducted over 1 epoch. The batch sizes for three stages are 256, 512, and 256, respectively. |
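The learning-rate schedule and KL-divergence hyperparameters quoted in the Experiment Setup row can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names (`cosine_lr`, `total_loss`) are hypothetical, and the exact way αKL combines the KL terms is not quoted in the evidence, so it is omitted here; only the 2% warm-up, cosine decay, τ = 1.0, and λKL = 0.1 come from the paper.

```python
import math

def cosine_lr(step, total_steps, peak_lr, warmup_frac=0.02):
    """Cosine schedule with linear warm-up over 2% of total steps,
    decaying from the stage's peak LR (3e-5 / 5e-5 / 3e-5)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax (tau = 1.0 in the paper)."""
    exps = [math.exp(x / tau) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_loss(task_loss, teacher_logits, student_logits,
               tau=1.0, lambda_kl=0.1):
    """Task loss plus the KL alignment term weighted by lambda_kl = 0.1.
    A one-directional KL is used here for simplicity."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return task_loss + lambda_kl * kl_divergence(p, q)
```

For example, with `total_steps=1000` and `peak_lr=3e-5`, the rate ramps linearly for the first 20 steps, reaches the peak, then decays along a half-cosine toward zero; when teacher and student logits coincide, the KL term vanishes and `total_loss` reduces to the task loss.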