Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning
Authors: Wenwen Zhuang, Xin Huang, Xiantao Zhang, Jin Zeng
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on multiple mathematical reasoning benchmarks demonstrate that the MLLMs trained with Math-PUMA surpass most open-source MLLMs. Our approach effectively narrows the performance gap for problems presented in different modalities. |
| Researcher Affiliation | Academia | 1. University of Chinese Academy of Sciences, 2. Beijing Institute of Technology, 3. Beijing University of Aeronautics and Astronautics |
| Pseudocode | No | The paper describes the methodology and training stages in text and provides mathematical equations for loss calculation, but it does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/wwzhuang01/Math-PUMA |
| Open Datasets | Yes | We curate a large-scale dataset, Math-PUMA-1M, which comprises 692K data pairs and 996K multimodal mathematical data. This dataset serves as a valuable resource for model training. ... Specifically, we focus on enriching the geometric problem subset within MathV360K, expanding it from 40K to 120K in order to address the scarcity of geometric data. Furthermore, as referenced in (Lu et al. 2024a), we incorporate a balanced amount of textual data to mitigate potential modality imbalances and enhance the model's overall performance. |
| Dataset Splits | Yes | We conduct extensive experiments on three popular multimodal mathematical problem-solving benchmarks: MATHVERSE (Zhang et al. 2024a), MATHVISTA (Lu et al. 2024b), and WE-MATH (Qiao et al. 2024). ... For MATHVERSE and MATHVISTA, initially, we use GPT-4o-mini (OpenAI 2024a) to extract answers from the responses generated by MLLMs. Subsequently, we employ GPT-4o-mini once more to verify the correctness of the extracted answers. |
| Hardware Specification | Yes | Our experiments are conducted using PyTorch version 2.1.0 and CUDA 12.1, utilizing 32 NVIDIA A100 GPUs with 80GB memory each. |
| Software Dependencies | Yes | Our experiments are conducted using PyTorch version 2.1.0 and CUDA 12.1, utilizing 32 NVIDIA A100 GPUs with 80GB memory each. We employ the AdamW optimizer (Kingma and Ba 2014), configured with β1 = 0.9 and β2 = 0.999. |
| Experiment Setup | Yes | The learning rate is adjusted across three stages: 3e-5 for stage 1, 5e-5 for stage 2, and 3e-5 for stage 3. A cosine learning rate schedule is implemented with a warm-up phase covering 2% of the total training steps. Additionally, a decay rate of 0.1 is applied. The KL divergence is controlled using specific hyperparameters: αKL is set to 0.2, τ to 1.0, and λKL to 0.1. The training is conducted over 1 epoch. The batch sizes for three stages are 256, 512, and 256, respectively. |
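The learning-rate schedule and KL-divergence hyperparameters quoted in the Experiment Setup row can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names (`cosine_lr`, `total_loss`) are hypothetical, and the exact way αKL combines the KL terms is not quoted in the evidence, so it is omitted here; only the 2% warm-up, cosine decay, τ = 1.0, and λKL = 0.1 come from the paper.

```python
import math

def cosine_lr(step, total_steps, peak_lr, warmup_frac=0.02):
    """Cosine schedule with linear warm-up over 2% of total steps,
    decaying from the stage's peak LR (3e-5 / 5e-5 / 3e-5)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax (tau = 1.0 in the paper)."""
    exps = [math.exp(x / tau) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_loss(task_loss, teacher_logits, student_logits,
               tau=1.0, lambda_kl=0.1):
    """Task loss plus the KL alignment term weighted by lambda_kl = 0.1.
    A one-directional KL is used here for simplicity."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return task_loss + lambda_kl * kl_divergence(p, q)
```

For example, with `total_steps=1000` and `peak_lr=3e-5`, the rate ramps linearly for the first 20 steps, reaches the peak, then decays along a half-cosine toward zero; when teacher and student logits coincide, the KL term vanishes and `total_loss` reduces to the task loss.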