MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine
Authors: Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Ziyu Guo, Yichi Zhang, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Shanghang Zhang, Gao Peng, Hongsheng Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On various mathematical benchmarks, our MAVIS-7B achieves leading results among open-source MLLMs, e.g., surpassing other 7B models by +9.3% and the second-best LLaVA-NeXT (110B) by +6.9%, demonstrating the effectiveness of our method. Data and models are released at https://github.com/ZrrSkywalker/MAVIS. ... We evaluate our model MAVIS-7B on several popular mathematical benchmarks, MathVerse (Zhang et al., 2024b), GeoQA (Chen et al., 2021c), FunctionQA (function problems in MathVista (Lu et al., 2023)), MMMU-Math (the math problems in MMMU (Yue et al., 2023a)), MATH-Vision (Wang et al., 2024b), three mathematical categories in MathVista, and We-Math (Qiao et al., 2024). We compare a variety of existing MLLMs... |
| Researcher Affiliation | Academia | 1 CUHK MMLab, 2 MiuLar Lab, 3 Peking University, 4 Shanghai AI Laboratory, 5 CPII under InnoHK |
| Pseudocode | No | The paper describes the data generation process and training pipeline in natural language and flowcharts (Figure 2), and uses mathematical formulations (Equations 1-3 in Section A.4.1), but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Data and models are released at https://github.com/ZrrSkywalker/MAVIS. |
| Open Datasets | Yes | With this approach, we curate two datasets, MAVIS-Caption (558K diagram-caption pairs) and MAVIS-Instruct (834K visual math problems with CoT rationales), and propose four progressive stages for training MLLMs from scratch. ... Data and models are released at https://github.com/ZrrSkywalker/MAVIS. |
| Dataset Splits | Yes | We evaluate our model MAVIS-7B on several popular mathematical benchmarks, MathVerse (Zhang et al., 2024b), GeoQA (Chen et al., 2021c), FunctionQA (function problems in MathVista (Lu et al., 2023)), MMMU-Math (the math problems in MMMU (Yue et al., 2023a)), MATH-Vision (Wang et al., 2024b), three mathematical categories in MathVista, and We-Math (Qiao et al., 2024). ... we conduct an ablation study on the 834K MAVIS-Instruct dataset by randomly sampling 25%, 50%, and 75% of the data for instruction tuning, excluding the DPO stage. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware details such as GPU or CPU models used for running the experiments. |
| Software Dependencies | No | The logic of the data engine is implemented in Python, and we employ Matplotlib for the graphical rendering of the diagrams. However, specific version numbers for Python, Matplotlib, or other software libraries are not provided. |
| Experiment Setup | Yes | In the first stage, we fine-tune the CLIP for 10 epochs with a batch size 16 and an initial learning rate 2e-6. In the second stage, we train the diagram-language alignment for 1 epoch with a batch size 32 and an initial learning rate 2e-6, and adopt LoRA (Hu et al., 2021) with a rank 128. In the third and fourth stages, we adopt the same training settings as the second one. |
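The "Software Dependencies" row quotes the paper as implementing its automatic data engine in Python (with Matplotlib for diagram rendering). As a purely illustrative sketch of what the textual side of such an engine could look like, the snippet below generates one synthetic function-QA pair with a caption, question, CoT-style rationale, and ground-truth answer. All names, the quadratic template, and the field layout are hypothetical and not taken from the paper; diagram rendering is omitted to keep the example self-contained.

```python
import random


def make_function_problem(seed: int) -> dict:
    """Sample a quadratic y = a*x^2 + b*x + c and ask for y at a point.

    Hypothetical template, for illustration only: the real MAVIS data
    engine covers many diagram types and renders images via Matplotlib.
    """
    rng = random.Random(seed)          # seeded so generation is reproducible
    a = rng.choice([-3, -2, -1, 1, 2, 3])   # avoid a == 0 (degenerate parabola)
    b, c = rng.randint(-5, 5), rng.randint(-5, 5)
    x = rng.randint(-4, 4)
    answer = a * x * x + b * x + c

    caption = f"The diagram shows the parabola y = {a}x^2 + {b}x + {c}."
    question = f"What is the value of y when x = {x}?"
    rationale = (
        f"Substitute x = {x}: y = {a}*({x})^2 + {b}*({x}) + {c} "
        f"= {a * x * x} + {b * x} + {c} = {answer}."
    )
    return {
        "params": (a, b, c, x),
        "caption": caption,
        "question": question,
        "rationale": rationale,
        "answer": answer,
    }


sample = make_function_problem(seed=0)
print(sample["question"], "->", sample["answer"])
```

Because every numeric field is derived from the sampled parameters, the answer and rationale are correct by construction; this is the property that lets an engine of this kind emit large instruction datasets without human annotation.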