MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine

Authors: Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Ziyu Guo, Yichi Zhang, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Shanghang Zhang, Gao Peng, Hongsheng Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On various mathematical benchmarks, our MAVIS-7B achieves leading results among open-source MLLMs, e.g., surpassing other 7B models by +9.3% and the second-best LLaVA-NeXT (110B) by +6.9%, demonstrating the effectiveness of our method. Data and models are released at https://github.com/ZrrSkywalker/MAVIS. ... We evaluate our model MAVIS-7B on several popular mathematical benchmarks, MathVerse (Zhang et al., 2024b), GeoQA (Chen et al., 2021c), FunctionQA (function problems in MathVista (Lu et al., 2023)), MMMU-Math (the math problems in MMMU (Yue et al., 2023a)), MathVision (Wang et al., 2024b), three mathematical categories in MathVista, and We-Math (Qiao et al., 2024). We compare a variety of existing MLLMs...
Researcher Affiliation | Academia | 1 CUHK MMLab & 2 MiuLar Lab, 3 Peking University, 4 Shanghai AI Laboratory, 5 CPII under InnoHK. EMAIL, EMAIL
Pseudocode | No | The paper describes the data generation process and training pipeline in natural language and flowcharts (Figure 2), and uses mathematical formulations (Equations 1-3 in Section A.4.1), but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Data and models are released at https://github.com/ZrrSkywalker/MAVIS.
Open Datasets | Yes | With this approach, we curate two datasets, MAVIS-Caption (558K diagram-caption pairs) and MAVIS-Instruct (834K visual math problems with CoT rationales), and propose four progressive stages for training MLLMs from scratch. ... Data and models are released at https://github.com/ZrrSkywalker/MAVIS.
Dataset Splits | Yes | We evaluate our model MAVIS-7B on several popular mathematical benchmarks, MathVerse (Zhang et al., 2024b), GeoQA (Chen et al., 2021c), FunctionQA (function problems in MathVista (Lu et al., 2023)), MMMU-Math (the math problems in MMMU (Yue et al., 2023a)), MathVision (Wang et al., 2024b), three mathematical categories in MathVista, and We-Math (Qiao et al., 2024). ... we conduct an ablation study on the 834K MAVIS-Instruct dataset by randomly sampling 25%, 50%, and 75% of the data for instruction tuning, excluding the DPO stage.
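The ablation quoted above draws random 25% / 50% / 75% subsets of MAVIS-Instruct for instruction tuning. A minimal sketch of such reproducible subsampling (the function name, seed, and placeholder dataset are our own, not from the MAVIS codebase):

```python
# Sketch of random fractional subsampling as described in the ablation.
# The dataset here is a stand-in list of indices, not the real 834K examples.
import random

def sample_fraction(dataset, fraction, seed=0):
    """Return a reproducible random subset covering `fraction` of the data."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    k = int(len(dataset) * fraction)
    return rng.sample(dataset, k)

full = list(range(834_000))  # placeholder for the 834K instruction examples
subset = sample_fraction(full, 0.25)
```

Fixing the seed lets each fraction be re-drawn identically across runs, which matters when comparing the 25%/50%/75% tuning results.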
Hardware Specification | No | The paper does not explicitly mention specific hardware details such as GPU or CPU models used for running the experiments.
Software Dependencies | No | The logic of the data engine is implemented in Python, and we employ Matplotlib for the graphical rendering of the diagrams. However, specific version numbers for Python, Matplotlib, or other software libraries are not provided.
Experiment Setup | Yes | In the first stage, we fine-tune the CLIP for 10 epochs with a batch size 16 and an initial learning rate 2e-6. In the second stage, we train the diagram-language alignment for 1 epoch with a batch size 32 and an initial learning rate 2e-6, and adopt LoRA (Hu et al., 2021) with a rank 128. In the third and fourth stages, we adopt the same training settings as the second one.
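The staged settings quoted above can be collected into a single configuration sketch (the key names are our own shorthand, and labeling stage 4 as DPO follows the paper's pipeline description; the stage-3/4 entries simply copy stage 2 as stated):

```python
# Per-stage training settings quoted from the paper, gathered into one dict.
# Field names are illustrative shorthand, not identifiers from the MAVIS code.
TRAINING_STAGES = {
    "stage1_clip_finetune": {
        "epochs": 10, "batch_size": 16, "lr": 2e-6, "lora": None,
    },
    "stage2_diagram_language_alignment": {
        "epochs": 1, "batch_size": 32, "lr": 2e-6, "lora": {"rank": 128},
    },
}
# Stages 3 and 4 adopt the same settings as stage 2, per the paper.
TRAINING_STAGES["stage3_instruction_tuning"] = dict(
    TRAINING_STAGES["stage2_diagram_language_alignment"])
TRAINING_STAGES["stage4_dpo"] = dict(
    TRAINING_STAGES["stage2_diagram_language_alignment"])
```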