MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Authors: Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development. |
| Researcher Affiliation | Industry | Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang. Apple |
| Pseudocode | No | The paper describes methods and training recipes but does not include any explicitly labeled pseudocode or algorithm blocks. For example, Figure 1 presents a 'Recipe for building MM1.5' but it is a diagram, not pseudocode. |
| Open Source Code | No | All models are trained using the AXLearn framework https://github.com/apple/axlearn. This refers to the framework used for training, not the specific implementation code for MM1.5 itself. |
| Open Datasets | Yes | LLaVA 1.5 conversation Liu et al. (2023a) (56.7k); LLaVA v1.5 VQAv2, OKVQA Marino et al. (2019); Liu et al. (2023a) (91.8k); COCO Captions Chen et al. (2015) (82.8k); TextCaps Sidorov et al. (2020) (22k) |
| Dataset Splits | Yes | Evaluation benchmarks. We group our benchmarks into categories based on what capabilities a benchmark primarily measures. Our benchmark groups include general, text-rich, refer&ground, knowledge, and multi-image. See Table 6 in Appendix A.4 for more details. We propose Category Average Score, the average score of all benchmark numbers for each sub-category, to represent the average performance on that capability. We focus on the categories of general, text-rich, and knowledge, as these capabilities are widely considered essential for MLLMs. To evaluate a model's impact on these capabilities, we refer to a MMBase score, defined as the average scores on general, text-rich, and knowledge categories. Details of the evaluation metrics are provided in Appendix A.4. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU models or CPU processors. |
| Software Dependencies | No | All models are trained using the AXLearn framework https://github.com/apple/axlearn. This mentions a framework but does not provide specific version numbers for other key software components or libraries. |
| Experiment Setup | Yes | For both continual pre-training and SFT, we set the batch size as 256. We use the AdaFactor optimizer with a peak learning rate of 1e-5 and a cosine decay to 0. For continual pre-training, we train a maximum of 30k steps. During SFT, all models are optimized for one epoch. For pre-training, we follow the exact same learning rate schedule as in MM1 (McKinzie et al., 2024) and 200k training steps with sequence length 4096. For continual pre-training, we use a peak learning rate of 1e-5 with the cosine decay and 30k training steps for all the models (from 1B to 30B). For SFT, we use a peak learning rate of 2e-5 and 23k training steps for all the models. |
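The quoted SFT schedule (peak learning rate 2e-5, cosine decay to 0 over 23k steps) can be sketched as a standalone function. This is not the authors' code; the warmup length is an assumption, since the paper does not state one.

```python
import math

# Hyperparameters quoted in the Experiment Setup row (SFT recipe).
PEAK_LR = 2e-5
TOTAL_STEPS = 23_000
WARMUP_STEPS = 1_000  # assumed value; warmup length is not given in the paper


def lr_at(step: int) -> float:
    """Learning rate at a training step: linear warmup, then cosine decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

The same shape applies to the continual pre-training recipe with a peak of 1e-5 and 30k steps.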