MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Authors: Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development. |
| Researcher Affiliation | Industry | Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang. Apple |
| Pseudocode | No | The paper describes methods and training recipes but does not include any explicitly labeled pseudocode or algorithm blocks. For example, Figure 1 presents a 'Recipe for building MM1.5' but it is a diagram, not pseudocode. |
| Open Source Code | No | All models are trained using the AXLearn framework https://github.com/apple/axlearn. This refers to the framework used for training, not the specific implementation code for MM1.5 itself. |
| Open Datasets | Yes | LLaVA 1.5 conversation Liu et al. (2023a) (56.7k); LLaVA v1.5 VQAv2, OKVQA Marino et al. (2019); Liu et al. (2023a) (91.8k); COCO Captions Chen et al. (2015) (82.8k); TextCaps Sidorov et al. (2020) (22k) |
| Dataset Splits | Yes | Evaluation benchmarks. We group our benchmarks into categories based on what capabilities a benchmark primarily measures. Our benchmark groups include general, text-rich, refer&ground, knowledge, and multi-image. See Table 6 in Appendix A.4 for more details. We propose Category Average Score, the average score of all benchmark numbers for each sub-category, to represent the average performance on that capability. We focus on the categories of general, text-rich, and knowledge, as these capabilities are widely considered essential for MLLMs. To evaluate a model's impact on these capabilities, we refer to a MMBase score, defined as the average scores on general, text-rich, and knowledge categories. Details of the evaluation metrics are provided in Appendix A.4. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU models or CPU processors. |
| Software Dependencies | No | All models are trained using the AXLearn framework https://github.com/apple/axlearn. This mentions a framework but does not provide specific version numbers for other key software components or libraries. |
| Experiment Setup | Yes | For both continual pre-training and SFT, we set the batch size as 256. We use the AdaFactor optimizer with a peak learning rate of 1e-5 and a cosine decay to 0. For continual pre-training, we train a maximum of 30k steps. During SFT, all models are optimized for one epoch. For pre-training, we follow the exact same learning rate schedule as in MM1 (McKinzie et al., 2024) and 200k training steps with sequence length 4096. For continual pre-training, we use a peak learning rate of 1e-5 with the cosine decay and 30k training steps for all the models (from 1B to 30B). For SFT, we use a peak learning rate of 2e-5 and 23k training steps for all the models. |
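The quoted SFT schedule (peak learning rate 2e-5, cosine decay to 0 over 23k steps) can be sketched as a standalone function. This is not the authors' code; the warmup length is an assumption, since the paper does not state one.

```python
import math

# Hyperparameters quoted in the Experiment Setup row (SFT recipe).
PEAK_LR = 2e-5
TOTAL_STEPS = 23_000
WARMUP_STEPS = 1_000  # assumed value; warmup length is not given in the paper


def lr_at(step: int) -> float:
    """Learning rate at a training step: linear warmup, then cosine decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

The same shape applies to the continual pre-training recipe with a peak of 1e-5 and 30k steps.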