Diving into Self-Evolving Training for Multimodal Reasoning

Authors: Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results on 5 different multimodal reasoning benchmarks, including MathVista, M3CoT, MMStar, MMBench and AI2D, show that this strategy, which incorporates both optimized static design choices and dynamic adjustments, effectively mitigates exploration loss during training and enhances performance universally for models with varied sizes such as MiniCPM-V-2.5 (8B), Phi-3.5-Vision (4B) and InternVL2 (2B).
Researcher Affiliation | Collaboration | 1 The Hong Kong University of Science and Technology, 2 Shanghai Jiao Tong University, 3 Helixon Research, 4 The Chinese University of Hong Kong.
Pseudocode | No | The paper describes the M-STAR algorithm and its components through textual explanations and flowcharts (Figure 1), but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | All resources are made publicly available at https://mstar-lmm.github.io.
Open Datasets | Yes | Datasets: We utilize MathV360K (Shi et al., 2024), a high-quality and diverse multimodal reasoning dataset, as our seed training dataset. ... For our out-of-domain (OOD) testset we use the testmini split of MathVista (Lu et al., 2023), a comprehensive benchmark encompassing a wide range of multimodal reasoning tasks... M3CoT (Chen et al., 2024b), MMStar (Chen et al., 2024a), MMBench (Dev set, v1.1) (Liu et al., 2025), AI2D (Kembhavi et al., 2016).
Dataset Splits | Yes | Specifically, we downsample half of the examples (180K) from it to serve as our labeled training set, while setting aside the remaining half as an unlabeled training set by not using the answers in it. For evaluation, we split 750 samples from the unlabeled part of MathV360K as the in-domain (ID) testset. For our out-of-domain (OOD) testset we use the testmini split of MathVista (Lu et al., 2023)... We also keep a non-overlapping 250 samples from MathV360K as the global validation set in training.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used for running the experiments. It mentions various LMMs (MiniCPM-V-2.5, Phi-3.5-Vision, InternVL2) and computational costs, but no hardware specifications.
Software Dependencies | No | The paper mentions several models and frameworks like "LLaMA-3-8B", "SigLIP", "CLIP ViT-L/14" and "InternLM-2-Chat" and cites the corresponding papers. However, it does not provide specific version numbers for software libraries, programming languages, or operating systems that would be necessary to replicate the experiments.
Experiment Setup | Yes | We adopt most of the training settings from Yao et al. (2024) (see Appendix C), using a constant learning rate of 1e-6 and training for 10K steps across all experiments. During all rollout phases in training, we sample 16 responses per query and set the sampling temperature to 1.0. ... We follow the training setup from Yao et al. (2024), using a learning rate of 1e-6 and a batch size of 128. A constant learning rate scheduler with a warmup ratio of 0.1 is applied.
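Since the paper itself provides no pseudocode (see the Pseudocode row above), one iteration of a generic STaR-style self-evolving loop can be sketched as follows. This is a hedged reconstruction from the paper's prose, not the authors' M-STAR implementation: the function names (`self_evolve_step`, `toy_sampler`) and the exact-match answer filter are our assumptions.

```python
# Illustrative sketch of one generic self-evolving (STaR-style) iteration:
# sample several responses per query, keep those whose final answer matches
# the reference, and reuse them as fine-tuning data for the next round.
# This is a reconstruction under assumptions, not the authors' M-STAR code.

def self_evolve_step(queries, references, sample_fn, k=16):
    """Collect correct self-generated responses for the next SFT round.

    sample_fn(query, k) must return a list of (response_text, final_answer)
    tuples, standing in for k rollouts from the multimodal model.
    """
    new_training_data = []
    for query, ref in zip(queries, references):
        for response, answer in sample_fn(query, k):
            if answer == ref:  # keep only rollouts with the correct answer
                new_training_data.append((query, response))
    return new_training_data

# Toy sampler standing in for an LMM rollout (hypothetical):
def toy_sampler(query, k):
    return [(f"{query} -> reasoning #{i}", query.upper()) for i in range(k)]

data = self_evolve_step(["q1", "q2"], ["Q1", "x"], toy_sampler, k=4)
# q1's rollouts match the reference answer and are kept; q2's never do.
```

In a real run the filtered pairs would be appended to the training pool and the model fine-tuned before the next rollout phase; the paper's "dynamic adjustments" for exploration loss are not modeled here.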
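The split sizes quoted in the Dataset Splits row (180K labeled, 180K unlabeled, 750 ID test, 250 validation) can be sanity-checked with a small index-partition sketch. The paper only gives the sizes; the random shuffling, the carving of both held-out sets from the unlabeled half, and the function name `split_mathv360k` are our assumptions.

```python
import random

def split_mathv360k(n_total=360_000, n_id_test=750, n_val=250, seed=0):
    """Partition indices per the described split: half labeled, half set
    aside as unlabeled; an ID test set and a validation set are carved out
    of the unlabeled half. Index arithmetic is ours, sizes are the paper's."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    labeled = idx[: n_total // 2]                # 180K labeled training set
    rest = idx[n_total // 2 :]                   # remaining 180K examples
    id_test = rest[:n_id_test]                   # 750 in-domain test samples
    val = rest[n_id_test : n_id_test + n_val]    # 250 validation samples
    unlabeled = rest[n_id_test + n_val :]        # answers unused in training
    return labeled, unlabeled, id_test, val
```

By construction the four index sets are disjoint, matching the "non-overlapping" validation set described in the quote.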
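The schedule quoted in the Experiment Setup row (constant learning rate of 1e-6 over 10K steps with a warmup ratio of 0.1) implies a linear ramp over the first 1,000 steps followed by a flat rate. A minimal sketch of that schedule, assuming linear warmup starting from zero (the exact warmup shape is not stated in the quote):

```python
def lr_at_step(step, total_steps=10_000, base_lr=1e-6, warmup_ratio=0.1):
    """Constant schedule with linear warmup, matching the quoted setup:
    the rate ramps linearly to base_lr over the first warmup_ratio of
    training steps, then stays constant for the remainder."""
    warmup_steps = int(total_steps * warmup_ratio)  # 1,000 steps here
    if step < warmup_steps:
        return base_lr * step / warmup_steps        # linear ramp from 0
    return base_lr                                  # constant thereafter
```

For example, `lr_at_step(500)` is halfway through warmup and yields half the base rate, while any step from 1,000 onward returns the full 1e-6.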