Diving into Self-Evolving Training for Multimodal Reasoning
Authors: Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results on 5 different multimodal reasoning benchmarks, including MathVista, M3CoT, MMStar, MMBench and AI2D, show that this strategy, which incorporates both optimized static design choices and dynamic adjustments, effectively mitigates exploration loss during training and enhances performance universally for models with varied sizes such as MiniCPM-V-2.5 (8B), Phi-3.5-Vision (4B) and InternVL2 (2B). |
| Researcher Affiliation | Collaboration | ¹The Hong Kong University of Science and Technology, ²Shanghai Jiao Tong University, ³Helixon Research, ⁴The Chinese University of Hong Kong. |
| Pseudocode | No | The paper describes the M-STAR algorithm and its components through textual explanations and flowcharts (Figure 1), but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | All resources are made publicly available at https://mstar-lmm.github.io. |
| Open Datasets | Yes | Datasets We utilize MathV360K (Shi et al., 2024), a high-quality and diverse multimodal reasoning dataset, as our seed training dataset. ... For our out-of-domain (OOD) testset we use the testmini split of MathVista (Lu et al., 2023), a comprehensive benchmark encompassing a wide range of multimodal reasoning tasks... M3CoT (Chen et al., 2024b), MMStar (Chen et al., 2024a), MMBench (Dev set, v1.1) (Liu et al., 2025), AI2D (Kembhavi et al., 2016). |
| Dataset Splits | Yes | Specifically, we downsample half of the examples (180K) from it to serve as our labeled training set, while setting aside the remaining half as an unlabeled training set by not using the answers in it. For evaluation, we split 750 samples from the unlabeled part of MathV360K as the in-domain (ID) testset. For our out-of-domain (OOD) testset we use the testmini split of MathVista (Lu et al., 2023)... We also keep a non-overlapping 250 samples from MathV360K as the global validation set in training. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used for running the experiments. It mentions various LMMs (MiniCPM-V-2.5, Phi-3.5-Vision, InternVL2) and computational costs, but no hardware specifications. |
| Software Dependencies | No | The paper mentions several models and frameworks like "LLaMA-3-8B", "SigLIP", "CLIP ViT-L/14", "InternLM-2-Chat" and cites the corresponding papers. However, it does not provide specific version numbers for software libraries, programming languages, or operating systems that would be necessary to replicate the experiments. |
| Experiment Setup | Yes | We adopt most of the training settings from Yao et al. (2024) (see Appendix C), using a constant learning rate of 1e-6 and training for 10K steps across all experiments. During all rollout phases in training, we sample 16 responses per query and set the sampling temperature to 1.0. ... We follow the training setup from Yao et al. (2024), using a learning rate of 1e-6 and a batch size of 128. A constant learning rate scheduler with a warmup ratio of 0.1 is applied. |
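The data splits and hyperparameters extracted above can be collected into a short configuration sketch. This is a hypothetical summary for a replication attempt, not code from the paper's release; all identifiers are illustrative.

```python
# Hypothetical replication config summarizing the paper's reported splits
# and training settings (MathV360K, M-STAR). Names are illustrative only.

dataset_splits = {
    "labeled_train": 180_000,    # half of MathV360K, answers retained
    "unlabeled_train": 180_000,  # remaining half, answers withheld
    "id_test": 750,              # in-domain test, drawn from the unlabeled part
    "validation": 250,           # non-overlapping global validation set
}

training_config = {
    "learning_rate": 1e-6,            # constant schedule
    "lr_warmup_ratio": 0.1,
    "batch_size": 128,
    "total_steps": 10_000,
    "rollouts_per_query": 16,         # responses sampled per query
    "sampling_temperature": 1.0,
}

# OOD evaluation uses the testmini split of MathVista plus
# M3CoT, MMStar, MMBench (Dev, v1.1), and AI2D.
ood_benchmarks = ["MathVista", "M3CoT", "MMStar", "MMBench", "AI2D"]
```

Note that the 750 ID test samples come from the unlabeled half, so an actual replication would need to exclude them from the unlabeled pool used for self-evolving rollouts.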