Diving into Self-Evolving Training for Multimodal Reasoning

Authors: Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results on 5 different multimodal reasoning benchmarks, including MathVista, M3CoT, MMStar, MMBench and AI2D, show that this strategy, which incorporates both optimized static design choices and dynamic adjustments, effectively mitigates exploration loss during training and enhances performance universally for models with varied sizes such as MiniCPM-V-2.5 (8B), Phi-3.5-Vision (4B) and InternVL2 (2B).
Researcher Affiliation | Collaboration | 1 The Hong Kong University of Science and Technology, 2 Shanghai Jiao Tong University, 3 Helixon Research, 4 The Chinese University of Hong Kong.
Pseudocode | No | The paper describes the M-STAR algorithm and its components through textual explanations and flowcharts (Figure 1), but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | All resources are made publicly available at https://mstar-lmm.github.io.
Open Datasets | Yes | Datasets: We utilize MathV360K (Shi et al., 2024), a high-quality and diverse multimodal reasoning dataset, as our seed training dataset. ... For our out-of-domain (OOD) testset we use the testmini split of MathVista (Lu et al., 2023), a comprehensive benchmark encompassing a wide range of multimodal reasoning tasks... M3CoT (Chen et al., 2024b), MMStar (Chen et al., 2024a), MMBench (Dev set, v1.1) (Liu et al., 2025), AI2D (Kembhavi et al., 2016).
Dataset Splits | Yes | Specifically, we downsample half of the examples (180K) from it to serve as our labeled training set, while setting aside the remaining half as an unlabeled training set by not using the answers in it. For evaluation, we split 750 samples from the unlabeled part of MathV360K as the in-domain (ID) testset. For our out-of-domain (OOD) testset we use the testmini split of MathVista (Lu et al., 2023)... We also keep a non-overlapping 250 samples from MathV360K as the global validation set in training.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used for running the experiments. It mentions various LMMs (MiniCPM-V-2.5, Phi-3.5-Vision, InternVL2) and computational costs, but no hardware specifications.
Software Dependencies | No | The paper mentions several models and frameworks like "LLaMA-3-8B", "SigLIP", "CLIP ViT-L/14" and "InternLM-2-Chat" and cites the corresponding papers. However, it does not provide specific version numbers for software libraries, programming languages, or operating systems that would be necessary to replicate the experiments.
Experiment Setup | Yes | We adopt most of the training settings from Yao et al. (2024) (see Appendix C), using a constant learning rate of 1e-6 and training for 10K steps across all experiments. During all rollout phases in training, we sample 16 responses per query and set the sampling temperature to 1.0. ... We follow the training setup from Yao et al. (2024), using a learning rate of 1e-6 and a batch size of 128. A constant learning rate scheduler with a warmup ratio of 0.1 is applied.
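Since the paper itself provides no pseudocode (see the Pseudocode row above), one iteration of a generic STaR-style self-evolving loop can be sketched as follows. This is a hedged reconstruction from the paper's prose, not the authors' M-STAR implementation: the function names (`self_evolve_step`, `toy_sampler`) and the exact-match answer filter are our assumptions.

```python
# Illustrative sketch of one generic self-evolving (STaR-style) iteration:
# sample several responses per query, keep those whose final answer matches
# the reference, and reuse them as fine-tuning data for the next round.
# This is a reconstruction under assumptions, not the authors' M-STAR code.

def self_evolve_step(queries, references, sample_fn, k=16):
    """Collect correct self-generated responses for the next SFT round.

    sample_fn(query, k) must return a list of (response_text, final_answer)
    tuples, standing in for k rollouts from the multimodal model.
    """
    new_training_data = []
    for query, ref in zip(queries, references):
        for response, answer in sample_fn(query, k):
            if answer == ref:  # keep only rollouts with the correct answer
                new_training_data.append((query, response))
    return new_training_data

# Toy sampler standing in for an LMM rollout (hypothetical):
def toy_sampler(query, k):
    return [(f"{query} -> reasoning #{i}", query.upper()) for i in range(k)]

data = self_evolve_step(["q1", "q2"], ["Q1", "x"], toy_sampler, k=4)
# q1's rollouts match the reference answer and are kept; q2's never do.
```

In a real run the filtered pairs would be appended to the training pool and the model fine-tuned before the next rollout phase; the paper's "dynamic adjustments" for exploration loss are not modeled here.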
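The split sizes quoted in the Dataset Splits row (180K labeled, 180K unlabeled, 750 ID test, 250 validation) can be sanity-checked with a small index-partition sketch. The paper only gives the sizes; the random shuffling, the carving of both held-out sets from the unlabeled half, and the function name `split_mathv360k` are our assumptions.

```python
import random

def split_mathv360k(n_total=360_000, n_id_test=750, n_val=250, seed=0):
    """Partition indices per the described split: half labeled, half set
    aside as unlabeled; an ID test set and a validation set are carved out
    of the unlabeled half. Index arithmetic is ours, sizes are the paper's."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    labeled = idx[: n_total // 2]                # 180K labeled training set
    rest = idx[n_total // 2 :]                   # remaining 180K examples
    id_test = rest[:n_id_test]                   # 750 in-domain test samples
    val = rest[n_id_test : n_id_test + n_val]    # 250 validation samples
    unlabeled = rest[n_id_test + n_val :]        # answers unused in training
    return labeled, unlabeled, id_test, val
```

By construction the four index sets are disjoint, matching the "non-overlapping" validation set described in the quote.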
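The schedule quoted in the Experiment Setup row (constant learning rate of 1e-6 over 10K steps with a warmup ratio of 0.1) implies a linear ramp over the first 1,000 steps followed by a flat rate. A minimal sketch of that schedule, assuming linear warmup starting from zero (the exact warmup shape is not stated in the quote):

```python
def lr_at_step(step, total_steps=10_000, base_lr=1e-6, warmup_ratio=0.1):
    """Constant schedule with linear warmup, matching the quoted setup:
    the rate ramps linearly to base_lr over the first warmup_ratio of
    training steps, then stays constant for the remainder."""
    warmup_steps = int(total_steps * warmup_ratio)  # 1,000 steps here
    if step < warmup_steps:
        return base_lr * step / warmup_steps        # linear ramp from 0
    return base_lr                                  # constant thereafter
```

For example, `lr_at_step(500)` is halfway through warmup and yields half the base rate, while any step from 1,000 onward returns the full 1e-6.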