OmniBal: Towards Fast Instruction-Tuning for Vision-Language Models via Omniverse Computation Balance

Authors: Yongqiang Yao, Jingru Tan, Feizhao Zhang, Jiahao Hu, Yazhe Niu, Jin Xin, Bo Li, Pengfei Liu, Ruihao Gong, Dahua Lin, Ningyi Xu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted to validate the effectiveness of our method. Compared with the open-source training code of InternVL-Chat, training time is reduced greatly, achieving about a 1.8× speed-up. Our method's efficacy and generalizability are further validated across various models and datasets.
Researcher Affiliation | Collaboration | 1Shanghai Jiao Tong University, 2Central South University, 3SenseTime Research, 4The Chinese University of Hong Kong, 5Tongji University, 6Beihang University.
Pseudocode | Yes | Algorithm 1 ISF: Iterative Sampling and Filtering
Open Source Code | No | Codes will be released at https://github.com/ModelTC/OmniBal.
Open Datasets | Yes | We conduct experiments following the open-source InternVL-Chat-1.5 setting. Our vision and language models are InternViT-6B and InternLM2-20B, respectively. Two configurations are employed: InternVL-Chat-1.5 (6+20B) and InternVL-Chat-1.5-Plus (6+34B). As the InternVL-Chat-1.5 dataset is not yet available, we utilize the InternVL-Chat-1.2 dataset, which comprises approximately 1.2 million samples, as an alternative. All other training settings remain unchanged. GPU Days are our evaluation metric to estimate the total training time. Specifically, GPU Days are reported based on A100 GPU usage to evaluate the speed-up performance. We consistently achieved a low Dist Ratio across the LLaVA-665K, InternVL-1.2M, and LCS-558K datasets, as demonstrated in Table 8. Additionally, our approach significantly enhanced training speed.
Dataset Splits | No | The paper mentions using datasets like the "InternVL-Chat-1.2 dataset" and "LLaVA-665K", but it does not specify explicit training, validation, or test splits (e.g., percentages, sample counts, or references to predefined splits) needed for reproduction. It only states the total number of samples for some datasets.
Hardware Specification | Yes | GPU Days are reported based on A100 GPU usage to evaluate the speed-up performance. We test our method on various hardware platforms with different GPUs (e.g., A100, H100) and network bandwidths. The experiments in Table 15 confirmed consistent performance improvements across all platforms.
Software Dependencies | No | The paper mentions using "Megatron-DeepSpeed" as a backend, but it does not provide specific version numbers for this or any other software libraries, programming languages, or tools used in the experimental setup.
Experiment Setup | Yes | How to get Qv and Qt. We determine Qv and Qt using dataset statistics. The total text token length and image count are used to compute the average tokens per image. Qt is set to the longest text token length, and the text-to-image ratio determines Qv. It is challenging to maintain an exact number of images and text length, so we relax these conditions to allow for approximation: for images, Q̂v = Qv, and for text, Q̂t = Qt − 128, based on the results in Table 6. In the InternVL-Chat-1.2 dataset, Qt = 4K and Qv = 9, with each image processed into 1K tokens for the ViT. Table 9. Results for different model sizes are shown, with TP, PP, and DP representing the distributed training strategies Tensor Parallel (TP), Pipeline Parallel (PP), and Data Parallel (DP), respectively. The Stages-Layer-Num (V+L) column indicates the number of Vision Transformer (V) and Language Transformer (L) layers assigned to each stage. Additionally, the Re-computation column denotes the number of re-computations enabled in each stage.
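The quota estimation quoted above can be sketched in a few lines of Python. This is not the authors' released code: the sample-record field names, the helper names (`estimate_quotas`, `fits_relaxed`), and the use of 128 tokens as the relaxation slack are illustrative assumptions based only on the description in the Experiment Setup row (Qt = longest text length, Qv from the text-to-image ratio, Q̂t = Qt − 128).

```python
# Sketch (assumed, not the authors' implementation) of estimating the
# per-batch quotas Qt (text tokens) and Qv (images) from dataset stats.

def estimate_quotas(samples):
    """samples: list of dicts with 'text_tokens' (int) and 'num_images' (int)."""
    total_text = sum(s["text_tokens"] for s in samples)
    total_images = sum(s["num_images"] for s in samples)
    # Qt: the longest text-token length observed in the dataset.
    Qt = max(s["text_tokens"] for s in samples)
    # The text-to-image ratio determines how many images accompany Qt tokens.
    avg_text_per_image = total_text / max(total_images, 1)
    Qv = round(Qt / avg_text_per_image)
    return Qt, Qv

def fits_relaxed(batch_images, batch_text, Qt, Qv, slack=128):
    # Exact equality is hard to maintain, so the conditions are relaxed:
    # image count must equal Qv, text length may fall in [Qt - slack, Qt].
    return batch_images == Qv and Qt - slack <= batch_text <= Qt
```

For the InternVL-Chat-1.2 statistics quoted above (Qt = 4K, Qv = 9, ~1K ViT tokens per image), a candidate mini-batch would be accepted only if it packs exactly 9 images and between 3968 and 4096 text tokens.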