OmniBal: Towards Fast Instruction-Tuning for Vision-Language Models via Omniverse Computation Balance

Authors: Yongqiang Yao, Jingru Tan, Feizhao Zhang, Jiahao Hu, Yazhe Niu, Jin Xin, Bo Li, Pengfei Liu, Ruihao Gong, Dahua Lin, Ningyi Xu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted to validate the effectiveness of our method. Compared with the open-source training code of InternVL-Chat, training time is reduced greatly, achieving about a 1.8× speed-up. Our method's efficacy and generalizability are further validated across various models and datasets.
Researcher Affiliation | Collaboration | 1Shanghai Jiao Tong University, 2Central South University, 3SenseTime Research, 4The Chinese University of Hong Kong, 5Tongji University, 6Beihang University.
Pseudocode | Yes | Algorithm 1 ISF: Iterative Sampling and Filtering
Open Source Code | No | Codes will be released at https://github.com/ModelTC/OmniBal.
Open Datasets | Yes | We conduct experiments following the open-source InternVL-Chat-1.5 setting. Our vision and language models are InternViT-6B and InternLM2-20B, respectively. Two configurations are employed: InternVL-Chat-1.5 (6+20B) and InternVL-Chat-1.5-Plus (6+34B). As the InternVL-Chat-1.5 dataset is not yet available, we utilize the InternVL-Chat-1.2 dataset, which comprises approximately 1.2 million samples, as an alternative. All other training settings remain unchanged. GPU Days are our evaluation metric to estimate the total training time. Specifically, GPU Days are reported based on A100 GPU usage to evaluate the speed-up performance. We consistently achieved a low Dist Ratio across the LLaVA-665K, InternVL-1.2M, and LCS-558K datasets, as demonstrated in Table 8. Additionally, our approach significantly enhanced training speed.
Dataset Splits | No | The paper mentions using datasets like the "InternVL-Chat-1.2 dataset" and "LLaVA-665K", but it does not specify explicit training, validation, or test splits (e.g., percentages, sample counts, or references to predefined splits) needed for reproduction. It only states the total number of samples for some datasets.
Hardware Specification | Yes | GPU Days are reported based on A100 GPU usage to evaluate the speed-up performance. We test our method on various hardware platforms with different GPUs (e.g., A100, H100) and network bandwidths. The experiments in Table 15 confirmed consistent performance improvements across all platforms.
Software Dependencies | No | The paper mentions using "Megatron-DeepSpeed" as a backend, but it does not provide specific version numbers for this or any other software libraries, programming languages, or tools used in the experimental setup.
Experiment Setup | Yes | How to get Qv and Qt. We determine Qv and Qt using dataset statistics. The total text token length and image count are used to compute the average tokens per image. Qt is set to the longest text token length, and the text-to-image ratio determines Qv. It is challenging to maintain an exact number of images and text length, so we relax these conditions to allow for approximation: for images, Q̂v = Qv, and for text, Q̂t = Qt − 128, based on the results in Table 6. In the InternVL-Chat-1.2 dataset, Qt = 4K and Qv = 9, with each image processed into 1K tokens for the ViT. Table 9. Results for different model sizes are shown, with TP, PP, and DP representing the distributed training strategies Tensor Parallel (TP), Pipeline Parallel (PP), and Data Parallel (DP), respectively. The Stages-Layer-Num (V+L) column indicates the number of Vision Transformer (V) and Language Transformer (L) layers assigned to each stage. Additionally, the Re-computation column denotes the number of re-computations enabled in each stage.
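The quota estimation quoted above can be sketched in a few lines of Python. This is not the authors' released code: the sample-record field names, the helper names (`estimate_quotas`, `fits_relaxed`), and the use of 128 tokens as the relaxation slack are illustrative assumptions based only on the description in the Experiment Setup row (Qt = longest text length, Qv from the text-to-image ratio, Q̂t = Qt − 128).

```python
# Sketch (assumed, not the authors' implementation) of estimating the
# per-batch quotas Qt (text tokens) and Qv (images) from dataset stats.

def estimate_quotas(samples):
    """samples: list of dicts with 'text_tokens' (int) and 'num_images' (int)."""
    total_text = sum(s["text_tokens"] for s in samples)
    total_images = sum(s["num_images"] for s in samples)
    # Qt: the longest text-token length observed in the dataset.
    Qt = max(s["text_tokens"] for s in samples)
    # The text-to-image ratio determines how many images accompany Qt tokens.
    avg_text_per_image = total_text / max(total_images, 1)
    Qv = round(Qt / avg_text_per_image)
    return Qt, Qv

def fits_relaxed(batch_images, batch_text, Qt, Qv, slack=128):
    # Exact equality is hard to maintain, so the conditions are relaxed:
    # image count must equal Qv, text length may fall in [Qt - slack, Qt].
    return batch_images == Qv and Qt - slack <= batch_text <= Qt
```

For the InternVL-Chat-1.2 statistics quoted above (Qt = 4K, Qv = 9, ~1K ViT tokens per image), a candidate mini-batch would be accepted only if it packs exactly 9 images and between 3968 and 4096 text tokens.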