Scaling Capability in Token Space: An Analysis of Large Vision Language Model

Authors: Tenghui Li, Guoxu Zhou, Xuyang Zhao, Qibin Zhao

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical validations across multiple vision-language benchmarks show that model performance matches the predictions of the scaling relationship. The findings contribute to understanding vision-token scaling in transformers through a theoretical framework that complements empirical observations. This section systematically evaluates the proposed model across multiple vision-language benchmarks to investigate the relationship between the number of vision tokens and performance, providing empirical support for the scaling behavior established in Section 3.
Researcher Affiliation | Academia | Tenghui Li (EMAIL), School of Automation, Guangdong University of Technology, Guangzhou 510006, China; Guoxu Zhou (EMAIL), School of Automation, Guangdong University of Technology, Guangzhou 510006, China, Key Laboratory of Intelligent Detection and the Internet of Things in Manufacturing, Ministry of Education, Guangzhou, China, and Guangdong Provincial Key Laboratory of Intelligent Systems and Optimization Integration (GDUT), Guangzhou 510006, China; Xuyang Zhao (EMAIL), Medical Science Data-driven Mathematics Team, RIKEN Center for Interdisciplinary Theoretical and Mathematical Sciences, Yokohama 230-0045, Japan, Medical Data Mathematical Reasoning Special Team, RIKEN Center for Integrative Medical Sciences, Yokohama 230-0045, Japan, and Department of Artificial Intelligence Medicine, Chiba University, Chiba 260-0856, Japan; Qibin Zhao (EMAIL), Tensor Learning Team, RIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan, and School of Automation, Guangdong University of Technology, Guangzhou 510006, China
Pseudocode | No | The paper describes its methodology only in natural language and mathematical formulations, without any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/tenghuilee/Scaling_Cap_Fused_Vision_LM.git
Open Datasets | Yes | For the empirical analysis, the orca_dpo_pairs dataset (https://huggingface.co/datasets/Intel/orca_dpo_pairs) is utilized. The primary datasets employed include LLaVA V1.5 Mix665K (Liu et al., 2023), BAAI-SVIT (Zhao et al., 2023), and mPLUG DocDownstream 1.0 (Ye et al., 2023).
Dataset Splits | Yes | To reduce the substantial time required for full-data fine-tuning, we reuse the model trained in the second step (nl = 256, ns = 8) and fine-tune it using only 10% of randomly sampled training data, which takes approximately 10 hours. VLMEvalKit (Duan et al., 2024) is employed to ensure the standardization and reproducibility of evaluations across multiple vision-language benchmarks, including MME (Fu et al., 2025), HallusionBench (Guan et al., 2024), POPE (Li et al., 2023c), and others.
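The 10% random subsample described above can be sketched as follows. Note that `sample_subset` is a hypothetical helper, not taken from the paper's released code; the fixed seed and the use of sampling without replacement are our assumptions, shown only to make the subsampling step concrete.

```python
import random

def sample_subset(items, fraction=0.10, seed=0):
    """Draw a reproducible random fraction of the training data.

    Hypothetical helper mirroring the reported 10% random subsample;
    the seed value and exact sampling routine are assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    k = max(1, int(len(items) * fraction))
    return rng.sample(items, k)  # sample without replacement

# Example with a 665,000-item corpus (the size suggested by LLaVA V1.5 Mix665K):
subset = sample_subset(list(range(665_000)))
print(len(subset))  # 66500
```

Sampling without replacement keeps each training example at most once in the subset, which matches the usual fine-tuning-on-a-fraction setup.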
Hardware Specification | Yes | The experiments were conducted on high-performance hardware comprising 8 NVIDIA A100 GPUs, each with 40 GB of memory. For evaluation on more accessible hardware, we utilized NVIDIA RTX A6000 GPUs, each with 48 GB of memory. GPU utilization and memory usage were measured on a single NVIDIA RTX 3090 GPU (24 GB).
Software Dependencies | No | The paper mentions software components such as CLIP ViT-H/14, Llama-2 7B, and SciPy, but does not provide version numbers for these or other key dependencies (e.g., Python, PyTorch, CUDA), which are essential for reproducibility.
Experiment Setup | Yes | In the first step, we perform preliminary training using the contrastive loss introduced in Equation 35 to align the fused vision tokens with the CLIP text encoder. Since the maximum sequence length of the CLIP text encoder is 77 and most questions exceed this limit, we extend the sequence length to 512 to accommodate longer inputs. The modules involved in this step include the vision encoder, the 2D convolution layer for merging neighboring vision tokens, the fused model, and the CLIP text encoder. This step requires only a few training steps, using a batch size of 32 per device, gradient accumulation of 1, an equivalent batch size of 256, and a learning rate of 2e-5. The training employs a cosine learning rate scheduler with a warm-up ratio of 0.1 and is performed over 1000 steps. The second step involves fine-tuning on the full dataset. The required modules include the vision encoder, the 2D convolution layer, the fused model, a linear projection layer to map the fused vision tokens to the Llama-2 hidden size, and the Llama-2 7B backbone. In this stage, the Llama-2 backbone is frozen, and the remaining modules are updated. The training uses a batch size of 5 per device, gradient accumulation of 64, and an equivalent batch size of 2560, with a learning rate of 2e-5. A cosine learning rate scheduler with a warm-up ratio of 0.03 is employed. The training spans two epochs and uses the generation loss described in Equation 36.
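The "equivalent batch size" figures reported for both training steps are consistent with data-parallel training on the 8 A100 GPUs listed in the hardware row. A minimal sketch of that arithmetic (the helper name is ours, and the 8-GPU data-parallel assumption is inferred from the hardware specification, not stated explicitly for these numbers):

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 8) -> int:
    """Effective (equivalent) batch size under data parallelism:
    per-device batch * gradient-accumulation steps * number of GPUs."""
    return per_device * grad_accum * num_gpus

# Step 1 (contrastive alignment): batch 32 per device, no gradient accumulation.
step1 = effective_batch_size(per_device=32, grad_accum=1)
# Step 2 (full fine-tuning): batch 5 per device, gradient accumulation of 64.
step2 = effective_batch_size(per_device=5, grad_accum=64)

print(step1, step2)  # 256 2560
```

Both results match the equivalent batch sizes of 256 and 2560 quoted in the setup description, which supports reading the per-device figures as applying to each of the 8 GPUs.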