LLaVA-OneVision: Easy Visual Task Transfer

Authors: Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos. We conduct standardized and reproducible evaluations for LLaVA-OneVision models on all benchmarks using LMMs-Eval (Zhang et al., 2024b).
Researcher Affiliation | Collaboration | Bo Li (S-Lab, Nanyang Technological University); Yuanhan Zhang (S-Lab, Nanyang Technological University); Dong Guo (ByteDance); Renrui Zhang (Chinese University of Hong Kong); Feng Li (Hong Kong University of Science and Technology); Hao Zhang (Hong Kong University of Science and Technology); Kaichen Zhang (S-Lab, Nanyang Technological University); Peiyuan Zhang (S-Lab, Nanyang Technological University); Yanwei Li (ByteDance); Ziwei Liu (S-Lab, Nanyang Technological University); Chunyuan Li (ByteDance)
Pseudocode | No | The paper includes network architecture diagrams (Figures 1, 2, 3) and equations, but no structured pseudocode or algorithm blocks are explicitly presented.
Open Source Code | Yes | Open-source. To pave the way towards building a general-purpose visual assistant, we release the following assets to the public: the generated multimodal instruction data, the codebase, the model checkpoints, and a visual chat demo.
Open Datasets | Yes | Open-source. To pave the way towards building a general-purpose visual assistant, we release the following assets to the public: the generated multimodal instruction data, the codebase, the model checkpoints, and a visual chat demo. We used the model to generate new captions for the images from the following datasets: COCO118K, BLIP558K, and CC3M. We utilized the Text Reading subset from the UReader dataset, totaling 100K, which is easily accessible through PDF rendering.
Dataset Splits | Yes | (i) Single-Image Training: The model is first trained on 3.2 million single-image instructions, resulting in a model with strong performance in following a diverse set of instructions to complete visual tasks using a single image. (ii) OneVision Training: The model is then trained on a mixture of video, single-image, and multi-image data. In this phase, the model expands its capabilities from single-image scenarios to diverse scenarios. It learns to follow instructions to complete tasks in each new scenario and to transfer the learned knowledge across different scenarios, resulting in new emergent capabilities. We introduce a total of 1.6 million mixed data samples, comprising 560K multi-image data from Li et al. (2024d), 350K videos collected in this project, and 800K single-image samples. We conduct standardized and reproducible evaluations for LLaVA-OneVision models on all benchmarks using LMMs-Eval (Zhang et al., 2024b). For fair comparison with other leading LMMs, we primarily report results from original papers. When results are unavailable, we onboard the models in LMMs-Eval and evaluate them using consistent settings. All our results are reported with greedy decoding and 0-shot settings unless otherwise specified.
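The reported evaluation protocol (greedy decoding, 0-shot) can be captured in a small configuration sketch. The field names below are illustrative placeholders, not the actual LMMs-Eval schema:

```python
# Hypothetical evaluation config mirroring the paper's stated protocol:
# deterministic greedy decoding and zero-shot prompting on every benchmark.
eval_config = {
    "decoding": {"do_sample": False, "num_beams": 1},  # greedy decoding
    "num_fewshot": 0,                                  # 0-shot setting
    "harness": "lmms-eval",                            # standardized toolkit named in the paper
}

def is_greedy(cfg):
    """True when the config requests deterministic greedy decoding."""
    d = cfg["decoding"]
    return not d["do_sample"] and d["num_beams"] == 1

print(is_greedy(eval_config))  # True
```

Pinning decoding to greedy and shots to zero is what makes results comparable across models onboarded into the same harness.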
Hardware Specification | No | The paper discusses a 'fixed compute budget' and 'increased computational resources' but does not specify any particular hardware (e.g., GPU models, CPU types) used for the experiments.
Software Dependencies | No | We choose Qwen-2 (Yang et al., 2024) as our LLM f_ϕ(·) parameterized by ϕ... We consider SigLIP (Zhai et al., 2023) as the visual encoder g_ψ(·)... We utilize the Qwen-2 series (Yang et al., 2024) language models with the template as OpenAI's ChatML. The paper mentions specific models (Qwen-2, SigLIP) and refers to OpenAI's ChatML template but does not list general software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | Table 1: Detailed configuration for each training stage of the LLaVA-OneVision model. The table outlines the progression of vision parameters, dataset characteristics, model specifications, and training hyperparameters across the stages of the curriculum learning process. We use a global batch size of 512 for the 0.5B model, and 256 for the 7B and 72B models. ... Batch Size: 512, 256/512, 256/512, 256/512; LR (ψ_vision): 1e-3, 2e-6, 2e-6, 2e-6; LR ({θ_proj, ϕ_LLM}): 1e-3, 1e-5, 1e-5, 1e-5; Epochs: 1, 1, 1, 1. ... Regarding trainable modules, Stage-1 updates only the projector, while the subsequent stages update the full model.
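The staged hyperparameters quoted above can be transcribed into a small sketch for reference. Values follow the excerpt from Table 1; the stage keys are paraphrased labels, not the paper's exact column headers:

```python
# Per-stage training configuration transcribed from the paper's Table 1 excerpt.
# lr_vision covers the vision encoder (psi); lr_proj_llm covers the
# projector (theta) and LLM (phi) jointly.
stages = {
    "stage1":    {"lr_vision": 1e-3, "lr_proj_llm": 1e-3, "epochs": 1, "trainable": "projector only"},
    "stage1.5":  {"lr_vision": 2e-6, "lr_proj_llm": 1e-5, "epochs": 1, "trainable": "full model"},
    "stage2":    {"lr_vision": 2e-6, "lr_proj_llm": 1e-5, "epochs": 1, "trainable": "full model"},
    "onevision": {"lr_vision": 2e-6, "lr_proj_llm": 1e-5, "epochs": 1, "trainable": "full model"},
}

# After Stage-1, the vision encoder gets a much smaller learning rate than the
# projector/LLM, which keeps the pretrained visual features relatively stable
# while the rest of the model adapts to instruction data.
assert all(cfg["epochs"] == 1 for cfg in stages.values())
assert stages["stage1.5"]["lr_vision"] < stages["stage1.5"]["lr_proj_llm"]
```

Each stage runs for a single epoch, so the curriculum progresses by changing data and trainable modules rather than by repeated passes over the same data.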