LLaVA-OneVision: Easy Visual Task Transfer

Authors: Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos. We conduct standardized and reproducible evaluations for LLaVA-OneVision models on all benchmarks using LMMs-Eval (Zhang et al., 2024b).
Researcher Affiliation | Collaboration | Bo Li (S-Lab, Nanyang Technological University); Yuanhan Zhang (S-Lab, Nanyang Technological University); Dong Guo (ByteDance); Renrui Zhang (Chinese University of Hong Kong); Feng Li (Hong Kong University of Science and Technology); Hao Zhang (Hong Kong University of Science and Technology); Kaichen Zhang (S-Lab, Nanyang Technological University); Peiyuan Zhang (S-Lab, Nanyang Technological University); Yanwei Li (ByteDance); Ziwei Liu (S-Lab, Nanyang Technological University); Chunyuan Li (ByteDance)
Pseudocode | No | The paper includes network architecture diagrams (Figures 1, 2, 3) and equations, but no structured pseudocode or algorithm blocks are explicitly presented.
Open Source Code | Yes | Open-source. To pave the way towards building a general-purpose visual assistant, we release the following assets to the public: the generated multimodal instruction data, the codebase, the model checkpoints, and a visual chat demo.
Open Datasets | Yes | Open-source. To pave the way towards building a general-purpose visual assistant, we release the following assets to the public: the generated multimodal instruction data, the codebase, the model checkpoints, and a visual chat demo. We used the model to generate new captions for the images from the following datasets: COCO118K, BLIP558K, and CC3M. We utilized the Text Reading subset from the UReader dataset, totaling 100K, which is easily accessible through PDF rendering.
Dataset Splits | Yes | (i) Single-Image Training: The model is first trained on 3.2 million single-image instructions, resulting in a model with strong performance in following a diverse set of instructions to complete visual tasks using a single image. (ii) OneVision Training: The model is then trained on a mixture of video, single-image, and multi-image data. In this phase, the model expands its capabilities from single-image scenarios to diverse scenarios. It learns to follow instructions to complete tasks in each new scenario and to transfer the learned knowledge across different scenarios, resulting in new emergent capabilities. We introduce a total of 1.6 million mixed data samples, comprising 560K multi-image data from Li et al. (2024d), 350K videos collected in this project, and 800K single-image samples. We conduct standardized and reproducible evaluations for LLaVA-OneVision models on all benchmarks using LMMs-Eval (Zhang et al., 2024b). For fair comparison with other leading LMMs, we primarily report results from original papers. When results are unavailable, we onboard the models in LMMs-Eval and evaluate them using consistent settings. All our results are reported with greedy decoding and 0-shot settings unless otherwise specified.
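The reported evaluation protocol (greedy decoding, 0-shot) can be captured in a small configuration sketch. The field names below are illustrative placeholders, not the actual LMMs-Eval schema:

```python
# Hypothetical evaluation config mirroring the paper's stated protocol:
# deterministic greedy decoding and zero-shot prompting on every benchmark.
eval_config = {
    "decoding": {"do_sample": False, "num_beams": 1},  # greedy decoding
    "num_fewshot": 0,                                  # 0-shot setting
    "harness": "lmms-eval",                            # standardized toolkit named in the paper
}

def is_greedy(cfg):
    """True when the config requests deterministic greedy decoding."""
    d = cfg["decoding"]
    return not d["do_sample"] and d["num_beams"] == 1

print(is_greedy(eval_config))  # True
```

Pinning decoding to greedy and shots to zero is what makes results comparable across models onboarded into the same harness.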
Hardware Specification | No | The paper discusses a 'fixed compute budget' and 'increased computational resources' but does not specify any particular hardware (e.g., GPU models, CPU types) used for the experiments.
Software Dependencies | No | We choose Qwen-2 (Yang et al., 2024) as our LLM f_ϕ(·) parameterized by ϕ... We consider SigLIP (Zhai et al., 2023) as the visual encoder g_ψ(·)... We utilize the Qwen-2 series (Yang et al., 2024) language models with the template as OpenAI's ChatML. The paper mentions specific models (Qwen-2, SigLIP) and refers to OpenAI's ChatML template but does not list general software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | Table 1: Detailed configuration for each training stage of the LLaVA-OneVision model. The table outlines the progression of vision parameters, dataset characteristics, model specifications, and training hyperparameters across the stages of the curriculum learning process. We use a global batch size of 512 for the 0.5B model, and 256 for the 7B and 72B models. ... Batch Size: 512, 256/512, 256/512, 256/512; LR (ψ_vision): 1e-3, 2e-6, 2e-6, 2e-6; LR ({θ_proj, ϕ_LLM}): 1e-3, 1e-5, 1e-5, 1e-5; Epochs: 1, 1, 1, 1. ... Regarding trainable modules, Stage-1 updates only the projector, while the subsequent stages update the full model.
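The staged hyperparameters quoted above can be transcribed into a small sketch for reference. Values follow the excerpt from Table 1; the stage keys are paraphrased labels, not the paper's exact column headers:

```python
# Per-stage training configuration transcribed from the paper's Table 1 excerpt.
# lr_vision covers the vision encoder (psi); lr_proj_llm covers the
# projector (theta) and LLM (phi) jointly.
stages = {
    "stage1":    {"lr_vision": 1e-3, "lr_proj_llm": 1e-3, "epochs": 1, "trainable": "projector only"},
    "stage1.5":  {"lr_vision": 2e-6, "lr_proj_llm": 1e-5, "epochs": 1, "trainable": "full model"},
    "stage2":    {"lr_vision": 2e-6, "lr_proj_llm": 1e-5, "epochs": 1, "trainable": "full model"},
    "onevision": {"lr_vision": 2e-6, "lr_proj_llm": 1e-5, "epochs": 1, "trainable": "full model"},
}

# After Stage-1, the vision encoder gets a much smaller learning rate than the
# projector/LLM, which keeps the pretrained visual features relatively stable
# while the rest of the model adapts to instruction data.
assert all(cfg["epochs"] == 1 for cfg in stages.values())
assert stages["stage1.5"]["lr_vision"] < stages["stage1.5"]["lr_proj_llm"]
```

Each stage runs for a single epoch, so the curriculum progresses by changing data and trainable modules rather than by repeated passes over the same data.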