LLaVA-Video: Video Instruction Tuning With Synthetic Data
Authors: Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We conducted evaluations for the LLaVA-Video models across all benchmarks using LMMs-Eval (Zhang et al., 2024a) to ensure standardization and reproducibility. For full evaluation, we consider 11 video benchmarks, conducting tests across various video captioning, video open-ended question-answering, and video multiple-choice question-answering benchmarks. |
| Researcher Affiliation | Collaboration | Yuanhan Zhang (S-Lab, Nanyang Technological University); Jinming Wu (BUPT); Wei Li (ByteDance); Bo Li (S-Lab, Nanyang Technological University); Zejun Ma (ByteDance); Ziwei Liu (S-Lab, Nanyang Technological University); Chunyuan Li (ByteDance) |
| Pseudocode | No | The paper describes methods and pipelines in text and with diagrams (e.g., Figure 2 for the video detail description creation pipeline), but does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Open-Source: In an effort to support the development of general-purpose visual assistants, we release our multimodal instruction data, codebase, model checkpoints, and a visual chat demo to the public. |
| Open Datasets | Yes | Video-language Instruction-Following Data: We present a high-quality dataset LLaVA-Video-178K tailored for video instruction-following. It consists of 178K videos with 1.3M instruction samples, including detailed captions, free-form and multiple-choice question answering. Open-Source: In an effort to support the development of general-purpose visual assistants, we release our multimodal instruction data, codebase, model checkpoints, and a visual chat demo to the public. We fine-tune LLaVA-OneVision (SI) on the joint dataset of video and image data. Specifically, we added video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA (Yu et al., 2019), NExT-QA (Xiao et al., 2021), Perception Test (Pătrăucean et al., 2023), and LLaVA-Hound-255K (Zhang et al., 2024d). |
| Dataset Splits | Yes | For ablation studies in Sec. 4.2 and Sec. 4.3, we conduct evaluation across 4 datasets. NExT-QA (Xiao et al., 2021) and Perception Test (Pătrăucean et al., 2023), which use training data from the LLaVA-Video-178K, are treated as in-domain datasets. Conversely, Video-MME (Fu et al., 2024) and EgoSchema (Mangalam et al., 2024) are considered zero-shot datasets. We fine-tune LLaVA-OneVision (SI) on the joint dataset of video and image data. Specifically, we added video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA (Yu et al., 2019), NExT-QA (Xiao et al., 2021), Perception Test (Pătrăucean et al., 2023), and LLaVA-Hound-255K (Zhang et al., 2024d), focusing on videos shorter than three minutes. |
| Hardware Specification | Yes | On 128 NVIDIA H100 GPUs, the video representations for LLaVA-Video-7B and LLaVA-Video-72B are V = (64, 679, 1, 2) and V = (64, 679, 3, 2), respectively. ...with the Qwen2-72B model, we could only process 8 frames before maxing out the memory on 128 NVIDIA H100 GPUs. |
| Software Dependencies | No | The paper mentions several models and tools like GPT-4o, PySceneDetect, SigLIP, Qwen2, and sentence-transformers with citations, but does not specify software versions for programming languages, libraries, or operating systems used for the experiments. |
| Experiment Setup | Yes | We fine-tune LLaVA-OneVision (SI) on the joint dataset of video and image data. Specifically, we added video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA (Yu et al., 2019), NExT-QA (Xiao et al., 2021), Perception Test (Pătrăucean et al., 2023), and LLaVA-Hound-255K (Zhang et al., 2024d), focusing on videos shorter than three minutes. These datasets were selected to improve our model’s performance, contributing to a total of 1.6 million video-language samples, which include 193,510 video descriptions, 1,241,412 open-ended questions, and 215,625 multiple-choice questions. Additionally, we used 1.1 million image-language pairs from the LLaVA-OneVision model (Li et al., 2024c). We consider the same video representation configurations for the training and inference stages. On 128 NVIDIA H100 GPUs, the video representations for LLaVA-Video-7B and LLaVA-Video-72B are V = (64, 679, 1, 2) and V = (64, 679, 3, 2), respectively. |
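As a quick sanity check on the figures quoted in the Experiment Setup row, the three per-category sample counts from the paper can be summed directly; this is a minimal sketch using only the numbers quoted above, and the variable names are illustrative rather than taken from any released code:

```python
# Per-category video-language sample counts quoted in the paper excerpt above.
video_descriptions = 193_510
open_ended_questions = 1_241_412
multiple_choice_questions = 215_625

# Total video-language samples; the paper rounds this to "1.6 million".
total = video_descriptions + open_ended_questions + multiple_choice_questions
print(total)  # -> 1650547, i.e. roughly 1.6M, consistent with the paper's claim
```

The sum (1,650,547) confirms that the stated breakdown is internally consistent with the "1.6 million video-language samples" total.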