LLaVA-Video: Video Instruction Tuning With Synthetic Data
Authors: Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We conducted evaluations for the LLaVA-Video models across all benchmarks using LMMs-Eval (Zhang et al., 2024a) to ensure standardization and reproducibility. For full evaluation, we consider 11 video benchmarks, conducting tests across various video captioning, video open-ended question-answering, and video multiple-choice question-answering benchmarks. |
| Researcher Affiliation | Collaboration | Yuanhan Zhang (S-Lab, Nanyang Technological University); Jinming Wu (BUPT); Wei Li (ByteDance); Bo Li (S-Lab, Nanyang Technological University); Zejun Ma (ByteDance); Ziwei Liu (S-Lab, Nanyang Technological University); Chunyuan Li (ByteDance) |
| Pseudocode | No | The paper describes methods and pipelines in text and with diagrams (e.g., Figure 2 for the video detail description creation pipeline), but does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Open-Source: In an effort to support the development of general-purpose visual assistants, we release our multimodal instruction data, codebase, model checkpoints, and a visual chat demo to the public. |
| Open Datasets | Yes | Video-language Instruction-Following Data: We present a high-quality dataset LLaVA-Video-178K tailored for video instruction-following. It consists of 178K videos with 1.3M instruction samples, including detailed captions, free-form and multiple-choice question answering. Open-Source: In an effort to support the development of general-purpose visual assistants, we release our multimodal instruction data, codebase, model checkpoints, and a visual chat demo to the public. We fine-tune LLaVA-OneVision (SI) on the joint dataset of video and image data. Specifically, we added video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA (Yu et al., 2019), NExT-QA (Xiao et al., 2021), Perception Test (Pătrăucean et al., 2023), and LLaVA-Hound-255K (Zhang et al., 2024d). |
| Dataset Splits | Yes | For ablation studies in Sec. 4.2 and Sec. 4.3, we conduct evaluation across 4 datasets. NExT-QA (Xiao et al., 2021) and Perception Test (Pătrăucean et al., 2023), which use training data from the LLaVA-Video-178K, are treated as in-domain datasets. Conversely, Video-MME (Fu et al., 2024) and EgoSchema (Mangalam et al., 2024) are considered zero-shot datasets. We fine-tune LLaVA-OneVision (SI) on the joint dataset of video and image data. Specifically, we added video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA (Yu et al., 2019), NExT-QA (Xiao et al., 2021), Perception Test (Pătrăucean et al., 2023), and LLaVA-Hound-255K (Zhang et al., 2024d), focusing on videos shorter than three minutes. |
| Hardware Specification | Yes | On 128 NVIDIA H100 GPUs, the video representations for LLaVA-Video-7B and LLaVA-Video-72B are V = (64, 679, 1, 2) and V = (64, 679, 3, 2), respectively. ...with the Qwen2-72B model, we could only process 8 frames before maxing out the memory on 128 NVIDIA H100 GPUs. |
| Software Dependencies | No | The paper mentions several models and tools like GPT-4o, PySceneDetect, SigLIP, Qwen2, and sentence-transformers with citations, but does not specify software versions for programming languages, libraries, or operating systems used for the experiments. |
| Experiment Setup | Yes | We fine-tune LLaVA-OneVision (SI) on the joint dataset of video and image data. Specifically, we added video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA (Yu et al., 2019), NExT-QA (Xiao et al., 2021), Perception Test (Pătrăucean et al., 2023), and LLaVA-Hound-255K (Zhang et al., 2024d), focusing on videos shorter than three minutes. These datasets were selected to improve our model’s performance, contributing to a total of 1.6 million video-language samples, which include 193,510 video descriptions, 1,241,412 open-ended questions, and 215,625 multiple-choice questions. Additionally, we used 1.1 million image-language pairs from the LLaVA-OneVision model (Li et al., 2024c). We consider the same video representation configurations for the training and inference stages. On 128 NVIDIA H100 GPUs, the video representations for LLaVA-Video-7B and LLaVA-Video-72B are V = (64, 679, 1, 2) and V = (64, 679, 3, 2), respectively. |
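As a quick sanity check on the figures quoted in the Experiment Setup row, the three per-category sample counts from the paper can be summed directly; this is a minimal sketch using only the numbers quoted above, and the variable names are illustrative rather than taken from any released code:

```python
# Per-category video-language sample counts quoted in the paper excerpt above.
video_descriptions = 193_510
open_ended_questions = 1_241_412
multiple_choice_questions = 215_625

# Total video-language samples; the paper rounds this to "1.6 million".
total = video_descriptions + open_ended_questions + multiple_choice_questions
print(total)  # -> 1650547, i.e. roughly 1.6M, consistent with the paper's claim
```

The sum (1,650,547) confirms that the stated breakdown is internally consistent with the "1.6 million video-language samples" total.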