LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Authors: Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Jim Fan, Yuke Zhu, Yao Lu, Song Han
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | LongVILA-7B demonstrates strong accuracy on 9 popular video benchmarks, e.g., 65.1% on Video-MME with subtitles. Besides, MM-SP is 2.1× to 5.7× faster than ring-style sequence parallelism and 1.1× to 1.4× faster than Megatron with hybrid context and tensor parallelism. Our code and models are available at github.com/NVlabs/VILA/longvila. |
| Researcher Affiliation | Collaboration | NVIDIA, MIT, UC Berkeley, UT Austin |
| Pseudocode | No | The paper describes the MM-SP workflow in a descriptive manner, outlining steps for sharding and communication, but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | Yes | Our code and models are available at github.com/NVlabs/VILA/longvila. |
| Open Datasets | Yes | We first use open-sourced image and video caption datasets to train the multi-modal projector in stage (1) to conduct the multi-modal alignment. To improve the quality of large open-sourced datasets, we follow VILA2 (Fang et al., 2024) to relabel COYO-25M (Lin et al., 2023b; Byeon et al., 2022) with VILA-1.5-40B (Lin et al., 2023b). For short video comprehension, we utilize open-source video instruction-following datasets, e.g., YouCook2 (Zhou et al., 2018) and ShareGPTVideo (Zhang et al., 2024c). We use the original long videos from the Shot2Story dataset (Han et al., 2023). |
| Dataset Splits | No | The paper mentions various datasets used for training and fine-tuning, including a newly constructed dataset for long video training. However, it does not explicitly provide specific training, validation, and test splits (e.g., percentages or exact counts) for any of these datasets in the main text. |
| Hardware Specification | Yes | These processes collectively require approximately 336 GPU hours on machines equipped with 80GB A100 GPUs. We conduct most experiments on H100 nodes, each equipped with 8x H100 (80GB) GPUs interconnected via intra-node NVLink and 400 Gbps inter-node InfiniBand. For experiments involving the maximum supported sequence length during training, we extend the setup to 32 A100 nodes, each with 8x A100 (80GB) GPUs, where the conclusions are consistent with those for H100 due to the equivalent total memory. |
| Software Dependencies | Yes | Our system is currently implemented in Triton (Tillet et al., 2019). We use the fp16 data type and Flash-Attention2 (Dao, 2024) on one A100 GPU for latency measurement. |
| Experiment Setup | Yes | Following Stage 2 of our methodology, we execute a continuation of pre-training on the LLM to enhance its context length to 262,144, utilizing a total of 17B tokens. We employ a progressive training schedule, incrementally increasing the context length from 8,192 to 65,536, and ultimately to 262,144, utilizing the SlimPajama dataset (Soboleva et al., 2023) in accordance with the methodology outlined by Fu et al. (2024d). We use low-rank adaptation for context-extension fine-tuning (Chen et al., 2024b). Our evaluations are based on an 8B model with a batch size of 1. For k GPUs, we use k images per video and a batch size of k. The results were obtained after 10 warmup iterations and averaged over 5 iterations to minimize variance. |
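The paper describes MM-SP as sharding a long multimodal sequence (image and text tokens) across GPUs so that per-rank compute stays balanced, but gives the workflow only in prose. The sketch below illustrates the balancing idea with a greedy longest-first assignment; it is an illustrative simplification, not the paper's actual two-stage sharding implementation, and the chunk sizes in the example are hypothetical.

```python
# Illustrative sketch of balanced sequence sharding across GPU ranks, in the
# spirit of MM-SP's load-balanced sharding (simplified greedy scheme, not the
# paper's implementation).
import heapq

def shard_sequence(chunk_lengths, num_ranks):
    """Assign variable-length chunks (e.g. per-frame image-token blocks and
    text segments) to ranks so total token counts stay balanced.

    Returns a list of chunk-index lists, one per rank."""
    # Min-heap of (tokens_assigned, rank); place largest chunks first so the
    # greedy choice keeps loads close together.
    heap = [(0, r) for r in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    order = sorted(range(len(chunk_lengths)), key=lambda i: -chunk_lengths[i])
    for i in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(i)
        heapq.heappush(heap, (load + chunk_lengths[i], rank))
    return assignment

# Hypothetical example: 6 frames of 196 image tokens each plus a 300-token
# text segment, sharded over 4 GPUs.
chunks = [196] * 6 + [300]
shards = shard_sequence(chunks, 4)
loads = [sum(chunks[i] for i in s) for s in shards]
```

In a real sequence-parallel system the per-rank shards would then be processed with ring- or all-to-all-style attention communication; this sketch only covers the partitioning step.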
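The experiment-setup excerpt describes a progressive context-extension schedule (8,192 → 65,536 → 262,144 tokens over a 17B-token budget) but not how the budget is divided. The sketch below plans such a schedule assuming, for illustration only, an even token split per stage; the paper does not state the actual per-stage allocation.

```python
# Hypothetical planner for a progressive context-extension schedule.
# The even per-stage token split is an assumption for illustration; the
# paper only specifies the stage lengths and the 17B total token budget.
def plan_schedule(stage_context_lengths=(8_192, 65_536, 262_144),
                  total_tokens=17_000_000_000):
    """Return (context_length, token_budget) pairs, one per training stage,
    splitting the total token budget evenly across stages."""
    per_stage = total_tokens // len(stage_context_lengths)
    return [(ctx_len, per_stage) for ctx_len in stage_context_lengths]

schedule = plan_schedule()
```

Each stage would then continue pre-training (here, with LoRA-based context extension per the excerpt) at its context length before moving to the next, longer one.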