LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Authors: Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Jim Fan, Yuke Zhu, Yao Lu, Song Han
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | LongVILA-7B demonstrates strong accuracy on 9 popular video benchmarks, e.g., 65.1% on Video-MME with subtitles. Besides, MM-SP is 2.1× to 5.7× faster than ring-style sequence parallelism and 1.1× to 1.4× faster than Megatron with hybrid context and tensor parallelism. Our code and models are available at github.com/NVlabs/VILA/longvila. |
| Researcher Affiliation | Collaboration | NVIDIA, MIT, UC Berkeley, UT Austin |
| Pseudocode | No | The paper describes the MM-SP workflow in a descriptive manner, outlining steps for sharding and communication, but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | Yes | Our code and models are available at github.com/NVlabs/VILA/longvila. |
| Open Datasets | Yes | We first use open-sourced image and video caption datasets to train the multi-modal projector in stage (1) to conduct the multi-modal alignment. To improve the quality of large open-sourced datasets, we follow VILA2 (Fang et al., 2024) to relabel COYO-25M (Lin et al., 2023b; Byeon et al., 2022) with VILA-1.5-40B (Lin et al., 2023b). For short video comprehension, we utilize open-source video instruction-following datasets, e.g., YouCook2 (Zhou et al., 2018) and ShareGPTVideo (Zhang et al., 2024c). We use the original long videos from the Shot2Story dataset (Han et al., 2023). |
| Dataset Splits | No | The paper mentions various datasets used for training and fine-tuning, including a newly constructed dataset for long video training. However, it does not explicitly provide specific training, validation, and test splits (e.g., percentages or exact counts) for any of these datasets in the main text. |
| Hardware Specification | Yes | These processes collectively require approximately 336 GPU hours on machines equipped with 80GB A100 GPUs. We conduct most experiments on H100 nodes, each equipped with 8x H100 (80GB) GPUs interconnected via intra-node NVLink and 400 Gbps inter-node InfiniBand. For experiments involving the maximum supported sequence length during training, we extend the setup to 32 A100 nodes, each with 8x A100 (80GB) GPUs, where the conclusions are consistent with those for H100 due to the equivalent total memory. |
| Software Dependencies | Yes | Our system is currently implemented in Triton (Tillet et al., 2019). We use the fp16 data type and Flash-Attention2 (Dao, 2024) on one A100 GPU for latency measurement. |
| Experiment Setup | Yes | Following Stage 2 of our methodology, we execute a continuation of pre-training on the LLM to enhance its context length to 262,144, utilizing a total of 17B tokens. We employ a progressive training schedule, incrementally increasing the context length from 8,192 to 65,536, and ultimately to 262,144, utilizing the SlimPajama dataset (Soboleva et al., 2023) in accordance with the methodology outlined by Fu et al. (2024d). We use low-rank adaptation for context-extension fine-tuning (Chen et al., 2024b). Our evaluations are based on an 8B model with a batch size of 1. For k GPUs, we use k images per video and a batch size of k. The results were obtained after 10 warmup iterations and averaged over 5 iterations to minimize variance. |
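The paper describes MM-SP as sharding a long multimodal sequence (image and text tokens) across GPUs so that per-rank compute stays balanced, but gives the workflow only in prose. The sketch below illustrates the balancing idea with a greedy longest-first assignment; it is an illustrative simplification, not the paper's actual two-stage sharding implementation, and the chunk sizes in the example are hypothetical.

```python
# Illustrative sketch of balanced sequence sharding across GPU ranks, in the
# spirit of MM-SP's load-balanced sharding (simplified greedy scheme, not the
# paper's implementation).
import heapq

def shard_sequence(chunk_lengths, num_ranks):
    """Assign variable-length chunks (e.g. per-frame image-token blocks and
    text segments) to ranks so total token counts stay balanced.

    Returns a list of chunk-index lists, one per rank."""
    # Min-heap of (tokens_assigned, rank); place largest chunks first so the
    # greedy choice keeps loads close together.
    heap = [(0, r) for r in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    order = sorted(range(len(chunk_lengths)), key=lambda i: -chunk_lengths[i])
    for i in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(i)
        heapq.heappush(heap, (load + chunk_lengths[i], rank))
    return assignment

# Hypothetical example: 6 frames of 196 image tokens each plus a 300-token
# text segment, sharded over 4 GPUs.
chunks = [196] * 6 + [300]
shards = shard_sequence(chunks, 4)
loads = [sum(chunks[i] for i in s) for s in shards]
```

In a real sequence-parallel system the per-rank shards would then be processed with ring- or all-to-all-style attention communication; this sketch only covers the partitioning step.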
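The experiment-setup excerpt describes a progressive context-extension schedule (8,192 → 65,536 → 262,144 tokens over a 17B-token budget) but not how the budget is divided. The sketch below plans such a schedule assuming, for illustration only, an even token split per stage; the paper does not state the actual per-stage allocation.

```python
# Hypothetical planner for a progressive context-extension schedule.
# The even per-stage token split is an assumption for illustration; the
# paper only specifies the stage lengths and the 17B total token budget.
def plan_schedule(stage_context_lengths=(8_192, 65_536, 262_144),
                  total_tokens=17_000_000_000):
    """Return (context_length, token_budget) pairs, one per training stage,
    splitting the total token budget evenly across stages."""
    per_stage = total_tokens // len(stage_context_lengths)
    return [(ctx_len, per_stage) for ctx_len in stage_context_lengths]

schedule = plan_schedule()
```

Each stage would then continue pre-training (here, with LoRA-based context extension per the excerpt) at its context length before moving to the next, longer one.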