LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Authors: Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To evaluate our method, we conduct extensive experiments across various video understanding benchmarks, including EgoSchema (Mangalam et al., 2024), MVBench (Li et al., 2024b), Video-MME (Fu et al., 2024), and MLVU (Zhou et al., 2024). Our LongVU significantly outperforms several recent open-source video LLM models, such as VideoChat2 (Li et al., 2024b), LongVA (Zhang et al., 2024a), and LLaVA-OneVision (Li et al., 2024a), by a large margin.
Researcher Affiliation Collaboration 1Meta AI 2King Abdullah University of Science and Technology (KAUST) 3Korea University
Pseudocode No The paper describes the methods in prose and equations (e.g., Equation 1 and Equation 2) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes Our work introduces a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos. Our open-sourced model paves the way for future research in video compression tailored for MLLM-based applications, enabling more effective long-video, media, and streaming video understanding.
Open Datasets Yes For the image-language pre-training stage... using Single Image data from LLaVA-OneVision (Li et al., 2024a). For video-language finetuning, we utilize large-scale video-text pairs sourced from several publicly accessible databases. The video training data contains a subset of VideoChat2-IT (Li et al., 2024b), which includes TextVR (Wu et al., 2025), YouCook2 (Zhou et al., 2018), Kinetics-710 (Kay et al., 2017), NExT-QA (Xiao et al., 2021), CLEVRER (Yi et al., 2019), EgoQA (Fan, 2019), TGIF (Li et al., 2016), WebVidQA (Yang et al., 2021), ShareGPT4Video (Chen et al., 2024a), and MovieChat (Song et al., 2024) as the long-video complement.
Dataset Splits Yes For Video-MME (Fu et al., 2024), videos are officially split based on duration, which contains a subset of long videos ranging from 30 minutes to 1 hour.
Hardware Specification Yes Our model is trained on 64 NVIDIA H100 GPUs.
Software Dependencies No The paper mentions specific models and optimizers used (e.g., Qwen2-7B, Llama3.2-3B, AdamW) but does not provide specific version numbers for underlying software libraries or programming languages (e.g., Python, PyTorch).
Experiment Setup Yes In the image-language pre-training stage, we train the model for one epoch with a global batch size of 128. The learning rate is set to 1e-5, and the warmup rate is 0.03. The number of tokens per image is set to 576. For the video-language finetuning stage, we train the model for one epoch with a global batch size of 64. The learning rate is set to 1e-5, and the warmup rate is 0.03. The maximum number of tokens per frame is set to 144 (H_h = W_h = 12), while each might be reduced by our proposed adaptive compression approach to 64 (H_l = W_l = 8). The DINO threshold is set as 0.83 and the STC reduction threshold is θ = 0.75. The sliding window size is K = 8.
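The reported hyperparameters suggest a per-frame token budget driven by frame-to-frame similarity: frames whose DINO features are highly similar to previously kept content are spatially reduced (144 → 64 tokens). The sketch below is a hypothetical, simplified illustration of that budgeting rule only; the function name, the use of cosine similarity, and the single-anchor comparison are assumptions, not the paper's actual algorithm (which also involves the STC reduction threshold θ = 0.75 and a sliding window of K = 8 frames).

```python
# Hypothetical sketch of similarity-driven per-frame token budgeting.
# Assumptions (not from the paper): cosine similarity over per-frame
# feature vectors, and comparison against the most recent full-res frame.

DINO_THRESHOLD = 0.83   # similarity above this marks a frame as redundant
HIGH_RES_TOKENS = 144   # 12 x 12 tokens for a retained full-res frame
LOW_RES_TOKENS = 64     # 8 x 8 tokens for a spatially reduced frame


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)


def assign_token_budget(frame_feats):
    """Assign a token count to each frame from pairwise similarity.

    frame_feats: list of per-frame feature vectors (lists of floats).
    Returns a list of token budgets, one per frame.
    """
    budgets = [HIGH_RES_TOKENS]      # first frame always kept at full res
    anchor = frame_feats[0]          # most recent full-res frame
    for feat in frame_feats[1:]:
        if cosine(anchor, feat) > DINO_THRESHOLD:
            budgets.append(LOW_RES_TOKENS)   # redundant: spatially reduce
        else:
            budgets.append(HIGH_RES_TOKENS)  # novel content: keep full res
            anchor = feat                    # update the comparison anchor
    return budgets
```

Under this rule, a static shot costs 144 tokens once and 64 per repeated frame, which is how a long video's total token count stays within the LLM context budget.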