Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Authors: Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate VILAMP's superior performance across five video understanding benchmarks, particularly on long-form content. Notably, VILAMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance. |
| Researcher Affiliation | Collaboration | ¹Gaoling School of Artificial Intelligence, Renmin University of China; ²Ant Group; ³School of Artificial Intelligence, Wuhan University. Correspondence to: Wei Wu <EMAIL>, Rui Yan <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (Differential Keyframe Selection). Data: V = {f_1, f_2, ..., f_N}: video frames; Q: query; K: maximum number of keyframes; τ: similarity threshold. Result: 𝒦: keyframe set. 1: f̂_1, f̂_2, ..., f̂_N = sorted(V, key = lambda f_n: R_f(f_n, Q)) // sort all frames by query relevance in descending order; 2: 𝒦 = {f̂_1} // initialize 𝒦 with the most relevant frame; 3: for n ← 1 to N do; 4: if \|𝒦\| < K and T_f(f̂_n, 𝒦) < τ then; 5: 𝒦.add(f̂_n) |
| Open Source Code | Yes | Code and model are available at https://github.com/steven-ccq/ViLAMP. |
| Open Datasets | Yes | We thoroughly evaluate VILAMP on five video understanding benchmarks spanning diverse temporal scales and tasks (see Appendix A.1 for details): LVBench (Wang et al., 2024b) for long-term decision-making, EgoSchema (Mangalam et al., 2023) for natural scenario understanding, LongVideoBench (Wu et al., 2024) for referred reasoning, MLVU (Zhou et al., 2024) for multi-task question answering, and Video-MME (Fu et al., 2024a) for comprehensive video understanding. |
| Dataset Splits | Yes | To rigorously address the above limitations, we develop VideoNIAH, a more challenging variant of the NIAH task specifically designed for video content. We construct VideoNIAH by sampling long-form videos from Video-MME to create haystacks ranging from 2K to 10K frames (at 1 FPS). We then insert needle video clips (30–120 seconds) within these haystacks at random positions. Both the needle videos and their corresponding query-answer pairs are sampled from the original Video-MME dataset. Models must not only locate these needles but also comprehend their temporal content to answer targeted queries. We create 3K test cases distributed across five haystack lengths (2K, 4K, 6K, 8K, and 10K frames), with carefully balanced question types to ensure comprehensive evaluation. Testing was conducted in a non-subtitled setting. |
| Hardware Specification | Yes | Notably, VILAMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance. |
| Software Dependencies | No | VILAMP employs the same architecture as LLaVA-OneVision (Li et al., 2024a), utilizing SigLIP-so400m as the visual encoder, a two-layer MLP as the vision-language connector, and Qwen2-7B as the language model. We process frames at 384×384 resolution and empirically set τ in Alg. 1 to 0.85, the keyframe count K to 32, λ in Eq. 5 to 1, and α in Eq. 10 to 10⁻² unless otherwise specified. We employ CLIP-ViT-B-32 as E_f(·) in Eq. 3. |
| Experiment Setup | Yes | VILAMP employs the same architecture as LLaVA-OneVision (Li et al., 2024a), utilizing SigLIP-so400m as the visual encoder, a two-layer MLP as the vision-language connector, and Qwen2-7B as the language model. We process frames at 384×384 resolution and empirically set τ in Alg. 1 to 0.85, the keyframe count K to 32, λ in Eq. 5 to 1, and α in Eq. 10 to 10⁻² unless otherwise specified. We employ CLIP-ViT-B-32 as E_f(·) in Eq. 3. More training details are provided in Appendix A.3. Appendix A.3 (Training Parameters): As mentioned in B.2, we configure α = 10⁻² and τ = 0.85 to achieve balanced effectiveness. During training, we set the learning rate to 2×10⁻⁶ for the visual encoder and 10⁻⁵ for the remaining components. We use AdamW as the optimizer. The optimization follows a cosine learning rate scheduler with a warmup ratio of 0.03. We set the batch size to 1 and the gradient accumulation steps to 4. We train the model for 1 epoch, which takes approximately two weeks on 32 NVIDIA A100 GPUs. |
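The pseudocode quoted above (Algorithm 1, Differential Keyframe Selection) can be sketched in a few lines of Python. This is a hedged illustration, not the authors' released implementation: it assumes the relevance score R_f and the redundancy score T_f are both cosine similarities over frame embeddings (the paper uses CLIP-ViT-B-32 as E_f(·); here embeddings are passed in as plain NumPy arrays, and T_f is taken as the maximum similarity to any already-selected keyframe).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_keyframes(frame_embs, query_emb, max_keyframes=32, tau=0.85):
    """Sketch of Algorithm 1 (Differential Keyframe Selection).

    frame_embs: (N, d) array of frame embeddings, e.g. from a CLIP encoder.
    query_emb:  (d,) query embedding.
    Returns the indices of selected keyframes, at most `max_keyframes`.
    Assumed here: R_f and T_f are cosine similarities (hypothetical choice).
    """
    # Step 1: sort all frames by query relevance R_f, descending.
    relevance = [cosine(f, query_emb) for f in frame_embs]
    order = np.argsort(relevance)[::-1]
    # Step 2: initialize the keyframe set with the most relevant frame.
    selected = [int(order[0])]
    # Steps 3-5: greedily add frames whose similarity T_f to the
    # already-selected keyframes stays below the threshold tau.
    for idx in order[1:]:
        if len(selected) >= max_keyframes:
            break
        t_f = max(cosine(frame_embs[idx], frame_embs[j]) for j in selected)
        if t_f < tau:
            selected.append(int(idx))
    return selected
```

With the paper's reported defaults (τ = 0.85, K = 32), near-duplicate frames are skipped even when they rank highly on query relevance, which is the "differential" part of the selection.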