Improving LLM Video Understanding with 16 Frames Per Second
Authors: Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. |
| Researcher Affiliation | Collaboration | ¹Tsinghua University, ²ByteDance. Correspondence to: Chao Zhang <EMAIL>. |
| Pseudocode | No | The paper describes the model architecture and components using mathematical equations and textual explanations (e.g., Section 3.1 Model Architecture) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We will release the source code, model checkpoints, and data at https://github.com/bytedance/F-16. |
| Open Datasets | Yes | The training data of general videos are the same as LLaVA-Video (Zhang et al., 2024b), including LLaVA-Video-178K (Zhang et al., 2024b), LLaVA-Hound (Zhang et al., 2024a), NExT-QA (Xiao et al., 2021), ActivityNet-QA (Yu et al., 2019) and Perception Test (Patraucean et al., 2024). Besides generic video understanding, we also fine-tune the model on high-speed sports videos. Videos for gymnastics, diving, basketball, and football are collected for further tuning, where FineGym (Shao et al., 2020), Diving48 (Li et al., 2018), SoccerNet (Giancola et al., 2018), and NBA video clips are used respectively. |
| Dataset Splits | Yes | Regarding the FineGym (Shao et al., 2020) data for gymnastics understanding, we sample 90% of the clips as the training set and the remaining 10% as the test set, ensuring that the duration of videos in the training and test sets is balanced. Regarding the Diving48 (Li et al., 2018) data for diving understanding, we use its official data split. |
| Hardware Specification | Yes | F-16 is trained for 1 epoch on the training data using 128 H100 GPUs, with a learning rate set to 2×10⁻⁵. We fine-tune F-16 using 64 H100 GPUs for 5 epochs, with a learning rate set to 2×10⁻⁵. |
| Software Dependencies | No | The paper mentions using the 'LLaVA-OV model of LLaVA-OneVision (Li et al., 2024)', 'Qwen2-7B (Yang et al., 2024) as the backbone LLM', 'SigLIP (Zhai et al., 2023) as the visual encoder', and 'LoRA (Hu et al., 2022)'. These are models or techniques, not specific software dependencies with version numbers (e.g., Python version, PyTorch version, CUDA version). |
| Experiment Setup | Yes | F-16 is trained for 1 epoch on the training data using 128 H100 GPUs, with a learning rate set to 2×10⁻⁵. For further tuning the model on high-speed sports data, LoRA (Hu et al., 2022) is adapted to the LLM and serves as the only trainable module in this stage. The rank and the scaling factor of LoRA are set to 128 and 2.0, respectively. We fine-tune F-16 using 64 H100 GPUs for 5 epochs, with a learning rate set to 2×10⁻⁵. |
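The Dataset Splits row reports a 90/10 FineGym split that keeps video durations balanced between the two sets. The paper does not describe how this balancing is done; one plausible way is to sort clips by duration and assign every tenth clip to the test set, so both splits span the full duration range. The sketch below (with a hypothetical `duration_balanced_split` helper and synthetic clip data) illustrates that idea, not the authors' actual procedure.

```python
import random

def duration_balanced_split(clips, test_frac=0.1, seed=0):
    """Split clips 90/10 while keeping the duration distributions similar.

    Sorts clips by duration, then sends every k-th clip (k = 1 / test_frac,
    at a random phase) to the test set, so both splits cover the whole
    duration range roughly uniformly.
    """
    rng = random.Random(seed)
    ordered = sorted(clips, key=lambda c: c["duration"])
    k = round(1 / test_frac)
    offset = rng.randrange(k)  # random phase so the split is not always the same clips
    train, test = [], []
    for i, clip in enumerate(ordered):
        (test if i % k == offset else train).append(clip)
    return train, test

# Synthetic clips with varied durations, for illustration only.
clips = [{"id": i, "duration": 5 + (i % 37)} for i in range(1000)]
train, test = duration_balanced_split(clips)

mean = lambda xs: sum(c["duration"] for c in xs) / len(xs)
print(len(train), len(test))          # 900 100
print(abs(mean(train) - mean(test)))  # small: the splits have similar mean durations
```

A plain random split would also work in expectation, but the stratified assignment guarantees the balance even for small datasets with skewed duration distributions.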
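The Experiment Setup row quotes a LoRA rank of 128 and a scaling factor of 2.0. In common LoRA implementations (e.g., Hugging Face PEFT), the scaling factor is `lora_alpha / r`, so a scaling of 2.0 with rank 128 would correspond to `lora_alpha = 256`. The config fragment below is a hedged sketch of how those two hyperparameters might be expressed; the target modules and dropout are assumptions, not values from the paper.

```python
from peft import LoraConfig

# Sketch only: rank 128 and scaling 2.0 are quoted from the paper;
# everything else (target modules, dropout) is an assumption.
lora_config = LoraConfig(
    r=128,                 # quoted LoRA rank
    lora_alpha=256,        # scaling = lora_alpha / r = 256 / 128 = 2.0, the quoted factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    lora_dropout=0.0,      # assumed
    task_type="CAUSAL_LM",
)
```

Since LoRA is stated to be the only trainable module in this stage, the backbone LLM and visual encoder would be frozen and only the low-rank adapter matrices updated during the 5-epoch sports fine-tuning.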