Improving LLM Video Understanding with 16 Frames Per Second
Authors: Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. |
| Researcher Affiliation | Collaboration | ¹Tsinghua University, ²ByteDance. Correspondence to: Chao Zhang <EMAIL>. |
| Pseudocode | No | The paper describes the model architecture and components using mathematical equations and textual explanations (e.g., Section 3.1 Model Architecture) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We will release the source code, model checkpoints, and data at https://github.com/bytedance/F-16. |
| Open Datasets | Yes | The training data of general videos are the same as LLaVA-Video (Zhang et al., 2024b), including LLaVA-Video-178K (Zhang et al., 2024b), LLaVA-Hound (Zhang et al., 2024a), NExT-QA (Xiao et al., 2021), ActivityNet-QA (Yu et al., 2019) and Perception Test (Patraucean et al., 2024). Besides generic video understanding, we also fine-tune the model on high-speed sports videos. Videos for gymnastics, diving, basketball, and football are collected for further tuning, where FineGym (Shao et al., 2020), Diving48 (Li et al., 2018), SoccerNet (Giancola et al., 2018), and NBA video clips are used respectively. |
| Dataset Splits | Yes | Regarding the FineGym (Shao et al., 2020) data for gymnastics understanding, we sample 90% of the clips as the training set and the remaining 10% as the test set, ensuring that the duration of videos in the training and test sets is balanced. Regarding the Diving48 (Li et al., 2018) data for diving understanding, we use its official data split. |
| Hardware Specification | Yes | F-16 is trained for 1 epoch on the training data using 128 H100 GPUs, with a learning rate set to 2×10⁻⁵. We fine-tune F-16 using 64 H100 GPUs for 5 epochs, with a learning rate set to 2×10⁻⁵. |
| Software Dependencies | No | The paper mentions using the 'LLaVA-OV model of LLaVA-OneVision (Li et al., 2024)', 'Qwen2-7B (Yang et al., 2024) as the backbone LLM', 'SigLIP (Zhai et al., 2023) as the visual encoder', and 'LoRA (Hu et al., 2022)'. These are models or techniques, not specific software dependencies with version numbers (e.g., Python version, PyTorch version, CUDA version). |
| Experiment Setup | Yes | F-16 is trained for 1 epoch on the training data using 128 H100 GPUs, with a learning rate set to 2×10⁻⁵. For further tuning the model on high-speed sports data, LoRA (Hu et al., 2022) is adapted to the LLM and serves as the only trainable module in this stage. The rank and the scaling factor of LoRA are set to 128 and 2.0, respectively. We fine-tune F-16 using 64 H100 GPUs for 5 epochs, with a learning rate set to 2×10⁻⁵. |
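The Dataset Splits row reports a 90/10 FineGym split that keeps video durations balanced between the two sets. The paper does not describe how this balancing is done; one plausible way is to sort clips by duration and assign every tenth clip to the test set, so both splits span the full duration range. The sketch below (with a hypothetical `duration_balanced_split` helper and synthetic clip data) illustrates that idea, not the authors' actual procedure.

```python
import random

def duration_balanced_split(clips, test_frac=0.1, seed=0):
    """Split clips 90/10 while keeping the duration distributions similar.

    Sorts clips by duration, then sends every k-th clip (k = 1 / test_frac,
    at a random phase) to the test set, so both splits cover the whole
    duration range roughly uniformly.
    """
    rng = random.Random(seed)
    ordered = sorted(clips, key=lambda c: c["duration"])
    k = round(1 / test_frac)
    offset = rng.randrange(k)  # random phase so the split is not always the same clips
    train, test = [], []
    for i, clip in enumerate(ordered):
        (test if i % k == offset else train).append(clip)
    return train, test

# Synthetic clips with varied durations, for illustration only.
clips = [{"id": i, "duration": 5 + (i % 37)} for i in range(1000)]
train, test = duration_balanced_split(clips)

mean = lambda xs: sum(c["duration"] for c in xs) / len(xs)
print(len(train), len(test))          # 900 100
print(abs(mean(train) - mean(test)))  # small: the splits have similar mean durations
```

A plain random split would also work in expectation, but the stratified assignment guarantees the balance even for small datasets with skewed duration distributions.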
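The Experiment Setup row quotes a LoRA rank of 128 and a scaling factor of 2.0. In common LoRA implementations (e.g., Hugging Face PEFT), the scaling factor is `lora_alpha / r`, so a scaling of 2.0 with rank 128 would correspond to `lora_alpha = 256`. The config fragment below is a hedged sketch of how those two hyperparameters might be expressed; the target modules and dropout are assumptions, not values from the paper.

```python
from peft import LoraConfig

# Sketch only: rank 128 and scaling 2.0 are quoted from the paper;
# everything else (target modules, dropout) is an assumption.
lora_config = LoraConfig(
    r=128,                 # quoted LoRA rank
    lora_alpha=256,        # scaling = lora_alpha / r = 256 / 128 = 2.0, the quoted factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    lora_dropout=0.0,      # assumed
    task_type="CAUSAL_LM",
)
```

Since LoRA is stated to be the only trainable module in this stage, the backbone LLM and visual encoder would be frozen and only the low-rank adapter matrices updated during the 5-epoch sports fine-tuning.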