Frame-Voyager: Learning to Query Frames for Video Large Language Models
Authors: Sicheng Yu, Chengkai Jin, Huanyu Wang, Zhenghao Chen, Sheng Jin, Zhongrong Zuo, Xiaolei Xu, Zhenbang Sun, Bingni Zhang, Jiawei Wu, Hao Zhang, Qianru Sun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we evaluate FRAME-VOYAGER on four Video Question Answering benchmarks by plugging it into two different Video-LLMs. The experimental results demonstrate that FRAME-VOYAGER achieves impressive results in all settings, highlighting its potential as a plug-and-play solution for Video-LLMs. |
| Researcher Affiliation | Collaboration | 1ByteDance 2Nanyang Technological University 3Singapore Management University |
| Pseudocode | No | The paper describes the model training and inference process using mathematical formulas and a diagram (Figure 2), but it does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include a direct link to a repository for the FRAME-VOYAGER methodology. It does reference and link to third-party tools or models used in experiments, but not its own implementation. |
| Open Datasets | Yes | We evaluate FRAME-VOYAGER on four widely-used Video Question Answering benchmarks, Video-MME (Fu et al., 2024a), MLVU (Zhou et al., 2024), NExT-QA (Xiao et al., 2021) and ActivityNet-QA (Yu et al., 2019). |
| Dataset Splits | Yes | We select the training set of NExT-QA (Xiao et al., 2021) and Video-ChatGPT (Maaz et al., 2024), on which we apply our proposed pipeline to create a training dataset for FRAME-VOYAGER. ... We evaluate FRAME-VOYAGER on four widely-adopted video benchmarks: Video-MME (Fu et al., 2024a), MLVU (Zhou et al., 2024), NExT-QA (Xiao et al., 2021) and ActivityNet-QA (Yu et al., 2019). The LMMs-Eval library (Li et al., 2024a) is used for evaluation, and accuracy is reported across all benchmarks. |
| Hardware Specification | Yes | VILA-8B is trained using DeepSpeed (Aminabadi et al., 2022) ZeRO2 with 8 H100 GPUs, while VILA-40B is trained using the ZeRO3 setting with 32 H100 GPUs. The batch size (with accumulation) is set to 64 and the learning rate is 1e-3. The training of FRAME-VOYAGER is conducted over 40 epochs, requiring approximately 8 hours for VILA-8B, whereas VILA-40B is trained over 20 epochs, taking around 20 hours. All model inferences are performed on 8 H100 GPUs. |
| Software Dependencies | No | The paper mentions several libraries and models such as the 'OpenCV2 library' (Appendix A) and 'DeepSpeed (Aminabadi et al., 2022) ZeRO2', but none of them are specified with precise version numbers (e.g., OpenCV 2.x.y or DeepSpeed 0.x.y), which is required for reproducibility. |
| Experiment Setup | Yes | The batch size (with accumulation) is set to 64 and the learning rate is 1e-3. The training of FRAME-VOYAGER is conducted over 40 epochs, requiring approximately 8 hours for VILA-8B, whereas VILA-40B is trained over 20 epochs, taking around 20 hours. |
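The reported setup (batch size 64 with accumulation, learning rate 1e-3, DeepSpeed ZeRO2 for VILA-8B) could be captured in a DeepSpeed configuration along these lines. This is a sketch, not the authors' config: the optimizer type, accumulation split, and precision setting are assumptions the paper does not specify.

```json
{
  "train_batch_size": 64,
  "gradient_accumulation_steps": 8,
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 1e-3 }
  },
  "zero_optimization": { "stage": 2 },
  "bf16": { "enabled": true }
}
```

With 8 H100 GPUs and an accumulation of 8, the effective per-GPU micro-batch size would be 1; the VILA-40B run would swap `"stage": 2` for `"stage": 3` per the paper's ZeRO3 setting.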