Frame-Voyager: Learning to Query Frames for Video Large Language Models
Authors: Sicheng Yu, Chengkai Jin, Huanyu Wang, Zhenghao Chen, Sheng Jin, Zhongrong Zuo, Xiaolei Xu, Zhenbang Sun, Bingni Zhang, Jiawei Wu, Hao Zhang, Qianru Sun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we evaluate FRAME-VOYAGER on four Video Question Answering benchmarks by plugging it into two different Video-LLMs. The experimental results demonstrate that FRAME-VOYAGER achieves impressive results in all settings, highlighting its potential as a plug-and-play solution for Video-LLMs. |
| Researcher Affiliation | Collaboration | 1ByteDance 2Nanyang Technological University 3Singapore Management University |
| Pseudocode | No | The paper describes the model training and inference process using mathematical formulas and a diagram (Figure 2), but it does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include a direct link to a repository for the FRAME-VOYAGER methodology. It does reference and link to third-party tools or models used in experiments, but not its own implementation. |
| Open Datasets | Yes | We evaluate FRAME-VOYAGER on four widely-used Video Question Answering benchmarks, Video-MME (Fu et al., 2024a), MLVU (Zhou et al., 2024), NExT-QA (Xiao et al., 2021) and ActivityNet-QA (Yu et al., 2019). |
| Dataset Splits | Yes | We select the training set of NExT-QA (Xiao et al., 2021) and Video-ChatGPT (Maaz et al., 2024), on which we apply our proposed pipeline to create a training dataset for FRAME-VOYAGER. ... We evaluate FRAME-VOYAGER on four widely-adopted video benchmarks: Video-MME (Fu et al., 2024a), MLVU (Zhou et al., 2024), NExT-QA (Xiao et al., 2021) and ActivityNet-QA (Yu et al., 2019). The LMMs-Eval library (Li et al., 2024a) is used for evaluation, and accuracy is reported across all benchmarks. |
| Hardware Specification | Yes | VILA-8B is trained using DeepSpeed (Aminabadi et al., 2022) ZeRO2 with 8 H100 GPUs, while VILA-40B is trained using the ZeRO3 setting with 32 H100 GPUs. The batch size (with accumulation) is set to 64 and the learning rate is 1e-3. The training of FRAME-VOYAGER is conducted over 40 epochs, requiring approximately 8 hours for VILA-8B, whereas VILA-40B is trained over 20 epochs, taking around 20 hours. All model inferences are performed on 8 H100 GPUs. |
| Software Dependencies | No | The paper mentions several libraries and models such as the 'OpenCV2 library' (Appendix A) and 'DeepSpeed (Aminabadi et al., 2022) ZeRO2', but none of them are specified with precise version numbers (e.g., OpenCV 2.x.y or DeepSpeed 0.x.y), which is required for reproducibility. |
| Experiment Setup | Yes | The batch size (with accumulation) is set to 64 and the learning rate is 1e-3. The training of FRAME-VOYAGER is conducted over 40 epochs, requiring approximately 8 hours for VILA-8B, whereas VILA-40B is trained over 20 epochs, taking around 20 hours. |
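The reported setup (batch size 64 with accumulation, learning rate 1e-3, DeepSpeed ZeRO2 for VILA-8B) could be captured in a DeepSpeed configuration along these lines. This is a sketch, not the authors' config: the optimizer type, accumulation split, and precision setting are assumptions the paper does not specify.

```json
{
  "train_batch_size": 64,
  "gradient_accumulation_steps": 8,
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 1e-3 }
  },
  "zero_optimization": { "stage": 2 },
  "bf16": { "enabled": true }
}
```

With 8 H100 GPUs and an accumulation of 8, the effective per-GPU micro-batch size would be 1; the VILA-40B run would swap `"stage": 2` for `"stage": 3` per the paper's ZeRO3 setting.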