ALLVB: All-in-One Long Video Understanding Benchmark

Authors: Xichen Tan, Yuanjing Luo, Yunfan Ye, Fang Liu, Zhiping Cai

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have tested various mainstream MLLMs on ALLVB, and the results indicate that even the most advanced commercial models have significant room for improvement. This reflects the benchmark's challenging nature and demonstrates the substantial potential for development in long video understanding.
Researcher Affiliation | Academia | ¹College of Computer Science and Technology, National University of Defense Technology, Changsha, China; ²School of Design, Hunan University, Changsha, China
Pseudocode | No | The paper describes its methodology through a 'construction pipeline' diagram (Figure 1) and descriptive text, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a link to its dataset: 'Datasets https://huggingface.co/datasets/ALLVB/ALLVB'. However, it does not provide any concrete access to source code for the methodology described in the paper.
Open Datasets | Yes | Datasets https://huggingface.co/datasets/ALLVB/ALLVB
Dataset Splits | Yes | First, we divide ALLVB into a training set and a test set at a 9:1 ratio, containing 1,236 and 140 videos, respectively.
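The stated split (1,236 train / 140 test videos out of 1,376 total) can be reproduced count-wise with a small helper. This is only a sketch: the `split_allvb` name, the seed, and the seeded-shuffle strategy are assumptions, since the paper does not publish its split procedure.

```python
import random

def split_allvb(video_ids, n_train=1236, seed=0):
    """Split video IDs into train/test sets at the paper's 9:1 ratio.

    Hypothetical helper: the paper reports the split sizes (1,236 / 140)
    but not how the assignment was drawn, so the seeded shuffle here is
    an illustrative assumption.
    """
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:]

# ALLVB has 1,376 videos in total (1,236 + 140 per the paper).
train_ids, test_ids = split_allvb(range(1376))
print(len(train_ids), len(test_ids))  # 1236 140
```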
Hardware Specification | Yes | Open-source models are run locally on an NVIDIA 4090, while closed-source models are accessed via official API.
Software Dependencies | No | The paper mentions several MLLMs (e.g., Otter-I, LLaVA-1.6, GPT-4o, Claude 3.5 Sonnet) and LLM parameters (7B), but it does not specify software versions for programming languages, libraries, or other dependencies like Python, PyTorch, or CUDA versions.
Experiment Setup | Yes | LLM parameters are uniformly set to 7B and all tests are conducted in a 0-shot format. Among them... all models receive 16 frames uniformly sampled across the entire video, along with the corresponding subtitles for these frames. Open-source models are run locally... while closed-source models are accessed via official API. Open-source models answer each Q&A in the video individually, while closed-source models answer all Q&As in a single video at once. Open-source and closed-source models each use the same prompts to ensure fairness. The specific prompt details are as follows:...
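The "16 frames uniformly sampled across the entire video" step can be sketched as below. The `uniform_frame_indices` name and the bin-center strategy are assumptions, since the paper's exact sampling code is not released.

```python
def uniform_frame_indices(total_frames, n_samples=16):
    """Pick n_samples frame indices spread uniformly across a video.

    One common reading of "uniformly sampled": take the center of each of
    n_samples equal-width bins over the frame range. Illustrative sketch,
    not the authors' implementation.
    """
    if total_frames <= n_samples:
        return list(range(total_frames))
    bin_width = total_frames / n_samples
    return [int(bin_width * (i + 0.5)) for i in range(n_samples)]

# e.g. a video with 3,200 frames -> one index per 200-frame bin
print(uniform_frame_indices(3200))  # [100, 300, 500, ..., 3100]
```

For short clips with fewer frames than samples, the helper simply returns every frame rather than duplicating indices.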