SCVBench: A Benchmark with Multi-turn Dialogues for Story-Centric Video Understanding

Authors: Sisi You, Bowen Yuan, Bing-Kun Bao

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | SCVBench evaluates LVLMs through an event-ordering task decomposed into sub-questions leading to a final question, quantitatively measuring historical-dialogue exploration. We collected 1,253 final questions and 6,027 sub-question pairs from 925 videos, constructing continuous multi-turn dialogues. Experimental results show that while the closed-source GPT-4o outperforms other models, most open-source LVLMs struggle with story-centric video understanding.
Researcher Affiliation | Academia | Sisi You (1,3), Bowen Yuan (1), and Bing-Kun Bao (1,2, corresponding author): 1 Nanjing University of Posts and Telecommunications; 2 Pengcheng Laboratory; 3 State Key Laboratory of Tibetan Intelligence. EMAIL, EMAIL
Pseudocode | No | The paper describes the StoryCoT model architecture and its agents (Event Extraction Agent, Story-Centric Reasoning Agent) in detail, but it does not present this information in a structured pseudocode or algorithm-block format.
Open Source Code | Yes | Code can be accessed at https://github.com/yuanrr/SCVBench.
Open Datasets | Yes | "To tackle these problems, we propose the first benchmark for story-centric video understanding, named Story-Centric Video understanding Benchmark (SCVBench), which aims at comprehensively evaluating the high-level long-term understanding capabilities of LVLMs. ... Code can be accessed at https://github.com/yuanrr/SCVBench."
Dataset Splits | No | The paper describes the composition of the SCVBench dataset, including the number of videos, questions, and average lengths, but it does not specify any training, validation, or test splits for the dataset.
Hardware Specification | No | The paper states, "We deploy our benchmark on Lmms-Eval [Zhang et al., 2024a], an evaluation tool for diverse LVLMs," but it does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions using "Lmms-Eval [Zhang et al., 2024a]" as an evaluation tool and evaluating LVLMs, but it does not provide version numbers for any software components, libraries, or frameworks used.
Experiment Setup | Yes | "We report the results of the Qwen2-VL series with 128 input frames. For FQA, we use the instruction 'Select the best answer to the following multiple-choice question based on the video and the historical conversations', while for PQA, we use 'Select the best answer to the following multiple-choice question based on the video and the given story, as well as the historical conversations'. We employ the post-prompt 'Respond with only the letter (A, B, C, D, or E) of the correct option' to collect option responses directly."
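The Experiment Setup row quotes two task instructions (FQA and PQA) and a post-prompt used to elicit a single option letter. A minimal sketch of how such a multiple-choice prompt could be assembled is shown below; the instruction and post-prompt strings are quoted from the paper, while the function name, option lettering, and overall layout are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch of multiple-choice prompt assembly for SCVBench-style evaluation.
# Instruction/post-prompt text is quoted from the paper; build_prompt itself
# and the exact layout are hypothetical.

FQA_INSTRUCTION = (
    "Select the best answer to the following multiple-choice question "
    "based on the video and the historical conversations."
)
PQA_INSTRUCTION = (
    "Select the best answer to the following multiple-choice question "
    "based on the video and the given story, as well as the historical "
    "conversations."
)
POST_PROMPT = (
    "Respond with only the letter (A, B, C, D, or E) of the correct option."
)

def build_prompt(question: str, options: list[str], mode: str = "FQA") -> str:
    """Combine instruction, question, lettered options, and post-prompt."""
    instruction = FQA_INSTRUCTION if mode == "FQA" else PQA_INSTRUCTION
    lettered = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    )
    return f"{instruction}\n{question}\n{lettered}\n{POST_PROMPT}"

print(build_prompt(
    "Which event happened first?",
    ["The argument", "The reunion", "The phone call"],
))
```

Constraining the response format with a post-prompt like this makes option extraction a simple string match, which is how evaluation harnesses such as Lmms-Eval typically score multiple-choice outputs.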