SCVBench: A Benchmark with Multi-turn Dialogues for Story-Centric Video Understanding
Authors: Sisi You, Bowen Yuan, Bing-Kun Bao
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | SCVBench evaluates LVLMs through an event ordering task decomposed into sub-questions leading to a final question, quantitatively measuring historical dialogue exploration. We collected 1,253 final questions and 6,027 sub-question pairs from 925 videos, constructing continuous multi-turn dialogues. Experimental results show that while closed-source GPT-4o outperforms other models, most open-source LVLMs struggle with story-centric video understanding. |
| Researcher Affiliation | Academia | Sisi You (1,3), Bowen Yuan (1), and Bing-Kun Bao (1,2): 1. Nanjing University of Posts and Telecommunications; 2. Pengcheng Laboratory; 3. State Key Laboratory of Tibetan Intelligence. EMAIL, EMAIL |
| Pseudocode | No | The paper describes the StoryCoT model architecture and its agents (Event Extraction Agent, Story-Centric Reasoning Agent) in detail, but it does not present this information in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Code can be accessed at https://github.com/yuanrr/SCVBench. |
| Open Datasets | Yes | To tackle these problems, we propose the first benchmark for story-centric video understanding, named Story-Centric Video understanding Benchmark (SCVBench), which aims at comprehensively evaluating the high-level long-term understanding capabilities of LVLMs. ... Code can be accessed at https://github.com/yuanrr/SCVBench. |
| Dataset Splits | No | The paper describes the composition of the SCVBench dataset, including the number of videos, questions, and average lengths, but it does not specify any training, validation, or test splits for the dataset. |
| Hardware Specification | No | The paper states, "We deploy our benchmark on Lmms-Eval [Zhang et al., 2024a], an evaluation tool for diverse LVLMs," but it does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using "Lmms-Eval [Zhang et al., 2024a]" as an evaluation tool and evaluating "LVLMs," but it does not provide specific version numbers for any software components, libraries, or frameworks used. |
| Experiment Setup | Yes | We report the results of the Qwen2-VL series with 128 input frames. For FQA, we use the instruction "Select the best answer to the following multiple-choice question based on the video and the historical conversations", while for PQA, we use "Select the best answer to the following multiple-choice question based on the video and the given story, as well as the historical conversations". We employ the post-prompt "Respond with only the letter (A, B, C, D, or E) of the correct option" to collect option responses directly. |
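The prompting protocol quoted in the Experiment Setup row can be sketched as a small helper that assembles the FQA/PQA instruction, the question, the lettered options, and the post-prompt. This is a hypothetical illustration of the described setup, assuming a `build_prompt` helper and argument names that do not appear in the released code.

```python
def build_prompt(question, options, mode="FQA", story=None):
    """Assemble a multiple-choice prompt per the paper's description.

    mode: "FQA" (final question) or "PQA" (question given the story).
    All names here are illustrative, not from the SCVBench repository.
    """
    if mode == "FQA":
        instruction = ("Select the best answer to the following multiple-choice "
                       "question based on the video and the historical conversations.")
    else:  # PQA additionally conditions on the given story
        instruction = ("Select the best answer to the following multiple-choice "
                       "question based on the video and the given story, "
                       "as well as the historical conversations.")
    # Post-prompt used to collect the option letter directly.
    post_prompt = ("Respond with only the letter (A, B, C, D, or E) "
                   "of the correct option.")
    letters = "ABCDE"
    option_lines = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    parts = [instruction]
    if mode == "PQA" and story:
        parts.append(f"Story: {story}")
    parts += [question, option_lines, post_prompt]
    return "\n".join(parts)
```

A prompt built this way would then be passed, together with the sampled video frames and the dialogue history, to the LVLM under evaluation via Lmms-Eval.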