SCVBench: A Benchmark with Multi-turn Dialogues for Story-Centric Video Understanding

Authors: Sisi You, Bowen Yuan, Bing-Kun Bao

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | SCVBench evaluates LVLMs through an event-ordering task decomposed into sub-questions leading to a final question, quantitatively measuring historical-dialogue exploration. We collected 1,253 final questions and 6,027 sub-question pairs from 925 videos, constructing continuous multi-turn dialogues. Experimental results show that while the closed-source GPT-4o outperforms other models, most open-source LVLMs struggle with story-centric video understanding.
Researcher Affiliation | Academia | Sisi You (1,3), Bowen Yuan (1), and Bing-Kun Bao (1,2, corresponding author): 1 Nanjing University of Posts and Telecommunications; 2 Pengcheng Laboratory; 3 State Key Laboratory of Tibetan Intelligence. EMAIL, EMAIL
Pseudocode | No | The paper describes the StoryCoT model architecture and its agents (Event Extraction Agent, Story-Centric Reasoning Agent) in detail, but it does not present this information in a structured pseudocode or algorithm-block format.
Open Source Code | Yes | Code can be accessed at https://github.com/yuanrr/SCVBench.
Open Datasets | Yes | "To tackle these problems, we propose the first benchmark for story-centric video understanding, named Story-Centric Video understanding Benchmark (SCVBench), which aims at comprehensively evaluating the high-level long-term understanding capabilities of LVLMs. ... Code can be accessed at https://github.com/yuanrr/SCVBench."
Dataset Splits | No | The paper describes the composition of the SCVBench dataset, including the number of videos, questions, and average lengths, but it does not specify any training, validation, or test splits for the dataset.
Hardware Specification | No | The paper states, "We deploy our benchmark on Lmms-Eval [Zhang et al., 2024a], an evaluation tool for diverse LVLMs," but it does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions using "Lmms-Eval [Zhang et al., 2024a]" as an evaluation tool and evaluating LVLMs, but it does not provide version numbers for any software components, libraries, or frameworks used.
Experiment Setup | Yes | "We report the results of the Qwen2-VL series with 128 input frames. For FQA, we use the instruction 'Select the best answer to the following multiple-choice question based on the video and the historical conversations', while for PQA, we use 'Select the best answer to the following multiple-choice question based on the video and the given story, as well as the historical conversations'. We employ the post-prompt 'Respond with only the letter (A, B, C, D, or E) of the correct option' to collect option responses directly."
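The Experiment Setup row quotes two task instructions (FQA and PQA) and a post-prompt used to elicit a single option letter. A minimal sketch of how such a multiple-choice prompt could be assembled is shown below; the instruction and post-prompt strings are quoted from the paper, while the function name, option lettering, and overall layout are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch of multiple-choice prompt assembly for SCVBench-style evaluation.
# Instruction/post-prompt text is quoted from the paper; build_prompt itself
# and the exact layout are hypothetical.

FQA_INSTRUCTION = (
    "Select the best answer to the following multiple-choice question "
    "based on the video and the historical conversations."
)
PQA_INSTRUCTION = (
    "Select the best answer to the following multiple-choice question "
    "based on the video and the given story, as well as the historical "
    "conversations."
)
POST_PROMPT = (
    "Respond with only the letter (A, B, C, D, or E) of the correct option."
)

def build_prompt(question: str, options: list[str], mode: str = "FQA") -> str:
    """Combine instruction, question, lettered options, and post-prompt."""
    instruction = FQA_INSTRUCTION if mode == "FQA" else PQA_INSTRUCTION
    lettered = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    )
    return f"{instruction}\n{question}\n{lettered}\n{POST_PROMPT}"

print(build_prompt(
    "Which event happened first?",
    ["The argument", "The reunion", "The phone call"],
))
```

Constraining the response format with a post-prompt like this makes option extraction a simple string match, which is how evaluation harnesses such as Lmms-Eval typically score multiple-choice outputs.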