Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
Authors: Mingfei Han, Linjie Yang, Xiaojun Chang, Lina Yao, Heng Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present a new multi-shot video understanding benchmark Shot2Story with detailed shot-level captions, comprehensive video summaries and question-answering pairs. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks including single-shot video captioning, multi-shot video summarization, and multi-shot video question answering. Preliminary experiments show some challenges to generate a long and comprehensive video summary for multi-shot videos. Through extensive experiments, we show that: (1) the ASR text is critical to joint understanding of visual and audio content, (2) processing the video as a whole without the shot structure degenerates the model's capacity of understanding the multi-shot video, (3) the summarization model trained on our multi-shot summaries can be used on the proposed multi-shot QA benchmark and generalized to other datasets with longer durations (ActivityNet (Krishna et al., 2017)) and out-of-domain topics (MSR-VTT (Xu et al., 2016)), validating the quality of our annotated summaries. |
| Researcher Affiliation | Collaboration | Mingfei Han1,2,3,5, Linjie Yang1, Xiaojun Chang3,4, Lina Yao5, Heng Wang1 1Bytedance Inc. 2ReLER Lab, AAII, UTS 3Department of Computer Vision, MBZUAI 4University of Science and Technology of China 5Data61, CSIRO |
| Pseudocode | No | The paper describes methods and models in prose and through architectural diagrams (e.g., Figure 4), but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/bytedance/Shot2Story |
| Open Datasets | Yes | In this work, we propose a new benchmark Shot2Story for audio-visual understanding of multi-shot videos. We collect a dataset of 42,958 short videos where the average number of shots in each video is 4.4. For each video shot, we annotate a detailed textual description for the video frames and another textual description for the human speech. We also leverage a state-of-the-art large language model (LLM) GPT-4 (OpenAI) to generate a long textual video summary from the annotated clip descriptions, which are further verified by human annotators. The summary includes additional details such as transitions of different shots, progression of multiple events, and mapping of the subject identities in different scenes. An example of one annotated video is shown in Figure 1. https://github.com/bytedance/Shot2Story |
| Dataset Splits | Yes | For all the tasks described in this section, we follow the same training/validation/test split. Specifically, the numbers of videos for the training, validation, and test sets are 36,951, 1,982, and 4,025, respectively. |
| Hardware Specification | No | The paper describes the software models and components used (e.g., MiniGPT-4, VideoChat2, ViT-G/14, Q-Former, Vicuna v0-7B, UMT-L), but does not specify any particular hardware like GPU or CPU models used for experimentation. |
| Software Dependencies | Yes | For MiniGPT-4, we employ ViT-G/14 (Fang et al., 2022) and Q-Former (Li et al., 2023a) as visual encoder, and Vicuna v0-7B (Chiang et al., 2023) as the language model. ... For VideoChat2, we employ UMT-L (Li et al., 2023d) as backbone and load pretrained Q-Former and MLP from VideoChat2 (Li et al., 2023c). |
| Experiment Setup | Yes | During training, we adopt LoRA (Hu et al., 2021) and AdamW (Loshchilov & Hutter, 2017) with a learning rate of 8e-5. We train both models for 10 epochs with a batch size of 128 for single-shot video captioning. We finetune our video summarization models on the single-shot captioning models with a batch size of 32. |
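The split sizes and training hyperparameters quoted in the rows above can be cross-checked in one place. A minimal sketch follows; the dictionary field names are illustrative (not taken from the authors' code release), and only the numeric values come from the paper:

```python
# Hedged sketch: dataset statistics and training hyperparameters reported for
# Shot2Story, collected into a single config dict for sanity-checking.
# Field names are hypothetical; values are those quoted in the table above.
SHOT2STORY_CONFIG = {
    "dataset": {
        "total_videos": 42_958,
        "avg_shots_per_video": 4.4,
        "splits": {"train": 36_951, "val": 1_982, "test": 4_025},
    },
    "optimizer": {"name": "AdamW", "lr": 8e-5},        # trained with LoRA adapters
    "single_shot_captioning": {"epochs": 10, "batch_size": 128},
    "multi_shot_summarization": {"batch_size": 32},    # finetuned from the captioning model
}

# Sanity check: the reported splits account for every video in the dataset.
splits = SHOT2STORY_CONFIG["dataset"]["splits"]
assert sum(splits.values()) == SHOT2STORY_CONFIG["dataset"]["total_videos"]
```

Running the check confirms the three split sizes sum exactly to the stated dataset size of 42,958 videos, which supports the "Dataset Splits: Yes" assessment.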