Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
Authors: Mingfei Han, Linjie Yang, Xiaojun Chang, Lina Yao, Heng Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present a new multi-shot video understanding benchmark Shot2Story with detailed shot-level captions, comprehensive video summaries and question-answering pairs. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks including single-shot video captioning, multi-shot video summarization, and multi-shot video question answering. Preliminary experiments show some challenges to generate a long and comprehensive video summary for multi-shot videos. Through extensive experiments, we show that: (1) the ASR text is critical to joint understanding of visual and audio content, (2) processing the video as a whole without the shot structure degenerates the model's capacity of understanding the multi-shot video, (3) the summarization model trained on our multi-shot summaries can be used on the proposed multi-shot QA benchmark and generalized to other datasets with longer durations (ActivityNet (Krishna et al., 2017)) and out-of-domain topics (MSR-VTT (Xu et al., 2016)), validating the quality of our annotated summaries. |
| Researcher Affiliation | Collaboration | Mingfei Han1,2,3,5, Linjie Yang1, Xiaojun Chang3,4, Lina Yao5, Heng Wang1 1Bytedance Inc. 2ReLER Lab, AAII, UTS 3Department of Computer Vision, MBZUAI 4University of Science and Technology of China 5Data61, CSIRO |
| Pseudocode | No | The paper describes methods and models in prose and through architectural diagrams (e.g., Figure 4), but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/bytedance/Shot2Story |
| Open Datasets | Yes | In this work, we propose a new benchmark Shot2Story for audio-visual understanding of multi-shot videos. We collect a dataset of 42,958 short videos where the average number of shots in each video is 4.4. For each video shot, we annotate a detailed textual description for the video frames and another textual description for the human speech. We also leverage a state-of-the-art large language model (LLM) GPT-4 (OpenAI) to generate a long textual video summary from the annotated clip descriptions, which are further verified by human annotators. The summary includes additional details such as transitions of different shots, progression of multiple events, and mapping of the subject identities in different scenes. An example of one annotated video is shown in Figure 1. https://github.com/bytedance/Shot2Story |
| Dataset Splits | Yes | For all the tasks described in this section, we follow the same training/validation/test split. Specifically, the numbers of videos for the training, validation, and test sets are 36,951, 1,982, and 4,025, respectively. |
| Hardware Specification | No | The paper describes the software models and components used (e.g., MiniGPT-4, VideoChat2, ViT-G/14, Q-Former, Vicuna v0-7B, UMT-L), but does not specify any particular hardware like GPU or CPU models used for experimentation. |
| Software Dependencies | Yes | For MiniGPT-4, we employ ViT-G/14 (Fang et al., 2022) and Q-Former (Li et al., 2023a) as visual encoder, and Vicuna v0-7B (Chiang et al., 2023) as the language model. ... For VideoChat2, we employ UMT-L (Li et al., 2023d) as backbone and load pretrained Q-Former and MLP from VideoChat2 (Li et al., 2023c). |
| Experiment Setup | Yes | During training, we adopt LoRA (Hu et al., 2021) and AdamW (Loshchilov & Hutter, 2017) with a learning rate of 8e-5. We train both models for 10 epochs with a batch size of 128 for single-shot video captioning. We finetune our video summarization models on the single-shot captioning models with a batch size of 32. |
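The split sizes and training hyperparameters quoted in the rows above can be cross-checked in one place. A minimal sketch follows; the dictionary field names are illustrative (not taken from the authors' code release), and only the numeric values come from the paper:

```python
# Hedged sketch: dataset statistics and training hyperparameters reported for
# Shot2Story, collected into a single config dict for sanity-checking.
# Field names are hypothetical; values are those quoted in the table above.
SHOT2STORY_CONFIG = {
    "dataset": {
        "total_videos": 42_958,
        "avg_shots_per_video": 4.4,
        "splits": {"train": 36_951, "val": 1_982, "test": 4_025},
    },
    "optimizer": {"name": "AdamW", "lr": 8e-5},        # trained with LoRA adapters
    "single_shot_captioning": {"epochs": 10, "batch_size": 128},
    "multi_shot_summarization": {"batch_size": 32},    # finetuned from the captioning model
}

# Sanity check: the reported splits account for every video in the dataset.
splits = SHOT2STORY_CONFIG["dataset"]["splits"]
assert sum(splits.values()) == SHOT2STORY_CONFIG["dataset"]["total_videos"]
```

Running the check confirms the three split sizes sum exactly to the stated dataset size of 42,958 videos, which supports the "Dataset Splits: Yes" assessment.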