Fine-Grained Captioning of Long Videos through Scene Graph Consolidation

Authors: Sanghyeok Chu, Seonguk Seo, Bohyung Han

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section presents the effectiveness of the proposed approach through performance evaluation and analysis on both video captioning and video paragraph captioning datasets. Our evaluation consists of two zero-shot tasks: (1) video captioning, using the standard test splits of MSR-VTT (Xu et al., 2016) and MSVD (Chen & Dolan, 2011), and (2) video paragraph captioning, using the ae-val set of ActivityNet Captions (Krishna et al., 2017a), which contains longer videos with multiple events.
Researcher Affiliation | Collaboration | ¹ECE, Seoul National University, Korea. ²IPAI, Seoul National University, Korea. †Currently at Meta. Correspondence to: Bohyung Han <EMAIL>.
Pseudocode | Yes | Algorithm 1: Scene graph consolidation
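Algorithm 1 itself is not reproduced in this report. As a rough illustration of what scene graph consolidation involves, the sketch below merges per-frame scene graphs, represented as (subject, predicate, object) triples, into a single consolidated graph with observation counts. The triple representation and all names are assumptions for illustration, not the paper's actual algorithm.

```python
from collections import Counter

def consolidate_scene_graphs(frame_graphs):
    """Merge per-frame scene graphs into one consolidated graph.

    Each frame graph is an iterable of (subject, predicate, object)
    triples; identical triples observed in multiple frames collapse
    into a single edge whose count records how often it was seen.
    Illustrative sketch only, not the paper's Algorithm 1.
    """
    return Counter(triple for graph in frame_graphs for triple in graph)

frames = [
    [("man", "rides", "bike"), ("bike", "on", "road")],
    [("man", "rides", "bike"), ("dog", "runs beside", "man")],
]
merged = consolidate_scene_graphs(frames)
# ("man", "rides", "bike") appears in both frames, so its count is 2;
# the other two triples appear once each.
```

A frequency-weighted union like this keeps relations that persist across frames salient while still retaining one-off events, which is the kind of signal a graph-to-text decoder could exploit.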
Open Source Code | No | The paper mentions using open-source models such as "Mistral-7B-Instruct-v0.3" and "BERT-base" but does not provide any explicit statement or link for the authors' own implementation code or framework (SGVC).
Open Datasets | Yes | To construct this dataset, we curated approximately 2.5 million captions from diverse image captioning datasets, including MS-COCO (Chen et al., 2015), Flickr30k (Young et al., 2014), TextCaps (Sidorov et al., 2020), Visual Genome (Krishna et al., 2017b), and Visual Genome Paragraph Captions (Krause et al., 2017), to cover a broad range of visual scene contexts. To further enrich the dataset, we incorporated model-generated captions for videos in Kinetics-400 (Kay et al., 2017), where LLaVA-NeXT-7B (Liu et al., 2024) is applied to four uniformly sampled frames per video. Our evaluation consists of two zero-shot tasks: (1) video captioning, using the standard test splits of MSR-VTT (Xu et al., 2016) and MSVD (Chen & Dolan, 2011), and (2) video paragraph captioning, using the ae-val set of ActivityNet Captions (Krishna et al., 2017a), which contains longer videos with multiple events.
Dataset Splits | Yes | Our evaluation consists of two zero-shot tasks: (1) video captioning, using the standard test splits of MSR-VTT (Xu et al., 2016) and MSVD (Chen & Dolan, 2011), and (2) video paragraph captioning, using the ae-val set of ActivityNet Captions (Krishna et al., 2017a), which contains longer videos with multiple events.
Hardware Specification | Yes | Table 5 presents a detailed comparison of computational costs, in terms of average per-video inference time and peak GPU memory usage on a single NVIDIA A6000 GPU.
Software Dependencies | Yes | We use the open-source Mistral-7B-Instruct-v0.3 for all datasets. The BERT-base model (Devlin et al., 2019) is employed for our encoder... and only the decoder part of T5-base (Raffel et al., 2020) is used as our text decoder.
Experiment Setup | Yes | The graph-to-text model is trained on graph-text pairs constructed in Section 4.2 for 1K iterations with a batch size of 512. We employ the AdamW (Loshchilov, 2019) optimizer with a weight decay of 0.05, an initial learning rate of 0.0001, and linear warm-up for the first 1% of training steps. For video paragraph captioning, the model is further fine-tuned for 400 iterations on the subset of the constructed graph-text pairs obtained from the Visual Genome paragraph captioning dataset (Krause et al., 2017). For generating the final video caption, we apply a beam search with five beams, a maximum sequence length of 32, and a length penalty of 0.6. The video paragraph caption, which requires more detailed descriptions, is generated using a beam search with three beams, a maximum sequence length of 400, and a repetition penalty of 3.0.
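The quoted optimizer schedule (1K iterations, initial learning rate 0.0001, linear warm-up over the first 1% of steps) can be sketched as a simple per-step schedule function. Constant learning rate after warm-up is an assumption here, since the quoted setup does not state a decay rule.

```python
def learning_rate(step, base_lr=1e-4, total_steps=1000, warmup_frac=0.01):
    """Linear warm-up over the first warmup_frac of steps, then constant.

    Mirrors the quoted setup (1K iterations, initial lr 1e-4, linear
    warm-up for the first 1% of training steps). Holding the rate
    constant after warm-up is an assumption: no decay rule is quoted.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))  # 10 steps here
    return base_lr * min(1.0, step / warmup_steps)

# lr ramps linearly to 1e-4 over steps 1-10, then stays at 1e-4.
```

Such a schedule function would typically be passed to a framework's LR scheduler (e.g. via a lambda-based scheduler) alongside the AdamW hyperparameters quoted above.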