Fine-Grained Captioning of Long Videos through Scene Graph Consolidation
Authors: Sanghyeok Chu, Seonguk Seo, Bohyung Han
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents the effectiveness of the proposed approach through performance evaluation and analysis on both video captioning and video paragraph captioning datasets. Our evaluation consists of two zero-shot tasks: (1) video captioning, using the standard test splits of MSR-VTT (Xu et al., 2016) and MSVD (Chen & Dolan, 2011), and (2) video paragraph captioning, using the ae-val set of ActivityNet Captions (Krishna et al., 2017a), which contains longer videos with multiple events. |
| Researcher Affiliation | Collaboration | ¹ECE, Seoul National University, Korea. ²IPAI, Seoul National University, Korea. Currently at Meta. Correspondence to: Bohyung Han <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Scene graph consolidation |
| Open Source Code | No | The paper mentions using open-source models like "open-source Mistral-7B-Instruct-v0.3" and "BERT-base" but does not provide any explicit statement or link for the authors' own implementation code or framework (SGVC). |
| Open Datasets | Yes | To construct this dataset, we curated approximately 2.5 million captions from diverse image captioning datasets, including MS-COCO (Chen et al., 2015), Flickr30k (Young et al., 2014), TextCaps (Sidorov et al., 2020), Visual Genome (Krishna et al., 2017b), and Visual Genome Paragraph Captions (Krause et al., 2017), to cover a broad range of visual scene contexts. To further enrich the dataset, we incorporated model-generated captions for videos in Kinetics-400 (Kay et al., 2017), where LLaVA-NeXT-7B (Liu et al., 2024) is applied to four uniformly sampled frames per video. Our evaluation consists of two zero-shot tasks: (1) video captioning, using the standard test splits of MSR-VTT (Xu et al., 2016) and MSVD (Chen & Dolan, 2011), and (2) video paragraph captioning, using the ae-val set of ActivityNet Captions (Krishna et al., 2017a), which contains longer videos with multiple events. |
| Dataset Splits | Yes | Our evaluation consists of two zero-shot tasks: (1) video captioning, using the standard test splits of MSR-VTT (Xu et al., 2016) and MSVD (Chen & Dolan, 2011), and (2) video paragraph captioning, using the ae-val set of ActivityNet Captions (Krishna et al., 2017a), which contains longer videos with multiple events. |
| Hardware Specification | Yes | Table 5 presents a detailed comparison of computational costs, in terms of average per-video inference time and peak GPU memory usage on a single NVIDIA A6000 GPU |
| Software Dependencies | Yes | We use the open-source Mistral-7B-Instruct-v0.3 for all datasets. The BERT-base model (Devlin et al., 2019) is employed for our encoder... and only the decoder part of T5-base (Raffel et al., 2020) is used as our text decoder. |
| Experiment Setup | Yes | The graph-to-text model is trained on graph-text pairs constructed in Section 4.2 for 1K iterations with a batch size of 512. We employ the AdamW (Loshchilov, 2019) optimizer with a weight decay of 0.05, an initial learning rate of 0.0001, and linear warm-up for the first 1% of training steps. For video paragraph captioning, the model is further fine-tuned for 400 iterations on the subset of the constructed graph-text pairs obtained from the Visual Genome paragraph captioning dataset (Krause et al., 2017). For generating the final video caption, we apply a beam search with five beams, a maximum sequence length of 32, and a length penalty of 0.6. The video paragraph caption, which requires more detailed descriptions, is generated using a beam search with three beams, a maximum sequence length of 400, and a repetition penalty of 3.0. |