Fine-Grained Captioning of Long Videos through Scene Graph Consolidation

Authors: Sanghyeok Chu, Seonguk Seo, Bohyung Han

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section presents the effectiveness of the proposed approach through performance evaluation and analysis on both video captioning and video paragraph captioning datasets. Our evaluation consists of two zero-shot tasks: (1) video captioning, using the standard test splits of MSR-VTT (Xu et al., 2016) and MSVD (Chen & Dolan, 2011), and (2) video paragraph captioning, using the ae-val set of ActivityNet Captions (Krishna et al., 2017a), which contains longer videos with multiple events.
Researcher Affiliation | Collaboration | ¹ECE, Seoul National University, Korea. ²IPAI, Seoul National University, Korea. †Currently at Meta. Correspondence to: Bohyung Han <EMAIL>.
Pseudocode | Yes | Algorithm 1: Scene graph consolidation
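Algorithm 1 itself is not reproduced in this report. As a rough illustration of what scene graph consolidation involves, the sketch below merges per-frame scene graphs, represented as (subject, predicate, object) triples, into a single consolidated graph with observation counts. The triple representation and all names are assumptions for illustration, not the paper's actual algorithm.

```python
from collections import Counter

def consolidate_scene_graphs(frame_graphs):
    """Merge per-frame scene graphs into one consolidated graph.

    Each frame graph is an iterable of (subject, predicate, object)
    triples; identical triples observed in multiple frames collapse
    into a single edge whose count records how often it was seen.
    Illustrative sketch only, not the paper's Algorithm 1.
    """
    return Counter(triple for graph in frame_graphs for triple in graph)

frames = [
    [("man", "rides", "bike"), ("bike", "on", "road")],
    [("man", "rides", "bike"), ("dog", "runs beside", "man")],
]
merged = consolidate_scene_graphs(frames)
# ("man", "rides", "bike") appears in both frames, so its count is 2;
# the other two triples appear once each.
```

A frequency-weighted union like this keeps relations that persist across frames salient while still retaining one-off events, which is the kind of signal a graph-to-text decoder could exploit.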
Open Source Code | No | The paper mentions using open-source models such as "Mistral-7B-Instruct-v0.3" and "BERT-base" but does not provide any explicit statement or link for the authors' own implementation code or framework (SGVC).
Open Datasets | Yes | To construct this dataset, we curated approximately 2.5 million captions from diverse image captioning datasets, including MS-COCO (Chen et al., 2015), Flickr30k (Young et al., 2014), TextCaps (Sidorov et al., 2020), Visual Genome (Krishna et al., 2017b), and Visual Genome Paragraph Captions (Krause et al., 2017), to cover a broad range of visual scene contexts. To further enrich the dataset, we incorporated model-generated captions for videos in Kinetics-400 (Kay et al., 2017), where LLaVA-NeXT-7B (Liu et al., 2024) is applied to four uniformly sampled frames per video. Our evaluation consists of two zero-shot tasks: (1) video captioning, using the standard test splits of MSR-VTT (Xu et al., 2016) and MSVD (Chen & Dolan, 2011), and (2) video paragraph captioning, using the ae-val set of ActivityNet Captions (Krishna et al., 2017a), which contains longer videos with multiple events.
Dataset Splits | Yes | Our evaluation consists of two zero-shot tasks: (1) video captioning, using the standard test splits of MSR-VTT (Xu et al., 2016) and MSVD (Chen & Dolan, 2011), and (2) video paragraph captioning, using the ae-val set of ActivityNet Captions (Krishna et al., 2017a), which contains longer videos with multiple events.
Hardware Specification | Yes | Table 5 presents a detailed comparison of computational costs, in terms of average per-video inference time and peak GPU memory usage on a single NVIDIA A6000 GPU.
Software Dependencies | Yes | We use the open-source Mistral-7B-Instruct-v0.3 for all datasets. The BERT-base model (Devlin et al., 2019) is employed for our encoder... and only the decoder part of T5-base (Raffel et al., 2020) is used as our text decoder.
Experiment Setup | Yes | The graph-to-text model is trained on graph-text pairs constructed in Section 4.2 for 1K iterations with a batch size of 512. We employ the AdamW (Loshchilov, 2019) optimizer with a weight decay of 0.05, an initial learning rate of 0.0001, and linear warm-up for the first 1% of training steps. For video paragraph captioning, the model is further fine-tuned for 400 iterations on the subset of the constructed graph-text pairs obtained from the Visual Genome paragraph captioning dataset (Krause et al., 2017). For generating the final video caption, we apply a beam search with five beams, a maximum sequence length of 32, and a length penalty of 0.6. The video paragraph caption, which requires more detailed descriptions, is generated using a beam search with three beams, a maximum sequence length of 400, and a repetition penalty of 3.0.
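The quoted optimizer schedule (1K iterations, initial learning rate 0.0001, linear warm-up over the first 1% of steps) can be sketched as a simple per-step schedule function. Constant learning rate after warm-up is an assumption here, since the quoted setup does not state a decay rule.

```python
def learning_rate(step, base_lr=1e-4, total_steps=1000, warmup_frac=0.01):
    """Linear warm-up over the first warmup_frac of steps, then constant.

    Mirrors the quoted setup (1K iterations, initial lr 1e-4, linear
    warm-up for the first 1% of training steps). Holding the rate
    constant after warm-up is an assumption: no decay rule is quoted.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))  # 10 steps here
    return base_lr * min(1.0, step / warmup_steps)

# lr ramps linearly to 1e-4 over steps 1-10, then stays at 1e-4.
```

Such a schedule function would typically be passed to a framework's LR scheduler (e.g. via a lambda-based scheduler) alongside the AdamW hyperparameters quoted above.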