DiffDVC: Accurate Event Detection for Dense Video Captioning via Diffusion Models

Authors: Wei Chen, Jianwei Niu, Xuefeng Liu, Zhendong Wang, Shaojie Tang, Guogang Zhu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments on the ActivityNet-1.3, ActivityNet Captions, and YouCook2 datasets show DiffDVC achieving superior performance. To explore DiffDVC in detail, we conduct ablation studies using the ActivityNet Captions and YouCook2 datasets.
Researcher Affiliation | Academia | 1) State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China; 2) Zhongguancun Laboratory, Beijing, China; 3) Zhengzhou University Research Institute of Industrial Technology, Zhengzhou University, Zhengzhou, China; 4) Department of Management Science and Systems, University at Buffalo, Buffalo, New York, United States
Pseudocode | No | The paper describes its methods using mathematical equations and text, but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code, nor a link to a code repository for the described methodology.
Open Datasets | Yes | We perform experiments using the ActivityNet-1.3 (Caba Heilbron et al. 2015), ActivityNet Captions (Krishna et al. 2017), and YouCook2 (Zhou, Xu, and Corso 2018) datasets. The paper also gradually scales down ground-truth object boxes in the COCO validation dataset (Lin et al. 2014) and ground-truth event proposals in the THUMOS14 validation dataset (Idrees et al. 2017).
Dataset Splits | Yes | ActivityNet-1.3 has 10,024 training, 4,926 validation, and 5,044 testing videos. ActivityNet Captions includes 10,009 training, 4,917 validation, and 5,044 testing videos. YouCook2 contains 1,333 training, 457 validation, and 210 testing videos. Because the testing sets of these datasets are inaccessible, DiffDVC is evaluated on the validation sets, following previous methods.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used to run the experiments.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify versions for any software libraries, frameworks, or programming languages.
Experiment Setup | Yes | For ActivityNet-1.3 and ActivityNet Captions, we configure the number of event proposals or queries N to be 15, while for YouCook2, N is set to 100. During inference, the number of sample steps in DDIM is set to 1. We train word embeddings with 512 dimensions from scratch. The signal scaling factor is 1.0. We apply the Adam optimizer, and the learning rate is initialized to 5e-5.
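For a reproduction attempt, the hyperparameters reported above could be collected in one place. The sketch below is illustrative only: the dictionary structure, key names, and the `num_queries` helper are assumptions, since the paper reports only the values and names no framework.

```python
# Hyperparameters reported for DiffDVC, gathered into a single config.
# Key names and structure are illustrative; the paper gives only the values.
CONFIG = {
    "num_queries": {             # number of event proposals/queries N
        "activitynet": 15,       # ActivityNet-1.3 and ActivityNet Captions
        "youcook2": 100,         # YouCook2
    },
    "ddim_sample_steps": 1,      # DDIM sampling steps at inference
    "word_embed_dim": 512,       # word embeddings trained from scratch
    "signal_scale": 1.0,         # diffusion signal scaling factor
    "optimizer": "adam",
    "init_lr": 5e-5,             # initial learning rate
}


def num_queries(dataset: str) -> int:
    """Look up the query count N for a dataset key (hypothetical helper)."""
    return CONFIG["num_queries"][dataset]
```

A reproduction script could read this config when building the proposal head and the optimizer, keeping dataset-specific values (here, N) out of the model code.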