Video Summarization Using Denoising Diffusion Probabilistic Model

Authors: Zirui Shang, Yubo Zhu, Hongxi Li, Shuo Yang, Xinxiao Wu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on various datasets (TVSum, SumMe, and FPVSum) demonstrate the effectiveness of our method. [...] We conduct experiments on three benchmark datasets: TVSum (Song et al. 2015), SumMe (Gong et al. 2014) and FPVSum (Ho, Chiu, and Wang 2018). We compare our method with several state-of-the-art methods under different settings... Ablation Study: To perform an in-depth analysis of each individual component of our method, we conduct extensive ablation studies on TVSum, SumMe, and FPVSum.
Researcher Affiliation | Academia | (1) Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology, China; (2) Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University, China. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Training Process. Input: Video features with annotated importance scores. Output: Parameters θ of the noise predictor network. [...] Algorithm 2: Testing Process. Input: Video features. Output: Video summary.
Open Source Code | No | The paper does not explicitly provide any links to source code repositories, nor does it contain a clear statement about making the code publicly available.
Open Datasets | Yes | We conduct experiments on three benchmark datasets: TVSum (Song et al. 2015), SumMe (Gong et al. 2014) and FPVSum (Ho, Chiu, and Wang 2018).
Dataset Splits | Yes | Following the protocol in (Zhang et al. 2016), we build three settings for TVSum and SumMe: canonical, augmented and transfer. Canonical is the standard supervised learning setting that divides the dataset into a training set and a testing set. [...] Following the protocol in (Ho, Chiu, and Wang 2018), we build the FPVSum setting as a supplement to the transfer setting. We divide videos by point of view, placing the third-person videos from TVSum and SumMe in the training set and the first-person videos in the testing set. We perform validation experiments with 5 randomly created data splits and report the average results.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper states 'The training and testing processes are implemented using Pytorch.' but does not provide specific version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | In training, the maximum step of noise addition is set to 200, and the ground-truth importance scores are normalized to the range of -1 to 1 before adding Gaussian noise. In testing, the denoising step is set to 200, and the inverse process of training is conducted, which scales the generated importance scores to the range of 0 to 1. In addition, we use Adam as the optimizer and set the learning rate to 0.0002 and the weight decay to 0.01. The model is trained in 100 epochs, and a warmup strategy is used in the first 10 epochs.
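The training process quoted above (Algorithm 1) follows the standard DDPM recipe: sample a step t, corrupt the normalized ground-truth importance scores with Gaussian noise, and regress a network's output toward that noise. A minimal numpy sketch of the forward (noising) step is below; the linear beta schedule is an assumption, and the paper's noise-predictor network and its conditioning on video features are omitted.

```python
import numpy as np

T = 200                              # max noise-addition step (from the paper)
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule, not stated in the paper
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

def add_noise(x0, t, rng):
    """q(x_t | x_0): corrupt ground-truth scores x0 at diffusion step t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
# One video of 120 frames; scores normalized to [-1, 1] per the paper.
scores = rng.uniform(-1.0, 1.0, size=(1, 120))
t = int(rng.integers(0, T))
x_t, eps = add_noise(scores, t, rng)
# Training would minimize the MSE between the predictor's output and eps.
```

Algorithm 2 would then run the standard 200-step reverse (denoising) chain from pure noise, conditioned on the video features, to generate the importance scores.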
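The evaluation protocol averages results over "5 randomly created data splits". A hypothetical helper for that protocol is sketched below; the 80/20 train/test ratio is an assumption (the paper only says the canonical setting divides each dataset into a training set and a testing set).

```python
import random

def make_splits(video_ids, n_splits=5, train_frac=0.8, seed=42):
    """Create n_splits random train/test partitions of the given video IDs."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        ids = list(video_ids)
        rng.shuffle(ids)
        cut = int(len(ids) * train_frac)
        splits.append({"train": ids[:cut], "test": ids[cut:]})
    return splits

splits = make_splits([f"video_{i}" for i in range(50)])
```

Reported numbers would then be the mean of the evaluation metric over the five test sets.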
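The reported optimizer settings (Adam, learning rate 0.0002, weight decay 0.01, 100 epochs, warmup over the first 10 epochs) can be captured as a small schedule. The linear warmup shape, the constant rate afterwards, and the linear rescaling of generated scores are assumptions; the paper does not specify them.

```python
BASE_LR, EPOCHS, WARMUP = 2e-4, 100, 10   # values reported in the paper

def lr_at(epoch):
    """Learning rate at a 0-indexed epoch: assumed linear warmup, then constant."""
    if epoch < WARMUP:
        return BASE_LR * (epoch + 1) / WARMUP
    return BASE_LR

schedule = [lr_at(e) for e in range(EPOCHS)]

def to_unit_range(score):
    """Assumed linear map of a generated score from [-1, 1] back to [0, 1]."""
    return (score + 1.0) / 2.0
```

In a PyTorch implementation these values would go to `torch.optim.Adam(params, lr=2e-4, weight_decay=0.01)` with the warmup applied by a learning-rate scheduler.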