Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning

Authors: Shiping Ge, Qiang Chen, Zhiwei Jiang, Yafeng Yin, Liu Qin, Ziyao Chen, Qing Gu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the public datasets demonstrate that our method outperforms existing weakly-supervised methods and achieves competitive results compared to fully-supervised methods. ... Extensive experiments conducted on the public datasets demonstrate the effectiveness of our method and each of its components. ... Table 1 presents a comparison of our proposed method with existing fully-supervised and weakly-supervised methods on the ActivityNet Captions dataset. ... In this section, we conduct an ablation study to investigate the contribution of each component in our proposed method.
Researcher Affiliation | Collaboration | 1State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; 2Tencent WeChat, Guangzhou, China
Pseudocode | No | The paper describes the proposed method in text and through diagrams (Figure 2, Figure 3), but does not contain any explicit sections or figures labeled as 'Pseudocode' or 'Algorithm' with structured, code-like steps.
Open Source Code | Yes | Code: https://github.com/ShipingGe/ILCACM
Open Datasets | Yes | We evaluate our proposed method and baseline methods on the ActivityNet Captions dataset. ... Besides, we also conduct experiments on ViTT (Huang et al. 2020) and YouCook2 (Zhou, Xu, and Corso 2018), which are two DVC datasets and have never been used for the evaluation of WSDVC methods.
Dataset Splits | Yes | We evaluate our proposed method and baseline methods on the ActivityNet Captions dataset. ... To make a fair comparison with previous methods, we use the evaluation tool provided by the 2018 ActivityNet Captions Challenge, which measures the capability to localize and describe events.
Hardware Specification | Yes | We train the model for 10 epochs for the captioning stage and 10 epochs for the localizing stage on 8 Tesla V100 GPUs with a batch size of 8.
Software Dependencies | No | The paper mentions using the 'Distilled-GPT2 model' and the 'AdamW optimizer' but does not provide specific version numbers for these or other key software components, libraries, or frameworks (e.g., Python version, PyTorch version, etc.).
Experiment Setup | Yes | We set the number of transformer blocks in the video-level temporal encoder and cross-modal localizer to 6 and 1, respectively. The number of attention heads, dimension of hidden states, and feed-forward layers are set to 12, 768, and 2,048 in all transformer blocks, respectively. We utilize the Distilled-GPT2 model for the construction of our caption decoder model. For the training of the model, we adopt the AdamW (Loshchilov and Hutter 2017) optimizer with an initial learning rate of 1e-4 with a warmup rate of 0.1. We train the model for 10 epochs for the captioning stage and 10 epochs for the localizing stage on 8 Tesla V100 GPUs with a batch size of 8.
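The reported setup can be collected into a single reproduction config. This is a hedged sketch only: the paper states an initial learning rate of 1e-4 and a warmup rate of 0.1, but does not specify the schedule shape, so the linear-warmup-then-constant schedule below (and the `lr_at_step` helper name) is an assumption, not the authors' implementation.

```python
# Hyperparameters as reported in the paper's experiment setup.
# The schedule shape after warmup is an assumption (constant LR).
CONFIG = {
    "temporal_encoder_blocks": 6,      # video-level temporal encoder
    "cross_modal_localizer_blocks": 1, # cross-modal localizer
    "attention_heads": 12,
    "hidden_dim": 768,
    "ffn_dim": 2048,
    "caption_decoder": "Distilled-GPT2",
    "optimizer": "AdamW",
    "base_lr": 1e-4,
    "warmup_ratio": 0.1,
    "epochs_per_stage": 10,            # captioning stage and localizing stage
    "batch_size": 8,
    "gpus": "8x Tesla V100",
}

def lr_at_step(step: int, total_steps: int,
               base_lr: float = CONFIG["base_lr"],
               warmup_ratio: float = CONFIG["warmup_ratio"]) -> float:
    """Linear warmup to base_lr over the first warmup_ratio of training,
    then constant (post-warmup decay is not specified in the paper)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

For example, with 1,000 total steps the learning rate ramps over the first 100 steps and then holds at 1e-4.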