Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning

Authors: Shiping Ge, Qiang Chen, Zhiwei Jiang, Yafeng Yin, Liu Qin, Ziyao Chen, Qing Gu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the public datasets demonstrate that our method outperforms existing weakly-supervised methods and achieves competitive results compared to fully-supervised methods. ... Extensive experiments conducted on the public datasets demonstrate the effectiveness of our method and each of its components. ... Table 1 presents a comparison of our proposed method with existing fully-supervised and weakly-supervised methods on the ActivityNet Captions dataset. ... In this section, we conduct an ablation study to investigate the contribution of each component in our proposed method.
Researcher Affiliation | Collaboration | 1State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; 2Tencent WeChat, Guangzhou, China
Pseudocode | No | The paper describes the proposed method in text and through diagrams (Figure 2, Figure 3), but does not contain any explicit sections or figures labeled as 'Pseudocode' or 'Algorithm' with structured, code-like steps.
Open Source Code | Yes | Code: https://github.com/ShipingGe/ILCACM
Open Datasets | Yes | We evaluate our proposed method and baseline methods on the ActivityNet Captions dataset. ... Besides, we also conduct experiments on ViTT (Huang et al. 2020) and YouCook2 (Zhou, Xu, and Corso 2018), which are two DVC datasets and have never been used for the evaluation of WSDVC methods.
Dataset Splits | Yes | We evaluate our proposed method and baseline methods on the ActivityNet Captions dataset. ... To make a fair comparison with previous methods, we use the evaluation tool provided by the 2018 ActivityNet Captions Challenge, which measures the capability to localize and describe events.
Hardware Specification | Yes | We train the model for 10 epochs for the captioning stage and 10 epochs for the localizing stage on 8 Tesla V100 GPUs with a batch size of 8.
Software Dependencies | No | The paper mentions using the 'Distilled-GPT2 model' and the 'AdamW optimizer' but does not provide specific version numbers for these or other key software components, libraries, or frameworks (e.g., Python version, PyTorch version, etc.).
Experiment Setup | Yes | We set the number of transformer blocks in the video-level temporal encoder and cross-modal localizer to 6 and 1, respectively. The number of attention heads, dimension of hidden states, and feed-forward layers are set to 12, 768, and 2,048 in all transformer blocks, respectively. We utilize the Distilled-GPT2 model for the construction of our caption decoder model. For the training of the model, we adopt the AdamW (Loshchilov and Hutter 2017) optimizer with an initial learning rate of 1e-4 with a warmup rate of 0.1. We train the model for 10 epochs for the captioning stage and 10 epochs for the localizing stage on 8 Tesla V100 GPUs with a batch size of 8.
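The reported setup can be collected into a single reproduction config. This is a hedged sketch only: the paper states an initial learning rate of 1e-4 and a warmup rate of 0.1, but does not specify the schedule shape, so the linear-warmup-then-constant schedule below (and the `lr_at_step` helper name) is an assumption, not the authors' implementation.

```python
# Hyperparameters as reported in the paper's experiment setup.
# The schedule shape after warmup is an assumption (constant LR).
CONFIG = {
    "temporal_encoder_blocks": 6,      # video-level temporal encoder
    "cross_modal_localizer_blocks": 1, # cross-modal localizer
    "attention_heads": 12,
    "hidden_dim": 768,
    "ffn_dim": 2048,
    "caption_decoder": "Distilled-GPT2",
    "optimizer": "AdamW",
    "base_lr": 1e-4,
    "warmup_ratio": 0.1,
    "epochs_per_stage": 10,            # captioning stage and localizing stage
    "batch_size": 8,
    "gpus": "8x Tesla V100",
}

def lr_at_step(step: int, total_steps: int,
               base_lr: float = CONFIG["base_lr"],
               warmup_ratio: float = CONFIG["warmup_ratio"]) -> float:
    """Linear warmup to base_lr over the first warmup_ratio of training,
    then constant (post-warmup decay is not specified in the paper)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

For example, with 1,000 total steps the learning rate ramps over the first 100 steps and then holds at 1e-4.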