Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval

Authors: Weijia Liu, Jiuxin Cao, Bo Miao, Zhiheng Fu, Xuelin Zhu, Jiawei Ge, Bo Liu, Mehwish Nasim, Ajmal Mian

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on Charades-STA and QVHighlights demonstrate that our approach surpasses state-of-the-art methods on all metrics. Furthermore, our denoise-then-retrieve paradigm is adaptable and can be seamlessly integrated into advanced VMR models to boost performance. ... Experiments on the Charades-STA and QVHighlights benchmarks show that our approach significantly outperforms existing state-of-the-art methods on all metrics. On Charades-STA, we surpass the nearest competitor MESM [Liu et al., 2024b] by 4.36 percentage points on the mAP@0.7 metric.
Researcher Affiliation | Academia | 1 Southeast University, 2 The University of Adelaide, 3 The Hong Kong Polytechnic University, 4 The University of Western Australia
Pseudocode | No | The paper describes the method using textual descriptions and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing its source code, nor does it provide a direct link to a code repository for the methodology described.
Open Datasets | Yes | Datasets. We validate the effectiveness of our method through extensive experiments on two popular datasets: QVHighlights and Charades-STA. QVHighlights [Lei et al., 2021] is designed for moment retrieval and highlight detection... Charades-STA [Sigurdsson et al., 2016] is focused on temporal sentence grounding, derived from the Charades dataset.
Dataset Splits | Yes | QVHighlights... We follow the original data splits, using the training set for model training and the test set for evaluation... Charades-STA [Sigurdsson et al., 2016]... It contains 12,408 training and 3,720 testing moment-sentence pairs...
Hardware Specification | Yes | All experiments are conducted on a single RTX 3090 GPU.
Software Dependencies | No | The paper mentions using "pre-extracted SlowFast and CLIP video features, and CLIP text features" and refers to models like "Mamba" and "Transformer", but it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow versions) that are critical for reproducibility.
Experiment Setup | Yes | Implementation Details. For a fair comparison, we use pre-extracted SlowFast and CLIP video features, and CLIP text features, for both datasets, provided by [Lin et al., 2023]. In our DRNet, all encoders constructed using CIO consist of three CIO layers, each with a hidden size of D = 1024. Loss weights are set as: λt = 2, λg_L1 = 5, λg_iou = 1, λb_L1 = 10, λb_iou = 1, and λc = 10 for both datasets. For QVHighlights, λintra and λinter are set to 2 each, while for Charades-STA, they are set to 1 and 0.5, respectively.
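As a rough illustration, the per-dataset loss weights quoted above can be collected in a single configuration. This is a hypothetical sketch: the key names (`lambda_t`, `lambda_g_l1`, etc.) and the `total_loss` helper are invented for readability; only the numeric values come from the paper's implementation details.

```python
# Hypothetical sketch of the DRNet loss-weight configuration reported in the paper.
# Key names are invented; only the numeric values are taken from the text.
LOSS_WEIGHTS = {
    "qvhighlights": {
        "lambda_t": 2, "lambda_g_l1": 5, "lambda_g_iou": 1,
        "lambda_b_l1": 10, "lambda_b_iou": 1, "lambda_c": 10,
        "lambda_intra": 2, "lambda_inter": 2,
    },
    "charades_sta": {
        "lambda_t": 2, "lambda_g_l1": 5, "lambda_g_iou": 1,
        "lambda_b_l1": 10, "lambda_b_iou": 1, "lambda_c": 10,
        "lambda_intra": 1, "lambda_inter": 0.5,
    },
}

def total_loss(losses: dict, dataset: str) -> float:
    """Weighted sum of the individual loss terms for the given dataset."""
    weights = LOSS_WEIGHTS[dataset]
    return sum(weights[name] * losses[name] for name in weights)
```

Only the `lambda_intra` / `lambda_inter` pair differs between the two datasets; the remaining six weights are shared.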