Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval
Authors: Weijia Liu, Jiuxin Cao, Bo Miao, Zhiheng Fu, Xuelin Zhu, Jiawei Ge, Bo Liu, Mehwish Nasim, Ajmal Mian
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on Charades-STA and QVHighlights demonstrate that our approach surpasses state-of-the-art methods on all metrics. Furthermore, our denoise-then-retrieve paradigm is adaptable and can be seamlessly integrated into advanced VMR models to boost performance. ... Experiments on the Charades-STA and QVHighlights benchmarks show that our approach significantly outperforms existing state-of-the-art methods on all metrics. On Charades-STA, we surpass the nearest competitor MESM [Liu et al., 2024b] by 4.36 percentage points on the mAP@0.7 metric. |
| Researcher Affiliation | Academia | 1Southeast University 2The University of Adelaide 3The Hong Kong Polytechnic University 4The University of Western Australia |
| Pseudocode | No | The paper describes the method using textual descriptions and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its source code, nor does it provide a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | Datasets. We validate the effectiveness of our method through extensive experiments on two popular datasets: QVHighlights and Charades-STA. QVHighlights [Lei et al., 2021] is designed for moment retrieval and highlight detection... Charades-STA [Sigurdsson et al., 2016] is focused on temporal sentence grounding, derived from the Charades dataset. |
| Dataset Splits | Yes | QVHighlights... We follow the original data splits, using the training set for model training and the test set for evaluation... Charades-STA [Sigurdsson et al., 2016]... It contains 12,408 training and 3,720 testing moment-sentence pairs... |
| Hardware Specification | Yes | All experiments are conducted on a single RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using "pre-extracted SlowFast and CLIP video features, and CLIP text features" and refers to models like "Mamba" and "Transformer", but it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow versions) that are critical for reproducibility. |
| Experiment Setup | Yes | Implementation Details. For a fair comparison, we use pre-extracted SlowFast and CLIP video features, and CLIP text features, for both datasets, provided by [Lin et al., 2023]. In our DRNet, all encoders constructed using CIO consist of three CIO layers, each with a hidden size of D = 1024. Loss weights are set as: λt = 2, λg_L1 = 5, λg_iou = 1, λb_L1 = 10, λb_iou = 1, and λc = 10 for both datasets. For QVHighlights, λintra and λinter are set to 2 each, while for Charades-STA, they are set to 1 and 0.5, respectively. |
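The reported hyperparameters can be captured in a small configuration sketch. This is a minimal, hypothetical illustration of how the paper's loss weights might combine the individual loss terms into a single training objective; the term names follow the paper's λ notation, but the `total_loss` helper and the placeholder loss values are assumptions, not the authors' code.

```python
# Loss weights as reported in the paper's implementation details.
# λintra/λinter differ per dataset: 2/2 for QVHighlights, 1/0.5 for Charades-STA.
QVHIGHLIGHTS_WEIGHTS = {
    "t": 2.0,       # λt
    "g_l1": 5.0,    # λg_L1
    "g_iou": 1.0,   # λg_iou
    "b_l1": 10.0,   # λb_L1
    "b_iou": 1.0,   # λb_iou
    "c": 10.0,      # λc
    "intra": 2.0,   # λintra
    "inter": 2.0,   # λinter
}

CHARADES_STA_WEIGHTS = {**QVHIGHLIGHTS_WEIGHTS, "intra": 1.0, "inter": 0.5}


def total_loss(losses: dict, weights: dict) -> float:
    """Weighted sum over whichever loss terms are present (hypothetical combiner)."""
    return sum(weights[name] * value for name, value in losses.items())
```

For example, `total_loss({"t": 1.0, "c": 1.0}, QVHIGHLIGHTS_WEIGHTS)` would give 2·1.0 + 10·1.0 = 12.0 under this sketch.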