Multi-Scale Contrastive Learning for Video Temporal Grounding
Authors: Thong Thanh Nguyen, Yi Bin, Xiaobao Wu, Zhiyuan Hu, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of our framework for not only long-form but also short-form video grounding. ... Experiments: To validate the effectiveness, we conduct extensive experiments against recent methods for temporal grounding. We also perform an ablation study to investigate each component. ... Ablation Study: We conduct extensive experiments on TACoS to study the influence of the design choices. |
| Researcher Affiliation | Academia | 1 Institute of Data Science (IDS), National University of Singapore, Singapore; 2 Tongji University, China; 3 Nanyang Technological University (NTU), Singapore |
| Pseudocode | No | The paper describes the methodology using mathematical equations and textual descriptions, for example under the 'Cross-scale Contrastive Learning' section. However, there are no clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Following previous works, we work on five challenging datasets of temporal grounding, which belong to two main categories, i.e. 1) Long videos, many queries (Ego4D-NLQ (Grauman et al. 2022), MAD (Soldan et al. 2022), and TACoS (Regneri et al. 2013)) and 2) Short videos, few queries (ActivityNet-Captions (Krishna et al. 2017) and Charades-STA (Sigurdsson et al. 2016)). |
| Dataset Splits | No | The paper mentions the specific datasets used and refers to evaluation metrics like 'R@K, tIoU', but it does not provide explicit details about how these datasets were split into training, validation, and test sets. It mentions a 'video-centric sampling approach (Mu, Mo, and Li 2024)' for Ego4D-NLQ, but this is not a general dataset split description. |
| Hardware Specification | No | The paper discusses various pre-trained video and textual features (e.g., SlowFast, BERT, EgoVLP, CLIP), but it does not specify any hardware details like GPU models, CPU types, or memory used to run the experiments. |
| Software Dependencies | No | The paper mentions using several pre-trained models and features (e.g., BERT, SlowFast, CLIP, C3D, GloVe), but it does not specify any software dependencies with version numbers (e.g., Python version, PyTorch version, specific library versions) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | For Ego4D-NLQ, we use pre-trained 1) SlowFast video features (Feichtenhofer et al. 2019) with BERT textual features (Devlin et al. 2018), and 2) EgoVLP video and textual features (Lin et al. 2022). For testing, we report R@{1, 5}, tIoU = {0.3, 0.5}. ... For both within-scale and cross-scale contrastive learning implementation, we keep the size of the negative sample set N(l) in every level l to be equal to the size of the positive video clips P(l) that correspond to the target video moments. Based upon validation and fair comparison with previous methods, we use ρref = ρwithin = ρcross = 1.0. |
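The R@{1, 5}, tIoU = {0.3, 0.5} metric quoted in the setup row is the standard temporal-grounding recall: a query counts as hit if any of the top-K predicted moments overlaps the ground-truth moment with temporal IoU at or above the threshold. A minimal sketch (function names are ours; moments are assumed to be `(start, end)` tuples, predictions ranked by score):

```python
def t_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals, in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, ground_truths, k=1, threshold=0.5):
    """R@K, tIoU=threshold: fraction of queries whose ground-truth moment
    is matched (tIoU >= threshold) by at least one of the top-K predictions."""
    hits = 0
    for preds, gt in zip(predictions, ground_truths):
        if any(t_iou(p, gt) >= threshold for p in preds[:k]):
            hits += 1
    return hits / len(ground_truths)
```

For example, with one query whose best prediction (0, 10) covers a ground truth of (0, 9), the tIoU is 9/10 = 0.9, so the query counts toward R@1 at both 0.3 and 0.5 thresholds.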