Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding
Authors: Akash Kumar, Zsolt Kira, Yogesh S Rawat
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of CoSPaL on three benchmark WSTVG datasets, achieving a 3.9% absolute improvement on VidSTG and a 7.9% improvement on HCSTVG-v1. 4 EXPERIMENT DETAILS — Datasets: For our experiments, we show results on three benchmark datasets, namely VidSTG (Zhang et al., 2020), HCSTVG-v1 (Tang et al., 2020) and HCSTVG-v2 (Tang et al., 2020). 5 RESULTS AND ANALYSIS — Comparison with weakly-supervised baselines: In Tables 2 and 3, we compare our approach with previous weakly-supervised approaches. 5.1 ABLATION STUDY — Effectiveness of TPG sub-modules: First, we look into our base model, TPG. From Table 4, we observe that the temporal grounding module plays a significant role. |
| Researcher Affiliation | Academia | Akash Kumar University of Central Florida EMAIL Zsolt Kira Georgia Institute of Technology EMAIL Yogesh Singh Rawat University of Central Florida EMAIL |
| Pseudocode | No | The paper describes the methodology and components (TPG, CRG, SPS) in detail using natural language and refers to figures like Figure 3 for an overview, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project Page: https://akash2907.github.io/cospal_webpage Huggingface link: https://huggingface.co/akashkumar29/cospal |
| Open Datasets | Yes | Datasets: For our experiments, we show results on three benchmark datasets, namely VidSTG (Zhang et al., 2020), HCSTVG-v1 (Tang et al., 2020) and HCSTVG-v2 (Tang et al., 2020). |
| Dataset Splits | Yes | VidSTG comprises 99,943 video-sentence pairs, of which 44,808 are declarative and 55,135 are interrogative. The total number of videos is 10,303, covering 80 different object categories. Training, validation and test contain 80,684, 8,956 and 10,303 distinct video-sentence pairs respectively, and the number of unique videos for each split is 5,436, 602 and 732 respectively. HCSTVG-v1 contains 4,500 videos for training and 1,160 videos for testing, with sentence descriptions referring to human attributes/actions. HCSTVG-v2 extends version 1 to 16,544 videos, divided into 10,131 training, 2,000 validation and 4,413 testing videos. |
| Hardware Specification | Yes | E.3 COMPUTE REQUIREMENTS — For our work, we run our models on a single 16 GB Tesla V100 GPU with a batch size of 32. |
| Software Dependencies | No | The paper mentions using G-DINO, I3D, and BERT models, but does not specify version numbers for general software dependencies like Python, PyTorch, or other libraries used for implementation. |
| Experiment Setup | Yes | E.2 ARCHITECTURE HYPERPARAMS SETTINGS — Weakly-GDINO: For weakly-GDINO, we input the whole text as the query and frames from the video as image input. Frames are sampled with a stride of 5. ... Tubelet Phrase Grounding: It contains two modules, spatial and temporal grounding. The batch size is set to 32. In the spatial grounding module, we use the Adam optimizer with a learning rate of 1e-4. The maximum number of words in the text is set to 25 for HCSTVG. The temporal grounding module uses the Adam optimizer with a learning rate of 4e-4. ... Self-paced Scene understanding: In SPS curriculum-based learning, we set an upper bound on the number of object tubelets per video. The first stage is bounded to videos with up to 4 tubelets, and the bound is incremented by 3 in each of the two subsequent stages. In the last stage, the bound is 10 tubelets, which covers all the videos. |
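The staged tubelet bounds quoted above (up to 4 tubelets, then +3 per stage over three stages) can be sketched as a simple curriculum filter. This is an illustrative sketch only; the function names and per-video tubelet counts are assumptions, not the authors' implementation.

```python
# Sketch of the SPS curriculum staging described in the paper's setup:
# stage bounds on tubelets-per-video grow 4 -> 7 -> 10 over three stages,
# so each stage trains on a progressively larger subset of videos.

def stage_bounds(start=4, step=3, num_stages=3):
    """Upper bound on tubelets per video for each curriculum stage."""
    return [start + step * i for i in range(num_stages)]

def videos_for_stage(tubelet_counts, bound):
    """Indices of videos whose tubelet count fits within the stage bound."""
    return [i for i, n in enumerate(tubelet_counts) if n <= bound]

# Hypothetical tubelet counts for six videos.
counts = [2, 5, 9, 3, 10, 7]
for bound in stage_bounds():  # bounds 4, 7, 10
    subset = videos_for_stage(counts, bound)
    print(f"bound={bound}: train on videos {subset}")
```

With the final bound of 10, the subset covers every video, matching the paper's statement that the last stage contains all the videos.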