Action-Agnostic Point-Level Supervision for Temporal Action Detection
Authors: Shuhei M. Yoshida, Takashi Shibata, Makoto Terao, Takayuki Okatani, Masashi Sugiyama
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on a variety of datasets (THUMOS'14, FineAction, GTEA, BEOID, and ActivityNet 1.3) demonstrate that the proposed approach is competitive with or outperforms prior methods for video-level and point-level supervision in terms of the trade-off between the annotation cost and detection performance. We also find that even training only with annotated frames can achieve competitive results with the previous studies. This suggests the inherent effectiveness of AAPL supervision. |
| Researcher Affiliation | Collaboration | Shuhei M. Yoshida1, Takashi Shibata1, Makoto Terao1, Takayuki Okatani2,3, Masashi Sugiyama3,4 1Visual Intelligence Research Laboratories, NEC Corporation, Kanagawa 211-8666, Japan 2Graduate School of Information Sciences, Tohoku University, Miyagi 980-8579, Japan 3RIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan 4Graduate School of Frontier Sciences, The University of Tokyo, Chiba 277-8561, Japan |
| Pseudocode | No | The paper describes the action detection model and training objectives in Section 3, but it does not present these in structured pseudocode or algorithm blocks. The methods are explained in paragraph text. |
| Open Source Code | Yes | Code: https://github.com/smy-nec/AAPL |
| Open Datasets | Yes | Extensive experiments on a variety of datasets (THUMOS'14, FineAction, GTEA, BEOID, and ActivityNet 1.3) demonstrate that the proposed approach is competitive with or outperforms prior methods for video-level and point-level supervision in terms of the trade-off between the annotation cost and detection performance. BEOID (Damen et al. 2014), GTEA (Fathi, Ren, and Rehg 2011), THUMOS'14 (Jiang et al. 2014), FineAction (Liu et al. 2022), and ActivityNet 1.3 (Heilbron et al. 2015). |
| Dataset Splits | Yes | We adopt the training-validation split from Ma et al. (2020). For THUMOS'14, following the convention (Wang et al. 2017; Nguyen et al. 2018), we use the validation set for training and the test set for evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using a 'modified version of the VGG Image Annotator (VIA) (Dutta, Gupta, and Zisserman 2016; Dutta and Zisserman 2019)' for annotation time measurement. However, it does not specify any software libraries or frameworks with version numbers (e.g., PyTorch, TensorFlow, Python) used for implementing the core methodology, which would be necessary for replication. |
| Experiment Setup | No | We defer details of implementation and hyperparameters to the extended version (Yoshida et al. 2024). |