T2L: Efficient Zero-Shot Action Recognition with Temporal Token Learning

Authors: Shahzad Ahmad, Sukalpa Chanda, Yogesh S Rawat

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We perform extensive experiments on nine different benchmark datasets, thoroughly evaluating T2L for zero-shot learning and base-to-novel video action recognition, and also demonstrating its potential for few-shot generalization. Impressively, with merely 5.2 million learnable parameters, T2L can be efficiently trained on a single GPU (with 25x less learnable parameters, 3x reduction in GFLOPs, and 4x improvement in throughput when compared with prior best model), outperforming existing approaches in several evaluations.
Researcher Affiliation Academia Shahzad Ahmad, Department of Computer Science and Communication, Østfold University College, Norway; Sukalpa Chanda, Department of Computer Science and Communication, Østfold University College, Norway; Yogesh S. Rawat, Center for Research in Computer Vision, University of Central Florida
Pseudocode No The paper describes the methodology in narrative text and mathematical equations in Section 3, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code available here https://github.com/Shahzadnit/T2L.
Open Datasets Yes We evaluate our proposed method on nine different video action recognition benchmarks: Kinetics-400 Kay et al. (2017), Kinetics-600 Carreira et al. (2018), HMDB-51 Kuehne et al. (2011), UCF-101 Soomro et al. (2012), Something Something V2 (SSv2) Goyal et al. (2017), UCF-DS Schiappa et al. (2023), UCF-101-P, HMDB-51-P Schiappa et al. (2023), and UCF-101-O Modi et al. (2024).
Dataset Splits Yes In the zero-shot setting, models trained on the Kinetics-400 dataset undergo evaluation on three different cross-datasets: HMDB-51, UCF-101, and Kinetics-600. For HMDB-51 and UCF-101, the methods are assessed across their respective three validation splits, and the top-1 average accuracy is reported. Regarding Kinetics-600, we assess the performance on 220 categories that do not overlap with Kinetics-400, reporting top-1 accuracy. In this setting, single-view inference using 8 frames is applied. For a comprehensive assessment of the generalization capabilities of various approaches, we adopt the base-to-novel generalization setting Rasheed et al. (2023) for video action recognition tasks. The dataset employs three training splits, classifying the total categories into two equal halves. The most frequently occurring classes constitute the base classes, while the rarely occurring categories are designated as the novel classes. The few-shot setting involves creating a general K-shot split, with K samples used in accordance with splits from Rasheed et al. (2023). Specifically, we experiment with 2, 4, 8, and 16 shots on three datasets: HMDB-51, UCF-101, and SSv2. The models are assessed on the first validation split for HMDB-51 and UCF-101, and on the full validation split for SSv2, a temporally challenging dataset.
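The K-shot split construction described above can be sketched as follows. This is a minimal illustration only; the function and variable names are ours, and the actual splits used by the paper follow Rasheed et al. (2023):

```python
import random
from collections import defaultdict

def make_k_shot_split(samples, k, seed=0):
    """Select k training samples per class from (video_path, label) pairs.

    A generic sketch of the K-shot protocol (the paper uses k in {2, 4, 8, 16});
    not the paper's actual split files.
    """
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    split = []
    for label, paths in by_class.items():
        chosen = rng.sample(paths, min(k, len(paths)))
        split.extend((p, label) for p in chosen)
    return split
```

For example, on a dataset with 51 classes (HMDB-51), a 4-shot split would contain 204 training videos.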
Hardware Specification Yes Training is performed on a single NVIDIA A100 80GB GPU, with a batch size of 70, and maintaining an input frame resolution of 224×224 pixels.
Software Dependencies No The paper mentions using a "ViT-B/16-based CLIP model" and an "AdamW optimizer" but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes We use a ViT-B/16-based CLIP model as the visual encoder in our experiments. In addition, we also assess our model's generalization using CLIP ViT-B/32 and CLIP ViT-L/14 backbones. The setup employs only 8 sparsely sampled frames per video, ensuring computational efficiency. We use the AdamW optimizer with a base learning rate of 5×10⁻⁵ to train our models for 50 epochs with a weight decay of 0.2. The learning rate warms up for the initial 10% of epochs and then follows a cosine schedule. Training is performed on a single NVIDIA A100 80GB GPU, with a batch size of 70, and maintaining an input frame resolution of 224×224 pixels.
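The reported schedule (10% linear warmup, then cosine decay from the base learning rate of 5×10⁻⁵ over 50 epochs) can be sketched as a per-epoch multiplier. The helper name and the exact warmup shape are our assumptions, not taken from the T2L code:

```python
import math

BASE_LR = 5e-5   # base learning rate reported in the paper
EPOCHS = 50      # total training epochs
WARMUP = 0.10    # warmup for the initial 10% of epochs

def lr_at(epoch):
    """Learning rate at a 0-indexed epoch: linear warmup, then cosine decay to 0."""
    warmup_epochs = EPOCHS * WARMUP  # 5 epochs
    if epoch < warmup_epochs:
        # linear ramp from BASE_LR / warmup_epochs up to BASE_LR
        return BASE_LR * (epoch + 1) / warmup_epochs
    # cosine decay from BASE_LR toward 0 over the remaining epochs
    progress = (epoch - warmup_epochs) / (EPOCHS - warmup_epochs)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop this multiplier would typically be applied to the AdamW (weight decay 0.2) optimizer's parameter groups once per epoch, e.g. via a `LambdaLR` scheduler.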