Reliable and Diverse Hierarchical Adapter for Zero-shot Video Classification

Authors: Wenxuan Ge, Peng Huang, Rui Yan, Hongyu Qu, Guosen Xie, Xiangbo Shu

IJCAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Experiments on four popular video classification benchmarks demonstrate the effectiveness of Hierarchical Adapter. The code is available at https://github.com/Gwxer/Hierarchical-Adapter. [...] Extensive experiments over four benchmarks demonstrate that the reliable and diverse hierarchical adapter achieves superior performance while maintaining competitive computational efficiency.
Researcher Affiliation | Academia | Nanjing University of Science and Technology, EMAIL, EMAIL
Pseudocode | Yes | For clarity, we provide the whole cache update process in Algorithm 1 in the form of pseudo-code.
Open Source Code | Yes | Experiments on four popular video classification benchmarks demonstrate the effectiveness of Hierarchical Adapter. The code is available at https://github.com/Gwxer/Hierarchical-Adapter.
Open Datasets | Yes | HMDB-51 [Kuehne et al., 2011] is a small-scale action recognition dataset. [...] UCF-101 [Soomro, 2012] consists of 13,320 videos covering 101 categories, which can be further grouped into five main categories: Body motion, Human-human interactions, Human-object interactions, Playing instruments, and Sports. Kinetics-600 [Carreira et al., 2018] is a large-scale video dataset, containing 600 human action classes, with at least 600 video clips for each action. [...] ActivityNet-200 [Fabian Caba Heilbron and Niebles, 2015] is also a large-scale action recognition benchmark.
Dataset Splits | No | The paper mentions evaluating on specific datasets (HMDB-51, UCF-101, Kinetics-600, ActivityNet-200) and using a validation set for hyperparameter search on Kinetics-400, but does not explicitly provide the training/test/validation split percentages or sample counts for any of these datasets in the main text. While these are standard benchmarks, the specific splits used are not detailed.
Hardware Specification | Yes | All the experiments are conducted using a single NVIDIA 3090 24GB GPU.
Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | We utilize a pre-trained ViT-B/16 of CLIP as the foundation model, and the model is not fine-tuned on extra large video datasets. In test-time adaptation, we sample T = 32 frames from each test video. We use top-1 accuracy (%) as our evaluation metric. We perform a search for hyperparameters on the validation set of Kinetics-400. In FCR, we select 8 frames based on prediction entropy, and subsequently select 5 frames based on TPD to construct refined video embeddings. When calculating TPD, each frame is divided into 7×7 image patches, and temporal shuffling is applied between adjacent 2 frames. In Algorithm 1, cache size n is set as 10 and similarity threshold τ is 0.95. In Eq. 2, β is 8 according to TDA, and in Eq. 7, µ is set to 0.5.
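The setup row above describes two concrete mechanisms: entropy-based frame selection in FCR (keep the 8 most confident of T = 32 frames) and the Algorithm 1 cache with size n = 10 and similarity threshold τ = 0.95. A minimal sketch of both is given below; the function names and the oldest-first eviction policy are illustrative assumptions, not details taken from the paper or its code:

```python
import numpy as np

def frame_entropy(logits):
    """Per-frame prediction entropy from class logits of shape [T, C]."""
    # Numerically stable softmax over the class dimension.
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def select_confident_frames(logits, k=8):
    """Keep the indices of the k frames with the lowest prediction entropy."""
    idx = np.argsort(frame_entropy(logits))[:k]
    return np.sort(idx)

def update_cache(cache, emb, n=10, tau=0.95):
    """Add a unit-normalized embedding unless it is a near-duplicate
    (cosine similarity >= tau) of an existing entry.
    Oldest-first eviction when full is an assumption for illustration."""
    emb = emb / np.linalg.norm(emb)
    if cache and max(float(c @ emb) for c in cache) >= tau:
        return cache  # too similar to a cached entry; skip
    cache.append(emb)
    if len(cache) > n:
        cache.pop(0)
    return cache
```

With T = 32 frames and 101 classes (UCF-101), `select_confident_frames(logits, k=8)` returns 8 frame indices; TPD-based selection of the final 5 frames would then operate on this subset.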