Reliable and Diverse Hierarchical Adapter for Zero-shot Video Classification
Authors: Wenxuan Ge, Peng Huang, Rui Yan, Hongyu Qu, Guosen Xie, Xiangbo Shu
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on four popular video classification benchmarks demonstrate the effectiveness of Hierarchical Adapter. The code is available at https://github.com/Gwxer/Hierarchical-Adapter. [...] Extensive experiments over four benchmarks demonstrate that the reliable and diverse hierarchical adapter achieves superior performance while maintaining competitive computational efficiency. |
| Researcher Affiliation | Academia | Nanjing University of Science and Technology EMAIL, EMAIL |
| Pseudocode | Yes | For clarity, we provide the whole cache update process in Algorithm 1 in the form of pseudo-code. |
| Open Source Code | Yes | Experiments on four popular video classification benchmarks demonstrate the effectiveness of Hierarchical Adapter. The code is available at https://github.com/Gwxer/Hierarchical-Adapter. |
| Open Datasets | Yes | HMDB-51 [Kuehne et al., 2011] is a small-scale action recognition dataset. [...] UCF-101 [Soomro, 2012] consists of 13,320 videos covering 101 categories, which can be further grouped into five main categories: Body motion, Human-human interactions, Human-object interactions, Playing instruments, and Sports. Kinetics-600 [Carreira et al., 2018] is a large-scale video dataset, containing 600 human action classes, with at least 600 video clips for each action. [...] ActivityNet-200 [Fabian Caba Heilbron and Niebles, 2015] is also a large-scale action recognition benchmark |
| Dataset Splits | No | The paper mentions evaluating on specific datasets (HMDB-51, UCF-101, Kinetics-600, ActivityNet-200) and using a validation set for hyperparameter search on Kinetics-400, but does not explicitly provide the training/test/validation split percentages or sample counts for any of these datasets in the main text. While these are standard benchmarks, the specific splits used are not detailed. |
| Hardware Specification | Yes | All the experiments are conducted using a single NVIDIA 3090 24GB GPU. |
| Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions). |
| Experiment Setup | Yes | We utilize a pre-trained ViT-B/16 of CLIP as the foundation model, and the model is not fine-tuned on extra large video datasets. In test-time adaptation, we sample T = 32 frames from each test video. We use top-1 accuracy (%) as our evaluation metric. We perform a hyperparameter search on the validation set of Kinetics-400. In FCR, we select 8 frames based on prediction entropy, and subsequently select 5 frames based on TPD to construct refined video embeddings. When calculating TPD, each frame is divided into 7 × 7 image patches, and temporal shuffling is applied between adjacent 2 frames. In Algorithm 1, cache size n is set as 10 and similarity threshold τ is 0.95. In Eq. 2, β is 8 according to TDA, and in Eq. 7, µ is set to 0.5. |
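The experiment-setup row reports a concrete set of hyperparameters and a two-stage frame selection (first by prediction entropy, then by TPD). As a minimal sketch of that configuration and the entropy-ranking stage, the snippet below collects the reported values in a config dict and implements lowest-entropy frame selection; the config key names, the `select_frames` helper, and the toy probabilities are illustrative assumptions, not code from the paper's repository, and the TPD stage is omitted since its definition is not given here.

```python
import math

# Hyperparameters as reported in the setup row (key names are illustrative).
CONFIG = {
    "frames_sampled": 32,    # T frames sampled per test video
    "entropy_top_k": 8,      # FCR stage 1: frames kept by lowest prediction entropy
    "tpd_top_k": 5,          # FCR stage 2: frames kept by TPD (not sketched here)
    "patch_grid": (7, 7),    # image patches per frame when computing TPD
    "shuffle_window": 2,     # temporal shuffling between adjacent frames
    "cache_size": 10,        # n in Algorithm 1
    "sim_threshold": 0.95,   # tau in Algorithm 1
    "beta": 8,               # Eq. 2, following TDA
    "mu": 0.5,               # Eq. 7
}

def prediction_entropy(probs):
    """Shannon entropy of one frame's class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_frames(frame_probs, k):
    """Keep the k frames whose predictions are most confident (lowest entropy),
    returned in temporal order."""
    ranked = sorted(range(len(frame_probs)),
                    key=lambda i: prediction_entropy(frame_probs[i]))
    return sorted(ranked[:k])

# Toy usage: 4 frames over 3 classes; keep the 2 most confident frames.
frames = [
    [0.34, 0.33, 0.33],  # near-uniform -> high entropy
    [0.90, 0.05, 0.05],  # confident    -> low entropy
    [0.50, 0.25, 0.25],
    [0.80, 0.10, 0.10],
]
print(select_frames(frames, 2))  # -> [1, 3]
```

In the reported pipeline this entropy stage would narrow T = 32 sampled frames down to 8, before the TPD-based stage selects the final 5.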