Reliable and Diverse Hierarchical Adapter for Zero-shot Video Classification

Authors: Wenxuan Ge, Peng Huang, Rui Yan, Hongyu Qu, Guosen Xie, Xiangbo Shu

IJCAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Experiments on four popular video classification benchmarks demonstrate the effectiveness of Hierarchical Adapter. The code is available at https://github.com/Gwxer/Hierarchical-Adapter. [...] Extensive experiments over four benchmarks demonstrate that the reliable and diverse hierarchical adapter achieves superior performance while maintaining competitive computational efficiency.
Researcher Affiliation | Academia | Nanjing University of Science and Technology, EMAIL, EMAIL
Pseudocode | Yes | For clarity, we provide the whole cache update process in Algorithm 1 in the form of pseudo-code.
Open Source Code | Yes | Experiments on four popular video classification benchmarks demonstrate the effectiveness of Hierarchical Adapter. The code is available at https://github.com/Gwxer/Hierarchical-Adapter.
Open Datasets | Yes | HMDB-51 [Kuehne et al., 2011] is a small-scale action recognition dataset. [...] UCF-101 [Soomro, 2012] consists of 13,320 videos covering 101 categories, which can be further grouped into five main categories: Body motion, Human-human interactions, Human-object interactions, Playing instruments, and Sports. Kinetics-600 [Carreira et al., 2018] is a large-scale video dataset, containing 600 human action classes, with at least 600 video clips for each action. [...] ActivityNet-200 [Fabian Caba Heilbron and Niebles, 2015] is also a large-scale action recognition benchmark.
Dataset Splits | No | The paper mentions evaluating on specific datasets (HMDB-51, UCF-101, Kinetics-600, ActivityNet-200) and using a validation set for hyperparameter search on Kinetics-400, but does not explicitly provide the training/test/validation split percentages or sample counts for any of these datasets in the main text. While these are standard benchmarks, the specific splits used are not detailed.
Hardware Specification | Yes | All the experiments are conducted using a single NVIDIA 3090 24GB GPU.
Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | We utilize a pre-trained ViT-B/16 of CLIP as the foundation model, and the model is not fine-tuned on extra large video datasets. In test-time adaptation, we sample T = 32 frames from each test video. We use top-1 accuracy (%) as our evaluation metric. We perform a search for hyperparameters on the validation set of Kinetics-400. In FCR, we select 8 frames based on prediction entropy, and subsequently select 5 frames based on TPD to construct refined video embeddings. When calculating TPD, each frame is divided into 7×7 image patches, and temporal shuffling is applied between adjacent 2 frames. In Algorithm 1, cache size n is set as 10 and similarity threshold τ is 0.95. In Eq. 2, β is 8 according to TDA, and in Eq. 7, µ is set to 0.5.
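The setup row above describes two concrete mechanisms: entropy-based frame selection in FCR (keep the 8 most confident of T = 32 frames) and the Algorithm 1 cache with size n = 10 and similarity threshold τ = 0.95. A minimal sketch of both is given below; the function names and the oldest-first eviction policy are illustrative assumptions, not details taken from the paper or its code:

```python
import numpy as np

def frame_entropy(logits):
    """Per-frame prediction entropy from class logits of shape [T, C]."""
    # Numerically stable softmax over the class dimension.
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def select_confident_frames(logits, k=8):
    """Keep the indices of the k frames with the lowest prediction entropy."""
    idx = np.argsort(frame_entropy(logits))[:k]
    return np.sort(idx)

def update_cache(cache, emb, n=10, tau=0.95):
    """Add a unit-normalized embedding unless it is a near-duplicate
    (cosine similarity >= tau) of an existing entry.
    Oldest-first eviction when full is an assumption for illustration."""
    emb = emb / np.linalg.norm(emb)
    if cache and max(float(c @ emb) for c in cache) >= tau:
        return cache  # too similar to a cached entry; skip
    cache.append(emb)
    if len(cache) > n:
        cache.pop(0)
    return cache
```

With T = 32 frames and 101 classes (UCF-101), `select_confident_frames(logits, k=8)` returns 8 frame indices; TPD-based selection of the final 5 frames would then operate on this subset.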