TEST-V: TEst-time Support-set Tuning for Zero-shot Video Classification

Authors: Rui Yan, Jin Wang, Hongyu Qu, Xiaoyu Du, Dong Zhang, Jinhui Tang, Tieniu Tan

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | TEST-V achieves state-of-the-art results across four benchmarks and shows good interpretability. Extensive experimental results show that TEST-V improves the ... by 2.98%, 2.15%, and 1.83% absolute average accuracy respectively across four benchmarks. We evaluate the effectiveness of the proposed method on four popular video benchmarks, i.e., HMDB-51 [Kuehne et al., 2011], UCF-101 [Soomro et al., 2012], Kinetics-600 [Carreira et al., 2018], and ActivityNet [Fabian Caba Heilbron and Niebles, 2015].
Researcher Affiliation | Academia | Rui Yan (1,2), Jin Wang (1), Hongyu Qu (1), Xiaoyu Du (1), Dong Zhang (3), Jinhui Tang (1), and Tieniu Tan (2); (1) Nanjing University of Science and Technology, (2) Nanjing University, (3) Hong Kong University of Science and Technology. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using prose and diagrams (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks; a hypothetical reconstruction is sketched after this table.
Open Source Code | No | The paper does not explicitly state that source code for the described methodology is being released, nor does it provide a link to a repository. There is an arXiv link to an extended version of the paper, but no direct code link.
Open Datasets | Yes | We evaluate the effectiveness of the proposed method on four popular video benchmarks, i.e., HMDB-51 [Kuehne et al., 2011], UCF-101 [Soomro et al., 2012], Kinetics-600 [Carreira et al., 2018], and ActivityNet [Fabian Caba Heilbron and Niebles, 2015].
Dataset Splits | Yes | We evaluate the effectiveness of the proposed method on four popular video benchmarks, i.e., HMDB-51 [Kuehne et al., 2011], UCF-101 [Soomro et al., 2012], Kinetics-600 [Carreira et al., 2018], and ActivityNet [Fabian Caba Heilbron and Niebles, 2015]. These are well-known benchmark datasets with standard, predefined splits for evaluation.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types) used to run the experiments.
Software Dependencies | No | The paper mentions pre-trained vision-language models (VLMs) such as CLIP [Radford et al., 2021], BIKE [Wu et al., 2023], and ViFi-CLIP [Rasheed et al., 2023]; LLMs (ChatGPT [OpenAI, 2023], Gemini [Team et al., 2024], Llama-3 [AI@Meta, 2024], Claude-3 [Anthropic, 2024]); and text-to-video models (LaVie [Wang et al., 2023b], Show-1 [Zhang et al., 2023a], HiGen [Qing et al., 2024], TF-T2V [Wang et al., 2024], ModelScopeT2V [Wang et al., 2023a]), but it does not provide version numbers for these components or for the other ancillary software required for reproduction. An illustrative version-logging snippet follows the table.
Experiment Setup | No | The paper discusses the methodology, its components (MSD, TSE), and ablations of certain parameters, such as n (the number of repeatedly generated videos) for support-set construction and different sampling strategies for multi-scale temporal tuning. However, it does not provide specific hyperparameters such as learning rates, batch sizes, optimizers, or other training configurations used for fine-tuning or optimization. A placeholder configuration sketch follows the table.
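
Since the paper provides no pseudocode, the block below is a minimal Python sketch of the test-time support-set pipeline as it can be inferred from the prose: dilation (MSD) generates n support videos per class from LLM-produced prompt variants, and erosion (TSE) learns per-sample weights at test time to down-weight noisy generated videos. Every name, signature, and the entropy-minimization objective here are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical reconstruction of the TEST-V pipeline described in prose.
# All names (dilate_support_set, erode_and_classify, ...) are illustrative;
# the paper releases no code, so nothing here is the authors' implementation.
import torch
import torch.nn.functional as F

def dilate_support_set(class_names, prompt_llm, t2v_model, n=4):
    """MSD-style dilation: for each class, ask an LLM for prompt variants
    and synthesize n support videos per prompt with a T2V model."""
    support = {}
    for c in class_names:
        prompts = prompt_llm(c)  # list of textual prompt variants
        support[c] = [t2v_model(p) for p in prompts for _ in range(n)]
    return support

def erode_and_classify(query_feat, support_feats, steps=10, lr=1e-2):
    """TSE-style erosion: learn per-support-sample weights at test time,
    then classify the query against the weighted class prototypes."""
    # support_feats: (C, M, D) = classes x support videos x feature dim
    C, M, D = support_feats.shape
    logits_w = torch.zeros(C, M, requires_grad=True)
    opt = torch.optim.Adam([logits_w], lr=lr)
    for _ in range(steps):
        w = logits_w.softmax(dim=1).unsqueeze(-1)                 # (C, M, 1)
        protos = F.normalize((w * support_feats).sum(1), dim=-1)  # (C, D)
        sims = F.normalize(query_feat, dim=-1) @ protos.T         # (1, C)
        # Entropy minimization as a stand-in test-time objective (assumption).
        probs = (sims / 0.07).softmax(dim=-1)
        loss = -(probs * probs.clamp_min(1e-8).log()).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        w = logits_w.softmax(dim=1).unsqueeze(-1)
        protos = F.normalize((w * support_feats).sum(1), dim=-1)
        return (F.normalize(query_feat, dim=-1) @ protos.T).argmax(-1)
```

Under this reading, n controls how many videos are synthesized per prompt, matching the ablation of n mentioned in the Experiment Setup row above.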
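On the dependency point, a reproduction would at minimum record exact library versions; the snippet below shows one way to do that with Python's standard importlib.metadata. The package list is an assumption about a plausible stack, not one confirmed by the paper.

```python
# Illustrative version logging for a reproduction attempt; the listed
# packages are plausible dependencies, not a stack confirmed by the paper.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["torch", "torchvision", "transformers", "numpy"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```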
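To make the experiment-setup gap concrete, the dataclass below sketches the settings a reproduction would need the authors to pin down. Only n_generated_videos and the temporal-sampling choice correspond to knobs the paper actually ablates; every default value is a hypothetical placeholder, not a number reported in the paper.

```python
# Hypothetical experiment configuration; only `n_generated_videos` and the
# temporal-sampling choice are knobs the paper ablates -- all numeric
# defaults are placeholders the paper does not report.
from dataclasses import dataclass

@dataclass
class TestVConfig:
    n_generated_videos: int = 4             # 'n' in the paper's ablation
    temporal_sampling: str = "multi_scale"  # sampling strategy, per ablation
    # Undisclosed in the paper -- placeholders for a reproduction attempt:
    learning_rate: float = 1e-3
    batch_size: int = 32
    optimizer: str = "AdamW"
    tuning_steps: int = 10

cfg = TestVConfig()
print(cfg)
```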