TEST-V: TEst-time Support-set Tuning for Zero-shot Video Classification

Authors: Rui Yan, Jin Wang, Hongyu Qu, Xiaoyu Du, Dong Zhang, Jinhui Tang, Tieniu Tan

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | TEST-V achieves state-of-the-art results across four benchmarks and shows good interpretability. Extensive experimental results show that TEST-V improves the ... by 2.98%, 2.15%, and 1.83% absolute average accuracy respectively across four benchmarks. We evaluate the effectiveness of the proposed method on four popular video benchmarks, i.e., HMDB-51 [Kuehne et al., 2011], UCF-101 [Soomro et al., 2012], Kinetics-600 [Carreira et al., 2018], and ActivityNet [Fabian Caba Heilbron and Niebles, 2015].
Researcher Affiliation | Academia | Rui Yan (1,2), Jin Wang (1), Hongyu Qu (1), Xiaoyu Du (1), Dong Zhang (3), Jinhui Tang (1), and Tieniu Tan (2); (1) Nanjing University of Science and Technology, (2) Nanjing University, (3) Hong Kong University of Science and Technology. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using prose and diagrams (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks; a hypothetical reconstruction is sketched after this table.
Open Source Code | No | The paper does not explicitly state that source code for the described methodology is being released, nor does it provide a link to a repository. There is an arXiv link to an extended version of the paper, but no direct code link.
Open Datasets | Yes | We evaluate the effectiveness of the proposed method on four popular video benchmarks, i.e., HMDB-51 [Kuehne et al., 2011], UCF-101 [Soomro et al., 2012], Kinetics-600 [Carreira et al., 2018], and ActivityNet [Fabian Caba Heilbron and Niebles, 2015].
Dataset Splits | Yes | We evaluate the effectiveness of the proposed method on four popular video benchmarks, i.e., HMDB-51 [Kuehne et al., 2011], UCF-101 [Soomro et al., 2012], Kinetics-600 [Carreira et al., 2018], and ActivityNet [Fabian Caba Heilbron and Niebles, 2015]. These are well-known benchmark datasets with standard, predefined splits for evaluation.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types) used to run the experiments.
Software Dependencies | No | The paper mentions pre-trained vision-language models (VLMs) such as CLIP [Radford et al., 2021], BIKE [Wu et al., 2023], and ViFi-CLIP [Rasheed et al., 2023]; LLMs (ChatGPT [OpenAI, 2023], Gemini [Team et al., 2024], Llama-3 [AI@Meta, 2024], Claude-3 [Anthropic, 2024]); and text-to-video models (LaVie [Wang et al., 2023b], Show-1 [Zhang et al., 2023a], HiGen [Qing et al., 2024], TF-T2V [Wang et al., 2024], ModelScopeT2V [Wang et al., 2023a]), but it does not provide version numbers for these components or for the other ancillary software required for reproduction. An illustrative version-logging snippet follows the table.
Experiment Setup | No | The paper discusses the methodology, its components (MSD, TSE), and ablations of certain parameters, such as n (the number of repeatedly generated videos) for support-set construction and different sampling strategies for multi-scale temporal tuning. However, it does not provide specific hyperparameters such as learning rates, batch sizes, optimizers, or other training configurations used for fine-tuning or optimization. A placeholder configuration sketch follows the table.
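
Since the paper provides no pseudocode, the block below is a minimal Python sketch of the test-time support-set pipeline as it can be inferred from the prose: dilation (MSD) generates n support videos per class from LLM-produced prompt variants, and erosion (TSE) learns per-sample weights at test time to down-weight noisy generated videos. Every name, signature, and the entropy-minimization objective here are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical reconstruction of the TEST-V pipeline described in prose.
# All names (dilate_support_set, erode_and_classify, ...) are illustrative;
# the paper releases no code, so nothing here is the authors' implementation.
import torch
import torch.nn.functional as F

def dilate_support_set(class_names, prompt_llm, t2v_model, n=4):
    """MSD-style dilation: for each class, ask an LLM for prompt variants
    and synthesize n support videos per prompt with a T2V model."""
    support = {}
    for c in class_names:
        prompts = prompt_llm(c)  # list of textual prompt variants
        support[c] = [t2v_model(p) for p in prompts for _ in range(n)]
    return support

def erode_and_classify(query_feat, support_feats, steps=10, lr=1e-2):
    """TSE-style erosion: learn per-support-sample weights at test time,
    then classify the query against the weighted class prototypes."""
    # support_feats: (C, M, D) = classes x support videos x feature dim
    C, M, D = support_feats.shape
    logits_w = torch.zeros(C, M, requires_grad=True)
    opt = torch.optim.Adam([logits_w], lr=lr)
    for _ in range(steps):
        w = logits_w.softmax(dim=1).unsqueeze(-1)                 # (C, M, 1)
        protos = F.normalize((w * support_feats).sum(1), dim=-1)  # (C, D)
        sims = F.normalize(query_feat, dim=-1) @ protos.T         # (1, C)
        # Entropy minimization as a stand-in test-time objective (assumption).
        probs = (sims / 0.07).softmax(dim=-1)
        loss = -(probs * probs.clamp_min(1e-8).log()).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        w = logits_w.softmax(dim=1).unsqueeze(-1)
        protos = F.normalize((w * support_feats).sum(1), dim=-1)
        return (F.normalize(query_feat, dim=-1) @ protos.T).argmax(-1)
```

Under this reading, n controls how many videos are synthesized per prompt, matching the ablation of n mentioned in the Experiment Setup row above.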
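On the dependency point, a reproduction would at minimum record exact library versions; the snippet below shows one way to do that with Python's standard importlib.metadata. The package list is an assumption about a plausible stack, not one confirmed by the paper.

```python
# Illustrative version logging for a reproduction attempt; the listed
# packages are plausible dependencies, not a stack confirmed by the paper.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["torch", "torchvision", "transformers", "numpy"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```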
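To make the experiment-setup gap concrete, the dataclass below sketches the settings a reproduction would need the authors to pin down. Only n_generated_videos and the temporal-sampling choice correspond to knobs the paper actually ablates; every default value is a hypothetical placeholder, not a number reported in the paper.

```python
# Hypothetical experiment configuration; only `n_generated_videos` and the
# temporal-sampling choice are knobs the paper ablates -- all numeric
# defaults are placeholders the paper does not report.
from dataclasses import dataclass

@dataclass
class TestVConfig:
    n_generated_videos: int = 4             # 'n' in the paper's ablation
    temporal_sampling: str = "multi_scale"  # sampling strategy, per ablation
    # Undisclosed in the paper -- placeholders for a reproduction attempt:
    learning_rate: float = 1e-3
    batch_size: int = 32
    optimizer: str = "AdamW"
    tuning_steps: int = 10

cfg = TestVConfig()
print(cfg)
```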