VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval
Authors: Peng Wu, Wanshun Su, Xiangteng He, Peng Wang, Yanning Zhang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the high efficacy of VarCMP in both video-text and video-audio VAR tasks, achieving significant improvements over the best competitors by 5.0% and 5.3% R@1 on the text-video (UCFCrime-AR) and audio-video (XDViolence-AR) datasets, respectively. |
| Researcher Affiliation | Academia | ¹School of Computer Science, Northwestern Polytechnical University, China; ²Wangxuan Institute of Computer Technology, Peking University, China. EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods and formulas but does not include a distinct section labeled 'Pseudocode' or 'Algorithm', nor does it present structured, code-like steps for any procedure. |
| Open Source Code | No | The paper does not provide an explicit statement regarding the release of source code, nor does it include a link to a code repository. |
| Open Datasets | Yes | We conduct experiments on two popular VAR datasets, i.e., UCFCrime-AR and XDViolence-AR (Wu et al. 2024a) for video-text and video-audio VAR tasks. |
| Dataset Splits | Yes | We conduct experiments on two popular VAR datasets, i.e., UCFCrime-AR and XDViolence-AR (Wu et al. 2024a), for video-text and video-audio VAR tasks. Following (Wu et al. 2024a), we use rank-based metrics for evaluation, i.e., Recall at K (R@K, K=1, 5, 10) and Median Rank (MdR), to measure the overall performance. |
| Hardware Specification | Yes | For model training, VarCMP is trained on a single NVIDIA RTX 4090 GPU using PyTorch. |
| Software Dependencies | No | The paper mentions 'PyTorch' as the framework used for training but does not provide a specific version number for it or other key software dependencies. |
| Experiment Setup | Yes | For the network structure, the image and text encoders are adopted from pre-trained CLIP (ViT-B/32), and the audio encoder is adopted from pre-trained CLAP (630k-audioset-fusion-best). The dimension C of visual, text, and audio features is set to 512. We choose the M = 4 most salient patches for each frame in our patch selection module on both datasets. The number of transformer layers is set to 1, the number of heads to 8, and the FFN dimension to 1024. We sample Nv = 32 frames per video and set the max length of text and audio queries to 32. For model training, VarCMP is trained on a single NVIDIA RTX 4090 GPU using PyTorch. We use AdamW as the optimizer with a batch size of 8, a learning rate of 1e-4, and 15 total epochs. |
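The evaluation protocol above reports Recall@K (K=1, 5, 10) and Median Rank over a cross-modal retrieval run. A minimal sketch of how these rank-based metrics are typically computed from a query-by-gallery similarity matrix, assuming the standard protocol where the i-th query matches the i-th gallery item (the function name and structure are illustrative, not from the paper):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K and Median Rank from a similarity matrix `sim`,
    where sim[i, j] scores query i against gallery item j and the
    ground-truth match for query i is gallery item i."""
    # Sort gallery items by descending similarity for each query.
    order = np.argsort(-sim, axis=1)
    # 0-based rank at which each query's ground-truth item appears.
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1)
    metrics = {f"R@{k}": 100.0 * np.mean(ranks < k) for k in ks}
    metrics["MdR"] = float(np.median(ranks)) + 1.0  # 1-based median rank
    return metrics
```

With a perfect similarity matrix (e.g., the identity), every query ranks its match first, giving R@1 = 100 and MdR = 1; the 5.0% and 5.3% R@1 gains quoted above are differences in this R@1 value against the best competitors.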