VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval
Authors: Peng Wu, Wanshun Su, Xiangteng He, Peng Wang, Yanning Zhang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the high efficacy of VarCMP in both video-text and video-audio VAR tasks, achieving significant improvements over the best competitors by 5.0% and 5.3% R@1 on the text-video (UCFCrime-AR) and audio-video (XDViolence-AR) datasets, respectively. |
| Researcher Affiliation | Academia | ¹School of Computer Science, Northwestern Polytechnical University, China; ²Wangxuan Institute of Computer Technology, Peking University, China. EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods and formulas but does not include a distinct section labeled 'Pseudocode' or 'Algorithm', nor does it present structured, code-like steps for any procedure. |
| Open Source Code | No | The paper does not provide an explicit statement regarding the release of source code, nor does it include a link to a code repository. |
| Open Datasets | Yes | We conduct experiments on two popular VAR datasets, i.e., UCFCrime-AR and XDViolence-AR (Wu et al. 2024a) for video-text and video-audio VAR tasks. |
| Dataset Splits | Yes | We conduct experiments on two popular VAR datasets, i.e., UCFCrime-AR and XDViolence-AR (Wu et al. 2024a), for video-text and video-audio VAR tasks. Following (Wu et al. 2024a), we use rank-based metrics for evaluation, i.e., Recall at K (R@K, K=1, 5, 10) and Median Rank (MdR), to measure the overall performance. |
| Hardware Specification | Yes | For model training, VarCMP is trained on a single NVIDIA RTX 4090 GPU using PyTorch. |
| Software Dependencies | No | The paper mentions 'PyTorch' as the framework used for training but does not provide a specific version number for it or other key software dependencies. |
| Experiment Setup | Yes | For the network structure, the image and text encoders are adopted from pre-trained CLIP (ViT-B/32), and the audio encoder is adopted from pre-trained CLAP (630k-audioset-fusion-best). The dimension C of visual, text, and audio features is set to 512. We choose the M = 4 most salient patches for each frame in our patch selection module on both datasets. The number of transformer layers is set to 1, the number of heads to 8, and the FFN dimension to 1024. We sample Nv = 32 frames per video and set the max length of text and audio queries to 32. For model training, VarCMP is trained on a single NVIDIA RTX 4090 GPU using PyTorch. We use AdamW as the optimizer with a batch size of 8, a learning rate of 1e-4, and 15 total epochs. |
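The evaluation protocol above reports Recall@K (K=1, 5, 10) and Median Rank over a cross-modal retrieval run. A minimal sketch of how these rank-based metrics are typically computed from a query-by-gallery similarity matrix, assuming the standard protocol where the i-th query matches the i-th gallery item (the function name and structure are illustrative, not from the paper):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K and Median Rank from a similarity matrix `sim`,
    where sim[i, j] scores query i against gallery item j and the
    ground-truth match for query i is gallery item i."""
    # Sort gallery items by descending similarity for each query.
    order = np.argsort(-sim, axis=1)
    # 0-based rank at which each query's ground-truth item appears.
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1)
    metrics = {f"R@{k}": 100.0 * np.mean(ranks < k) for k in ks}
    metrics["MdR"] = float(np.median(ranks)) + 1.0  # 1-based median rank
    return metrics
```

With a perfect similarity matrix (e.g., the identity), every query ranks its match first, giving R@1 = 100 and MdR = 1; the 5.0% and 5.3% R@1 gains quoted above are differences in this R@1 value against the best competitors.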