Ex-VAD: Explainable Fine-grained Video Anomaly Detection Based on Visual-Language Models

Authors: Chao Huang, Yushu Shi, Jie Wen, Wei Wang, Yong Xu, Xiaochun Cao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results on the UCF-Crime and XD-Violence datasets demonstrate that Ex-VAD significantly outperforms existing state-of-the-art methods.
Researcher Affiliation | Academia | 1 Shenzhen Campus of Sun Yat-Sen University, School of Cyber Science and Technology, Shenzhen, China. 2 Harbin Institute of Technology, School of Computer Science and Technology, Shenzhen, China. Correspondence to: Xiaochun Cao <EMAIL>.
Pseudocode | No | The paper describes the methodology using textual explanations and diagrams (Figure 2 and Figure 3) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code, nor does it include links to a code repository in the main text, footnotes, or appendices.
Open Datasets | Yes | Datasets. We perform experiments on the UCF-Crime (Sultani et al., 2019) and XD-Violence (Wu et al., 2020) datasets. UCF-Crime consists of 1,900 untrimmed surveillance videos with a total duration of 128 hours, covering 13 real-world anomalies (e.g., abuse, robbery, explosion) and normal activities. XD-Violence contains 4,754 untrimmed videos totaling 217 hours, making it one of the largest multimodal violence detection datasets.
Dataset Splits | Yes | UCF-Crime consists of 1,900 untrimmed surveillance videos... In the WSVAD setting, 1,610 videos are used for training with video-level annotations, while 290 videos are used for testing with frame-level annotations. XD-Violence contains 4,754 untrimmed videos totaling 217 hours... The dataset is divided into 3,954 training videos and 800 testing videos, with video-level labels.
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA RTX A100 GPU using PyTorch.
Software Dependencies | No | The paper mentions PyTorch, CLIP (ViT-B/16), BLIP-2, and Llama-3.1 as software components and models used, but does not specify their version numbers for replication purposes.
Experiment Setup | Yes | Key hyperparameters include: σ = 1, τ = 0.07, context length l = 20, window length in LGT-Adapter (64 for XD-Violence, 8 for UCF-Crime), and λ (1×10⁻⁴ for XD-Violence, 1 for UCF-Crime).
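Since the paper releases no code, the reported hyperparameters and dataset splits can be collected into a small configuration sketch for anyone attempting a reproduction. Everything below is illustrative: the dictionary layout and key names are our own, and the λ value for XD-Violence is assumed to be 1×10⁻⁴ based on the paper's notation.

```python
# Hypothetical reproduction config for Ex-VAD (names are ours, not the authors').
CONFIGS = {
    "UCF-Crime": {
        "sigma": 1,             # σ
        "tau": 0.07,            # temperature τ
        "context_length": 20,   # context length l
        "window_length": 8,     # LGT-Adapter window length
        "lambda_weight": 1,     # λ
        "train_videos": 1610,   # video-level annotations
        "test_videos": 290,     # frame-level annotations
        "total_videos": 1900,
    },
    "XD-Violence": {
        "sigma": 1,
        "tau": 0.07,
        "context_length": 20,
        "window_length": 64,
        "lambda_weight": 1e-4,  # assumed reading of "1×10⁻⁴"
        "train_videos": 3954,
        "test_videos": 800,
        "total_videos": 4754,
    },
}

# Sanity check: the reported train/test splits sum to each dataset's size.
for name, cfg in CONFIGS.items():
    assert cfg["train_videos"] + cfg["test_videos"] == cfg["total_videos"], name
```

The split arithmetic checks out for both datasets (1,610 + 290 = 1,900 and 3,954 + 800 = 4,754), which is a useful first consistency test before re-implementing the method.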