Ex-VAD: Explainable Fine-grained Video Anomaly Detection Based on Visual-Language Models
Authors: Chao Huang, Yushu Shi, Jie Wen, Wei Wang, Yong Xu, Xiaochun Cao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on the UCF-Crime and XD-Violence datasets demonstrate that Ex-VAD significantly outperforms existing state-of-the-art methods. |
| Researcher Affiliation | Academia | 1 Shenzhen Campus of Sun Yat-Sen University, School of Cyber Science and Technology, Shenzhen, China 2 Harbin Institute of Technology, School of Computer Science and Technology, Shenzhen, China. Correspondence to: Xiaochun Cao <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using textual explanations and diagrams (Figure 2 and Figure 3) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code, nor does it include links to a code repository in the main text, footnotes, or appendices. |
| Open Datasets | Yes | Datasets. We perform experiments on the UCF-Crime (Sultani et al., 2019) and XD-Violence (Wu et al., 2020) datasets. UCF-Crime consists of 1,900 untrimmed surveillance videos with a total duration of 128 hours, covering 13 real-world anomalies (e.g., abuse, robbery, explosion) and normal activities. XD-Violence contains 4,754 untrimmed videos totaling 217 hours, making it one of the largest multimodal violence detection datasets. |
| Dataset Splits | Yes | UCF-Crime consists of 1,900 untrimmed surveillance videos... In the WSVAD, 1,610 videos are used for training with video-level annotations, while 290 videos are used for testing with frame-level annotations. XD-Violence contains 4,754 untrimmed videos totaling 217 hours... The dataset is divided into 3,954 training videos and 800 testing videos, with video-level labels. |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA RTX A100 GPU using PyTorch. |
| Software Dependencies | No | The paper mentions "PyTorch", "CLIP (ViT-B/16)", "BLIP-2", and "Llama-3.1" as software components and models used, but does not specify their version numbers for replication purposes. |
| Experiment Setup | Yes | Key hyperparameters include: σ = 1, τ = 0.07, context length l = 20, window length in LGT-Adapter (64 for XD-Violence, 8 for UCF-Crime), and λ (1×10⁻⁴ for XD-Violence, 1 for UCF-Crime). |
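The reported hyperparameters can be collected into a minimal configuration sketch. This is a hypothetical illustration, not the authors' released code: the function name `exvad_config` and the key names are assumptions; only the numeric values come from the paper.

```python
def exvad_config(dataset: str) -> dict:
    """Return the hyperparameters reported for Ex-VAD on a given benchmark.

    Values are taken from the paper's experiment-setup description;
    the dictionary layout itself is illustrative.
    """
    if dataset not in ("UCF-Crime", "XD-Violence"):
        raise ValueError(f"unknown dataset: {dataset}")

    # Dataset-specific settings: LGT-Adapter window length and loss weight λ.
    per_dataset = {
        "UCF-Crime":   {"window_length": 8,  "lambda_": 1.0},
        "XD-Violence": {"window_length": 64, "lambda_": 1e-4},
    }

    cfg = {
        "sigma": 1,             # σ
        "tau": 0.07,            # temperature τ
        "context_length": 20,   # prompt context length l
    }
    cfg.update(per_dataset[dataset])
    return cfg
```

For example, `exvad_config("XD-Violence")["lambda_"]` would yield `1e-4`, matching the table above.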