JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts

Authors: Taein Son, Soo Won Seo, Jisong Kim, Seok Hwan Lee, Jun Won Choi

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our evaluation on three prominent VAD benchmarks (AVA, UCF101-24, and JHMDB51-21) demonstrates that incorporating multimodal information significantly enhances performance, setting a new state of the art in the field.
Researcher Affiliation Academia 1) Hanyang University, 2) Seoul National University
Pseudocode No The paper describes the architecture and methodology using descriptive text and figures, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code Yes Code: https://github.com/taeiin/AAAI2025-JoVALE
Open Datasets Yes We evaluate JoVALE on three standard VAD datasets: AVA (Gu et al. 2018), UCF101-24 (Soomro, Zamir, and Shah 2012), and JHMDB51-21 (Jhuang et al. 2013).
Dataset Splits Yes AVA consists of 299 15-minute movie clips, with 235 for training and 64 for validation. We evaluate our approach on 60 action classes in AVA v2.2. UCF101-24, a subset of UCF101, contains 24 sport action classes with 3,207 instances, and our method is evaluated on the first split. JHMDB51-21, a subset of JHMDB51, includes 928 trimmed video clips spanning 21 action classes. We report the average performance across the three standard splits of the dataset.
Hardware Specification Yes Training was conducted for 8 epochs with a batch size of 16, utilizing four NVIDIA GeForce RTX 3090 GPUs.
Software Dependencies No The paper mentions sigmoid focal loss and the AdamW optimizer but does not specify version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup Yes The AMFN consists of L = 6 Transformer layers. When conducting temporal alignment and dealignment within the MFA module, the temporal dimension Tc is aligned to match that of the visual images, Tv. Specifically, Tv is set to 4 when utilizing the SlowFast (Feichtenhofer et al. 2019) architecture and 8 with the ViT (Dosovitskiy et al. 2020). For audio data, the spectrograms are processed through a convolutional layer with a kernel size of P = 16 and a stride of S = 10, yielding audio embeddings with a temporal length of Ta = 20. Regarding the scene-descriptive features, the number of input frames fed into BLIP is set at Ts = 4. The hyperparameter D, representing the Transformer embedding size, is set to 256. The entire model was trained using sigmoid focal loss for action classification. The AdamW optimizer was employed with a weight decay of 1e-4. Initial learning rates were set to 1e-5 for the video backbone and 1e-4 for the other networks, with a tenfold reduction applied at the 7th epoch. Training was conducted for 8 epochs with a batch size of 16.
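Two details of this setup can be sanity-checked numerically: the audio embedding length Ta implied by the conv kernel and stride, and the stepped learning-rate schedule. The sketch below is a minimal check, not the authors' code; the spectrogram time length of 206 and the no-padding convention are assumptions chosen so the standard conv-output formula yields Ta = 20.

```python
def conv_out_len(n_in, kernel, stride, padding=0):
    """Output length of a 1-D convolution (no dilation): floor((n + 2p - k) / s) + 1."""
    return (n_in + 2 * padding - kernel) // stride + 1


def lr_at_epoch(epoch, base_lr, drop_epoch=7, factor=0.1):
    """Step schedule from the setup: tenfold reduction applied at the 7th epoch."""
    return base_lr * factor if epoch >= drop_epoch else base_lr


P, S = 16, 10                   # audio conv kernel size and stride from the paper
Ta = conv_out_len(206, P, S)    # assumed 206-step spectrogram -> Ta = 20

# Per-module initial learning rates: 1e-5 for the video backbone,
# 1e-4 for the other networks, both dropped tenfold at epoch 7.
backbone_lr_e7 = lr_at_epoch(7, 1e-5)   # 1e-6 after the drop
head_lr_e6 = lr_at_epoch(6, 1e-4)       # still 1e-4 before the drop
```

In a PyTorch-style training loop these two rates would typically be expressed as separate optimizer parameter groups; the sketch only verifies the arithmetic behind the stated values.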