JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts

Authors: Taein Son, Soo Won Seo, Jisong Kim, Seok Hwan Lee, Jun Won Choi

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our evaluation on three prominent VAD benchmarks (AVA, UCF101-24, and JHMDB51-21) demonstrates that incorporating multimodal information significantly enhances performance, setting a new state of the art in the field.
Researcher Affiliation Academia 1) Hanyang University, 2) Seoul National University
Pseudocode No The paper describes the architecture and methodology using descriptive text and figures, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code Yes Code: https://github.com/taeiin/AAAI2025-JoVALE
Open Datasets Yes We evaluate JoVALE on three standard VAD datasets: AVA (Gu et al. 2018), UCF101-24 (Soomro, Zamir, and Shah 2012), and JHMDB51-21 (Jhuang et al. 2013).
Dataset Splits Yes AVA consists of 299 15-minute movie clips, with 235 for training and 64 for validation. We evaluate our approach on 60 action classes in AVA v2.2. UCF101-24, a subset of UCF101, contains 24 sport action classes with 3,207 instances, and our method is evaluated on the first split. JHMDB51-21, a subset of JHMDB51, includes 928 trimmed video clips spanning 21 action classes. We report the average performance across the three standard splits of the dataset.
Hardware Specification Yes Training was conducted for 8 epochs with a batch size of 16, utilizing four NVIDIA GeForce RTX 3090 GPUs.
Software Dependencies No The paper mentions sigmoid focal loss and the AdamW optimizer but does not specify version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup Yes The AMFN consists of L = 6 Transformer layers. When conducting temporal alignment and dealignment within the MFA module, the temporal dimension Tc is aligned to match that of the visual images, Tv. Specifically, Tv is set to 4 when utilizing the SlowFast (Feichtenhofer et al. 2019) architecture and 8 with the ViT (Dosovitskiy et al. 2020). For audio data, the spectrograms are processed through a convolutional layer with a kernel size of P = 16 and a stride of S = 10, yielding audio embeddings with a temporal length of Ta = 20. Regarding the scene-descriptive features, the number of input frames fed into BLIP is set at Ts = 4. The hyperparameter D, representing the Transformer embedding size, is set to 256. The entire model was trained using sigmoid focal loss for action classification. The AdamW optimizer was employed with a weight decay of 1e-4. Initial learning rates were set to 1e-5 for the video backbone and 1e-4 for the other networks, with a tenfold reduction applied at the 7th epoch. Training was conducted for 8 epochs with a batch size of 16.
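Two details of this setup can be sanity-checked numerically: the audio embedding length Ta implied by the conv kernel and stride, and the stepped learning-rate schedule. The sketch below is a minimal check, not the authors' code; the spectrogram time length of 206 and the no-padding convention are assumptions chosen so the standard conv-output formula yields Ta = 20.

```python
def conv_out_len(n_in, kernel, stride, padding=0):
    """Output length of a 1-D convolution (no dilation): floor((n + 2p - k) / s) + 1."""
    return (n_in + 2 * padding - kernel) // stride + 1


def lr_at_epoch(epoch, base_lr, drop_epoch=7, factor=0.1):
    """Step schedule from the setup: tenfold reduction applied at the 7th epoch."""
    return base_lr * factor if epoch >= drop_epoch else base_lr


P, S = 16, 10                   # audio conv kernel size and stride from the paper
Ta = conv_out_len(206, P, S)    # assumed 206-step spectrogram -> Ta = 20

# Per-module initial learning rates: 1e-5 for the video backbone,
# 1e-4 for the other networks, both dropped tenfold at epoch 7.
backbone_lr_e7 = lr_at_epoch(7, 1e-5)   # 1e-6 after the drop
head_lr_e6 = lr_at_epoch(6, 1e-4)       # still 1e-4 before the drop
```

In a PyTorch-style training loop these two rates would typically be expressed as separate optimizer parameter groups; the sketch only verifies the arithmetic behind the stated values.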