Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing

Authors: Pengcheng Zhao, Jinxing Zhou, Yang Zhao, Dan Guo, Yanxiang Chen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments validate the effectiveness of the proposed modules and loss functions, resulting in a new state-of-the-art parsing performance." Section 4.1 (Experimental Setups, Dataset & Metrics): "Following prior works (Jiang et al. 2022; Lai, Chen, and Yu-Chiang 2023), our experiments are conducted on the Look, Listen, and Parse (LLP) (Tian, Li, and Xu 2020) dataset, which is currently the sole standard dataset used for the AVVP task."
Researcher Affiliation | Academia | School of Computer Science and Information Engineering, Hefei University of Technology. EMAIL, EMAIL
Pseudocode | No | The paper describes the approach using textual descriptions and diagrams (Figure 2) but does not provide any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | "The code will be publicly available in https://github.com/PengchengZhao1001/MM-CSE."
Open Datasets | Yes | "Following prior works (Jiang et al. 2022; Lai, Chen, and Yu-Chiang 2023), our experiments are conducted on the Look, Listen, and Parse (LLP) (Tian, Li, and Xu 2020) dataset, which is currently the sole standard dataset used for the AVVP task."
Dataset Splits | Yes | "Following the official data splits, the dataset is divided into 10,000 videos for training, 649 for validation, and 1,200 for testing."
Hardware Specification | No | The paper describes training configurations and feature extraction methods but does not specify the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions the AdamW optimizer and pretrained models such as CLIP, R(2+1)D, and CLAP, but does not list software dependencies with version numbers (e.g., Python, PyTorch, or CUDA).
Experiment Setup | Yes | "Our model is trained for 60 epochs with a batch size of 64 using the AdamW optimizer, with an initial learning rate of 3e-4 and a weight decay of 1e-3. Feature dimensions d1 and d2 are set to 256 and 128, respectively. We use L = 4 stacked FGSE layers. The hyperparameters λ1 and λ2 in Eq. 12 are empirically set to 0.1."
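For readers attempting a reproduction, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. This is a minimal sketch based only on the values reported above; the names `TrainConfig` and `make_optimizer_kwargs` are illustrative, not from the paper or its released code.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Values quoted from the paper's experiment setup.
    epochs: int = 60
    batch_size: int = 64
    lr: float = 3e-4           # initial learning rate
    weight_decay: float = 1e-3
    d1: int = 256              # feature dimension d1
    d2: int = 128              # feature dimension d2
    num_fgse_layers: int = 4   # L = 4 stacked FGSE layers
    lambda1: float = 0.1       # loss weight λ1 in Eq. 12
    lambda2: float = 0.1       # loss weight λ2 in Eq. 12

def make_optimizer_kwargs(cfg: TrainConfig) -> dict:
    """Keyword arguments for an AdamW-style optimizer
    (e.g. torch.optim.AdamW(model.parameters(), **kwargs))."""
    return {"lr": cfg.lr, "weight_decay": cfg.weight_decay}

cfg = TrainConfig()
print(make_optimizer_kwargs(cfg))  # {'lr': 0.0003, 'weight_decay': 0.001}
```

Since the paper does not pin library versions, anything beyond these scalar settings (optimizer internals, scheduler, initialization) would have to be confirmed against the released repository.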