Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
Authors: Pengcheng Zhao, Jinxing Zhou, Yang Zhao, Dan Guo, Yanxiang Chen
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate the effectiveness of the proposed modules and loss functions, resulting in a new state-of-the-art parsing performance. Following prior works (Jiang et al. 2022; Lai, Chen, and Yu-Chiang 2023), our experiments are conducted on the Look, Listen, and Parse (LLP) (Tian, Li, and Xu 2020) dataset, which is currently the sole standard dataset used for the AVVP task. |
| Researcher Affiliation | Academia | School of Computer Science and Information Engineering, Hefei University of Technology EMAIL, EMAIL |
| Pseudocode | No | The paper describes the approach using textual descriptions and diagrams (Figure 2) but does not provide any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code will be publicly available in https://github.com/PengchengZhao1001/MM-CSE. |
| Open Datasets | Yes | Following prior works (Jiang et al. 2022; Lai, Chen, and Yu-Chiang 2023), our experiments are conducted on the Look, Listen, and Parse (LLP) (Tian, Li, and Xu 2020) dataset, which is currently the sole standard dataset used for the AVVP task. |
| Dataset Splits | Yes | Following the official data splits, the dataset is divided into 10,000 videos for training, 649 for validation, and 1,200 for testing. |
| Hardware Specification | No | The paper describes training configurations and feature extraction methods but does not specify any particular hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and pretrained models like CLIP, R(2+1)D, and CLAP, but does not specify software dependencies with version numbers such as Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | Our model is trained for 60 epochs with a batch size of 64 using the AdamW optimizer, with an initial learning rate of 3e-4 and a weight decay of 1e-3. Feature dimensions d1 and d2 are set to 256 and 128, respectively. We use L = 4 stacked FGSE layers. The hyperparameters λ1 and λ2 in Eq. 12 are empirically set to 0.1. |
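The reported hyperparameters above can be collected into a single configuration for a reproduction attempt. The sketch below is a minimal, hypothetical illustration: the `TrainConfig` names and the `total_loss` helper are assumptions for this example (the paper only states that Eq. 12 combines loss terms with weights λ1 = λ2 = 0.1; the exact loss terms are defined in the paper itself).

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Hyperparameters reported in the paper's experiment setup."""
    epochs: int = 60
    batch_size: int = 64
    lr: float = 3e-4            # initial learning rate for AdamW
    weight_decay: float = 1e-3
    d1: int = 256               # feature dimension d1
    d2: int = 128               # feature dimension d2
    num_fgse_layers: int = 4    # L stacked FGSE layers
    lambda1: float = 0.1        # loss weight λ1 in Eq. 12
    lambda2: float = 0.1        # loss weight λ2 in Eq. 12


def total_loss(base: float, aux1: float, aux2: float, cfg: TrainConfig) -> float:
    # Hypothetical weighted-sum form assumed for Eq. 12:
    # total = base + λ1 * aux1 + λ2 * aux2
    return base + cfg.lambda1 * aux1 + cfg.lambda2 * aux2


cfg = TrainConfig()
print(total_loss(1.0, 0.5, 0.5, cfg))  # 1.0 + 0.1*0.5 + 0.1*0.5 = 1.1
```

In a PyTorch reproduction, `cfg.lr` and `cfg.weight_decay` would be passed directly to `torch.optim.AdamW`.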