Patch-level Sounding Object Tracking for Audio-Visual Question Answering
Authors: Zhangbin Li, Jinxing Zhou, Jing Zhang, Shengeng Tang, Kun Li, Dan Guo
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on standard datasets demonstrate the effectiveness of our method, achieving competitive performance even compared to recent large-scale pretraining-based approaches. Extensive quantitative and qualitative results validate the effectiveness of our method. |
| Researcher Affiliation | Academia | School of Computer Science and Information Engineering, Hefei University of Technology |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical formulations, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository or mention code in supplementary materials. |
| Open Datasets | Yes | We primarily conduct experiments on the widely-used and challenging MUSICAVQA (Li et al. 2022) dataset. |
| Dataset Splits | Yes | Following the standard protocol in the pioneering work (Li et al. 2022), we adopt the answer prediction accuracy (%) as the metric for model evaluation. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA A40 GPU. |
| Software Dependencies | No | The paper mentions models and optimizers like CLIP-ViT-L/14, CLAP, and the AdamW optimizer, but it does not provide specific version numbers for core software dependencies such as programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA. |
| Experiment Setup | Yes | During model training, we use the AdamW optimizer with an initial learning rate of 1e-4, which decays by 0.1 every 16 epochs. The batch size and epochs are set to 16 and 35, respectively. The numbers of graph layers in G_mt, G_st, and G_qt are empirically set to 3, 3, and 2, respectively. |
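The reported optimizer and schedule can be sketched in PyTorch as follows. This is a minimal illustration of the quoted hyperparameters only (AdamW, initial learning rate 1e-4, decay by 0.1 every 16 epochs, 35 epochs, batch size 16); the `nn.Linear` model is a placeholder, not the paper's architecture, and the loop body is elided.

```python
import torch
from torch import nn

# Placeholder model standing in for the paper's AVQA network (assumption).
model = nn.Linear(512, 42)

# Hyperparameters as quoted in the paper's experiment setup.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# "decays by 0.1 every 16 epochs" maps naturally to StepLR.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=16, gamma=0.1)

EPOCHS = 35
BATCH_SIZE = 16  # quoted batch size; the data loader itself is omitted here

for epoch in range(EPOCHS):
    # ... one pass over the training set in batches of BATCH_SIZE ...
    scheduler.step()

# With this schedule the lr decays at epochs 16 and 32, i.e. 1e-4 -> 1e-5 -> 1e-6.
final_lr = optimizer.param_groups[0]["lr"]
```

Note that `StepLR` with `step_size=16` and `gamma=0.1` reproduces the stated "decays by 0.1 every 16 epochs" exactly; any warm-up or per-module learning rates the authors may have used are not described in the paper and are not modeled here.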