Unsupervised Audio-Visual Segmentation with Modality Alignment

Authors: Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu

AAAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the AVSBench (single- and multi-object splits) and AVSS datasets demonstrate that MoCA outperforms unsupervised baseline approaches and some supervised counterparts, particularly in complex scenarios with multiple auditory objects. In terms of mIoU, MoCA achieves a substantial improvement over baselines on both the AVSBench (S4: +17.24%; MS3: +67.64%) and AVSS (+19.23%) audio-visual segmentation challenges. We perform a comprehensive evaluation and benchmark the performance of our framework against existing approaches, demonstrating that MoCA significantly outperforms the baselines on both the AVSBench (S4: +17.24%; MS3: +67.64%) and AVSS (+19.23%) datasets in terms of mIoU, narrowing the performance gap with supervised AVS alternatives.
Researcher Affiliation | Academia | 1 University of Surrey, UK; 2 Imperial College London, UK
Pseudocode | No | The paper describes the methodology in narrative text and uses diagrams, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not contain an explicit statement or a link indicating that the authors' implementation code for MoCA is publicly available. It only references a third-party tool's code for SAM at 'github.com/facebookresearch/segment-anything/blob/main/notebooks/automatic_mask_generator_example'.
Open Datasets | Yes | For training our AdaAV weights, we employ the VGGSound dataset (Chen et al. 2020), a large-scale audio-visual dataset collected from YouTube. To assess our proposed method, we utilize the AVSBench dataset (Zhou et al. 2022)... We also report scores on the AVS-Semantic (AVSS) dataset (Zhou et al. 2023). Trained on the large AudioSet dataset (Gemmeke et al. 2017)...
Dataset Splits | Yes | Benchmark settings: To assess our proposed method, we utilize the AVSBench dataset (Zhou et al. 2022), which includes YouTube videos split into five clips, each with a spatial segmentation mask for audible objects. AVSBench consists of two subsets: Semi-supervised Single Source Segmentation (S4) and fully supervised Multiple Sound Source Segmentation (MS3), differing by the number of audible objects. We also report scores on the AVS-Semantic (AVSS) dataset (Zhou et al. 2023). Our evaluation on the S4, MS3, and AVSS test splits is conducted without using audio-mask pairs from the training and validation splits, emphasizing unsupervised AVS.
Hardware Specification | Yes | We train models for a maximum of 10000 iterations on a single NVIDIA RTX A5500 GPU with batch size of 8.
Software Dependencies | No | The paper mentions the 'Adam (Loshchilov and Hutter 2017) optimizer' and 'off-the-shelf foundation models like DINO, SAM, and ImageBind' but does not specify version numbers for any software libraries or dependencies. No programming language version is mentioned either.
Experiment Setup | Yes | To optimize the model parameters, we employ the Adam (Loshchilov and Hutter 2017) optimizer with an initial learning rate of 1e-4 with cosine decay. We use λSSD = λNCC = 1 and α = 0.3. We train models for a maximum of 10000 iterations on a single NVIDIA RTX A5500 GPU with batch size of 8. For the mask proposal matching, we consider only the proposal masks with IoU > 0.5 when compared with the mask generated from the fused ViA encoder (post k-means).
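The reported learning-rate schedule (initial rate 1e-4 with cosine decay over 10000 iterations) can be sketched as below. This is a minimal illustration, not the authors' code; it assumes decay to zero over the full run with no warmup, since the paper does not state a final learning rate or warmup policy.

```python
import math

def cosine_lr(step: int, total_steps: int = 10_000, base_lr: float = 1e-4) -> float:
    """Cosine-decayed learning rate for a given training step.

    Assumption (not stated in the paper): the rate decays to zero at
    `total_steps` and there is no warmup phase.
    """
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# The schedule starts at the initial rate and decays smoothly:
print(cosine_lr(0))       # base rate 1e-4 at the first step
print(cosine_lr(5_000))   # half the base rate at the midpoint
print(cosine_lr(10_000))  # ~0 at the final step
```

In a PyTorch training loop the same schedule would typically come from `torch.optim.lr_scheduler.CosineAnnealingLR` wrapped around an Adam optimizer.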
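The mask-proposal matching step (keep only proposals with IoU > 0.5 against the mask from the fused encoder after k-means) reduces to a simple IoU filter. The sketch below uses plain 0/1 nested lists for masks; the helper names and mask representation are illustrative assumptions, not from the paper.

```python
def mask_iou(a, b):
    """Intersection-over-union between two binary masks (2-D lists of 0/1)."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0

def match_proposals(proposals, reference, thresh=0.5):
    """Keep proposal masks whose IoU with the reference mask exceeds `thresh`.

    `reference` stands in for the mask produced by the fused encoder
    (post k-means), as described in the experiment setup.
    """
    return [m for m in proposals if mask_iou(m, reference) > thresh]

# Toy example: one proposal overlaps the reference well, the other not at all.
reference = [[1, 1],
             [1, 0]]
good = [[1, 1],
        [0, 0]]   # IoU = 2/3 > 0.5, kept
bad  = [[0, 0],
        [0, 1]]   # IoU = 0, discarded
kept = match_proposals([good, bad], reference)
```

In practice the masks would be boolean arrays from SAM's automatic mask generator, and the IoU would be computed with vectorized array operations rather than Python loops.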