Unsupervised Audio-Visual Segmentation with Modality Alignment
Authors: Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the AVSBench (single- and multi-object splits) and AVSS datasets demonstrate that MoCA outperforms unsupervised baselines and some supervised counterparts, particularly in complex scenarios with multiple auditory objects. In terms of mIoU, MoCA achieves substantial improvements over the baselines on both AVSBench (S4: +17.24%; MS3: +67.64%) and AVSS (+19.23%), narrowing the performance gap with supervised AVS alternatives. |
| Researcher Affiliation | Academia | 1) University of Surrey, UK; 2) Imperial College London, UK |
| Pseudocode | No | The paper describes the methodology in narrative text and uses diagrams, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain an explicit statement or a link indicating that the authors' implementation code for MoCA is publicly available. It only references a third-party tool's code for SAM at 'github.com/facebookresearch/segment-anything/blob/main/notebooks/automatic_mask_generator_example'. |
| Open Datasets | Yes | For training our AdaAV weights, we employ the VGGSound dataset (Chen et al. 2020), a large-scale audiovisual dataset collected from YouTube. To assess our proposed method, we utilize the AVSBench dataset (Zhou et al. 2022)... We also report scores on the AVS-Semantic (AVSS) dataset (Zhou et al. 2023). Trained on the large AudioSet dataset (Gemmeke et al. 2017)... |
| Dataset Splits | Yes | Benchmark settings: To assess our proposed method, we utilize the AVSBench dataset (Zhou et al. 2022), which includes YouTube videos split into five clips, each with a spatial segmentation mask for audible objects. AVSBench consists of two subsets: Semi-supervised Single Source Segmentation (S4) and fully supervised Multiple Sound Source Segmentation (MS3), differing by the number of audible objects. We also report scores on the AVS-Semantic (AVSS) dataset (Zhou et al. 2023). Our evaluation on the S4, MS3, and AVSS test splits is conducted without using audio-mask pairs from the training and validation splits, emphasizing unsupervised AVS. |
| Hardware Specification | Yes | We train models for a maximum of 10000 iterations on a single NVIDIA RTX A5500 GPU with batch size of 8. |
| Software Dependencies | No | The paper mentions the 'Adam (Loshchilov and Hutter 2017) optimizer' and 'off-the-shelf foundation models like DINO, SAM, and ImageBind' but does not specify version numbers for any software libraries or dependencies. No programming language version is mentioned either. |
| Experiment Setup | Yes | To optimize the model parameters, we employ the Adam (Loshchilov and Hutter 2017) optimizer with an initial learning rate of 1e-4 with cosine decay. We use λSSD = λNCC = 1 and α = 0.3. We train models for a maximum of 10000 iterations on a single NVIDIA RTX A5500 GPU with a batch size of 8. For the mask proposal matching, we consider only the proposal masks with IoU > 0.5, when compared with the mask generated from the fused ViA encoder (post k-means). |
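The mask-proposal matching described in the experiment setup (keeping only proposals whose IoU with the mask from the fused ViA encoder, post k-means, exceeds 0.5) can be sketched in plain Python. This is a minimal illustrative sketch, not the authors' implementation; `mask_iou` and `filter_proposals` are hypothetical helper names, and masks are represented as nested lists of 0/1 values:

```python
def mask_iou(a, b):
    """Intersection-over-union between two binary masks of equal shape.

    Masks are lists of rows, each row a list of 0/1 values.
    """
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0


def filter_proposals(proposals, reference, threshold=0.5):
    """Keep only proposal masks with IoU above the threshold (0.5 in the paper)."""
    return [m for m in proposals if mask_iou(m, reference) > threshold]
```

For example, a proposal identical to the reference mask has IoU 1.0 and is kept, while a fully disjoint proposal has IoU 0.0 and is discarded.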