Unsupervised Audio-Visual Segmentation with Modality Alignment

Authors: Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu

AAAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the AVSBench (single- and multi-object splits) and AVSS datasets demonstrate that MoCA outperforms unsupervised baseline approaches and some supervised counterparts, particularly in complex scenarios with multiple auditory objects. In terms of mIoU, MoCA achieves a substantial improvement over baselines on both the AVSBench (S4: +17.24%; MS3: +67.64%) and AVSS (+19.23%) audio-visual segmentation challenges. We perform a comprehensive evaluation and benchmark the performance of our framework against existing approaches, demonstrating that MoCA significantly outperforms the baselines on both the AVSBench (S4: +17.24%; MS3: +67.64%) and AVSS (+19.23%) datasets in terms of mIoU, narrowing the performance gap with supervised AVS alternatives.
Researcher Affiliation | Academia | 1 University of Surrey, UK; 2 Imperial College London, UK
Pseudocode | No | The paper describes the methodology in narrative text and uses diagrams, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not contain an explicit statement or a link indicating that the authors' implementation code for MoCA is publicly available. It only references a third-party tool's code for SAM at 'github.com/facebookresearch/segment-anything/blob/main/notebooks/automatic_mask_generator_example'.
Open Datasets | Yes | For training our AdaAV weights, we employ the VGGSound dataset (Chen et al. 2020), a large-scale audio-visual dataset collected from YouTube. To assess our proposed method, we utilize the AVSBench dataset (Zhou et al. 2022)... We also report scores on the AVS-Semantic (AVSS) dataset (Zhou et al. 2023). Trained on the large AudioSet dataset (Gemmeke et al. 2017)...
Dataset Splits | Yes | Benchmark settings: To assess our proposed method, we utilize the AVSBench dataset (Zhou et al. 2022), which includes YouTube videos split into five clips, each with a spatial segmentation mask for audible objects. AVSBench consists of two subsets: Semi-supervised Single Source Segmentation (S4) and fully supervised Multiple Sound Source Segmentation (MS3), differing by the number of audible objects. We also report scores on the AVS-Semantic (AVSS) dataset (Zhou et al. 2023). Our evaluation on the S4, MS3, and AVSS test splits is conducted without using audio-mask pairs from the training and validation splits, emphasizing unsupervised AVS.
Hardware Specification | Yes | We train models for a maximum of 10000 iterations on a single NVIDIA RTX A5500 GPU with batch size of 8.
Software Dependencies | No | The paper mentions the 'Adam (Loshchilov and Hutter 2017) optimizer' and 'off-the-shelf foundation models like DINO, SAM, and ImageBind' but does not specify version numbers for any software libraries or dependencies. No programming language version is mentioned either.
Experiment Setup | Yes | To optimize the model parameters, we employ the Adam (Loshchilov and Hutter 2017) optimizer with an initial learning rate of 1e-4 with cosine decay. We use λSSD = λNCC = 1 and α = 0.3. We train models for a maximum of 10000 iterations on a single NVIDIA RTX A5500 GPU with batch size of 8. For the mask proposal matching, we consider only the proposal masks with IoU > 0.5 when compared with the mask generated from the fused ViA encoder (post k-means).
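The reported learning-rate schedule (initial rate 1e-4 with cosine decay over 10000 iterations) can be sketched as below. This is a minimal illustration, not the authors' code; it assumes decay to zero over the full run with no warmup, since the paper does not state a final learning rate or warmup policy.

```python
import math

def cosine_lr(step: int, total_steps: int = 10_000, base_lr: float = 1e-4) -> float:
    """Cosine-decayed learning rate for a given training step.

    Assumption (not stated in the paper): the rate decays to zero at
    `total_steps` and there is no warmup phase.
    """
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# The schedule starts at the initial rate and decays smoothly:
print(cosine_lr(0))       # base rate 1e-4 at the first step
print(cosine_lr(5_000))   # half the base rate at the midpoint
print(cosine_lr(10_000))  # ~0 at the final step
```

In a PyTorch training loop the same schedule would typically come from `torch.optim.lr_scheduler.CosineAnnealingLR` wrapped around an Adam optimizer.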
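The mask-proposal matching step (keep only proposals with IoU > 0.5 against the mask from the fused encoder after k-means) reduces to a simple IoU filter. The sketch below uses plain 0/1 nested lists for masks; the helper names and mask representation are illustrative assumptions, not from the paper.

```python
def mask_iou(a, b):
    """Intersection-over-union between two binary masks (2-D lists of 0/1)."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0

def match_proposals(proposals, reference, thresh=0.5):
    """Keep proposal masks whose IoU with the reference mask exceeds `thresh`.

    `reference` stands in for the mask produced by the fused encoder
    (post k-means), as described in the experiment setup.
    """
    return [m for m in proposals if mask_iou(m, reference) > thresh]

# Toy example: one proposal overlaps the reference well, the other not at all.
reference = [[1, 1],
             [1, 0]]
good = [[1, 1],
        [0, 0]]   # IoU = 2/3 > 0.5, kept
bad  = [[0, 0],
        [0, 1]]   # IoU = 0, discarded
kept = match_proposals([good, bad], reference)
```

In practice the masks would be boolean arrays from SAM's automatic mask generator, and the IoU would be computed with vectorized array operations rather than Python loops.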