SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes

Authors: Tony Alex, Sara Atito, Armin Mustafa, Muhammad Awais, Philip Jackson

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We thoroughly evaluate SSLAM on standard audio SSL benchmark datasets which are predominantly monophonic and conduct a comprehensive comparative analysis against state-of-the-art (SOTA) methods using a range of high-quality, publicly available polyphonic datasets. SSLAM not only improves model performance on polyphonic audio, but also maintains or exceeds performance on standard audio SSL benchmarks. Notably, it achieves up to a 3.9% improvement on AudioSet-2M (AS-2M), reaching a mean average precision (mAP) of 50.2. For polyphonic datasets, SSLAM sets new SOTA in both linear evaluation and fine-tuning regimes with performance improvements of up to 9.1% (mAP).
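The headline metric above, mean average precision (mAP), is the per-class average precision averaged over all classes. A minimal NumPy sketch of the computation (illustrative only; the paper's reported numbers come from the standard AudioSet evaluation pipeline, and the function names here are assumptions):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: mean of the precision values measured at each
    true positive, with examples ranked by descending score."""
    order = np.argsort(-scores)
    labels = labels[order]
    hits = np.cumsum(labels)
    precision_at_k = hits / (np.arange(len(labels)) + 1)
    return float(np.sum(precision_at_k * labels) / max(labels.sum(), 1))

def mean_average_precision(score_matrix, label_matrix):
    """mAP: average the per-class AP over classes (columns)."""
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(label_matrix.shape[1])]
    return float(np.mean(aps))
```

For multi-label datasets such as AudioSet, `score_matrix` and `label_matrix` have shape (num_clips, num_classes), with binary ground-truth labels.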
Researcher Affiliation Academia Tony Alex, Sara Ahmed, Armin Mustafa, Muhammad Awais, Philip JB Jackson. Surrey Institute for People-Centred AI, University of Surrey, Guildford, GU2 7XH, UK; Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey. EMAIL
Pseudocode Yes Algorithm 1: Efficient Incorporation of Training Objectives
1: Input: A batch of log-mel spectrograms B.
2: Step 1: Create a partially mixed batch Bm by rolling and mixing B along the batch dimension.
3: Step 2: Concatenate B and Bm to form a combined batch 2B.
4: Step 3: Forward 2B through the student and teacher networks, reducing the number of multitask clones from 16 to 8 for consistency with the baseline.
5: Step 4: For SRL, mask and drop unmixed regions in B post-positional embedding and forward the result to the teacher.
6: Step 5: Compute the five training objectives using the relevant parts of the batches.
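Steps 1 and 2 of the quoted algorithm can be sketched in a few lines of NumPy (a minimal sketch only; the function name, the fixed roll shift, and the equal mixing weight are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def make_combined_batch(batch, mix_weight=0.5, shift=1):
    """Sketch of Steps 1-2 of Algorithm 1.

    batch: log-mel spectrograms, shape (B, freq, time).
    Returns the combined batch of size 2B: the original batch
    stacked with a mixed batch built by rolling along the batch axis.
    """
    # Step 1: roll the batch so each sample is paired with another sample,
    # then mix each pair to synthesize a polyphonic spectrogram Bm.
    rolled = np.roll(batch, shift=shift, axis=0)
    mixed = mix_weight * batch + (1.0 - mix_weight) * rolled
    # Step 2: concatenate B and Bm -> combined batch of size 2B.
    return np.concatenate([batch, mixed], axis=0)
```

Steps 3 to 5, forwarding through the student and teacher networks and the SRL masking of unmixed regions, happen inside the model and are not shown here.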
Open Source Code Yes Code and pre-trained models are available at https://github.com/ta012/SSLAM.
Open Datasets Yes For pre-training, we utilized the AS-2M dataset without any label information. For downstream evaluation, we employed various audio SSL benchmark datasets, including AS-2M, AS-20K, ESC50, KS1, and KS2, as well as polyphonic datasets such as SPASS, IDMT-DESED-FL, and URBANSED. More information about these datasets can be found in Appendix B. Audio Set (Gemmeke et al., 2017) is a large-scale dataset... Environmental Sound Classification (ESC-50) (Piczak, 2015) is a collection... Speech Commands (KS1, KS2) (Warden, 2018) are datasets... IDMT-DESED-FL (Johnson et al., 2021) dataset was created... URBAN-SED (Salamon et al., 2017) introduced the Scaper library...
Dataset Splits Yes Environmental Sound Classification (ESC-50) (Piczak, 2015)... we employ a 5-fold cross-validation setting and report the classification accuracy as the evaluation metric. Speech Commands (KS1, KS2) (Warden, 2018)... we train models on the training split, select the best-performing model based on validation, and report test results. SPASS... Each soundscape contains approximately 3,750 samples in the training split and 1,250 samples in the evaluation split. IDMT-DESED-FL (Johnson et al., 2021)... The training split contains 10,000 audio files, while the evaluation split includes 2,000 files. URBAN-SED (Salamon et al., 2017)... the training set consists of 5,268 audio files, and the evaluation set contains 1,739 files.
Hardware Specification Yes All pre-training experiments were conducted on 4 Nvidia 3090 GPUs, with each epoch taking 7 hours in Stage 1 and 7.5 hours in Stage 2. All downstream tasks, except for AS-2M, were trained using 1 Nvidia 3090 GPU, while AS-2M used 1 Nvidia A100 GPU.
Software Dependencies No The paper mentions the optimizer `AdamW (Loshchilov & Hutter, 2017)` and several data augmentation/regularization techniques such as `Dropout (Srivastava et al., 2014)`, `DropPath (Huang et al., 2016)`, `SpecAug (Park et al., 2019)`, and `Mixup (Zhang et al., 2017)`. These refer to algorithms or methods described in the cited papers, not specific software libraries with version numbers. There is no explicit mention of programming languages or deep learning frameworks with their version numbers.
Experiment Setup Yes Table 7 (SSLAM pre-training and audio SSL benchmark dataset fine-tuning hyper-parameters) and Table 8 (SSLAM polyphonic datasets linear evaluation and fine-tuning hyper-parameters) provide detailed hyperparameter settings for both pre-training and fine-tuning stages, including Optimizer, Optimizer Momentum, Weight Decay, Learning Rate Schedule, Peak Learning Rate, Minimum Learning Rate, Steps/Epochs, Warm-up Steps/Epochs, Batch Size, Dropout, DropPath, various augmentation parameters (Weighted Sampling, Roll Augmentation, Noise Augmentation, SpecAug, Mixup), Multilabel setting, Loss Function, and Dataset Mean/Std for Normalization.