SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes

Authors: Tony Alex, Sara Atito, Armin Mustafa, Muhammad Awais, Philip Jackson

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We thoroughly evaluate SSLAM on standard audio SSL benchmark datasets which are predominantly monophonic and conduct a comprehensive comparative analysis against state-of-the-art (SOTA) methods using a range of high-quality, publicly available polyphonic datasets. SSLAM not only improves model performance on polyphonic audio, but also maintains or exceeds performance on standard audio SSL benchmarks. Notably, it achieves up to a 3.9% improvement on AudioSet-2M (AS-2M), reaching a mean average precision (mAP) of 50.2. For polyphonic datasets, SSLAM sets new SOTA in both linear evaluation and fine-tuning regimes with performance improvements of up to 9.1% (mAP).
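The headline metric above, mean average precision (mAP), is the per-class average precision averaged over all classes. A minimal NumPy sketch of the computation (illustrative only; the paper's reported numbers come from the standard AudioSet evaluation pipeline, and the function names here are assumptions):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: mean of the precision values measured at each
    true positive, with examples ranked by descending score."""
    order = np.argsort(-scores)
    labels = labels[order]
    hits = np.cumsum(labels)
    precision_at_k = hits / (np.arange(len(labels)) + 1)
    return float(np.sum(precision_at_k * labels) / max(labels.sum(), 1))

def mean_average_precision(score_matrix, label_matrix):
    """mAP: average the per-class AP over classes (columns)."""
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(label_matrix.shape[1])]
    return float(np.mean(aps))
```

For multi-label datasets such as AudioSet, `score_matrix` and `label_matrix` have shape (num_clips, num_classes), with binary ground-truth labels.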
Researcher Affiliation Academia Tony Alex, Sara Ahmed, Armin Mustafa, Muhammad Awais, Philip JB Jackson. Surrey Institute for People-Centred AI, University of Surrey, Guildford, GU2 7XH, UK; Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey. EMAIL
Pseudocode Yes Algorithm 1: Efficient Incorporation of Training Objectives
1: Input: A batch of log-mel spectrograms B.
2: Step 1: Create a partially mixed batch Bm by rolling and mixing B along the batch dimension.
3: Step 2: Concatenate B and Bm to form a combined batch 2B.
4: Step 3: Forward 2B through the student and teacher networks, reducing the number of multitask clones from 16 to 8 for consistency with the baseline.
5: Step 4: For SRL, mask and drop unmixed regions in B post-positional embedding and forward the result to the teacher.
6: Step 5: Compute the five training objectives using the relevant parts of the batches.
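Steps 1 and 2 of the quoted algorithm can be sketched in a few lines of NumPy (a minimal sketch only; the function name, the fixed roll shift, and the equal mixing weight are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def make_combined_batch(batch, mix_weight=0.5, shift=1):
    """Sketch of Steps 1-2 of Algorithm 1.

    batch: log-mel spectrograms, shape (B, freq, time).
    Returns the combined batch of size 2B: the original batch
    stacked with a mixed batch built by rolling along the batch axis.
    """
    # Step 1: roll the batch so each sample is paired with another sample,
    # then mix each pair to synthesize a polyphonic spectrogram Bm.
    rolled = np.roll(batch, shift=shift, axis=0)
    mixed = mix_weight * batch + (1.0 - mix_weight) * rolled
    # Step 2: concatenate B and Bm -> combined batch of size 2B.
    return np.concatenate([batch, mixed], axis=0)
```

Steps 3 to 5, forwarding through the student and teacher networks and the SRL masking of unmixed regions, happen inside the model and are not shown here.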
Open Source Code Yes Code and pre-trained models are available at https://github.com/ta012/SSLAM.
Open Datasets Yes For pre-training, we utilized the AS-2M dataset without any label information. For downstream evaluation, we employed various audio SSL benchmark datasets, including AS-2M, AS-20K, ESC50, KS1, and KS2, as well as polyphonic datasets such as SPASS, IDMT-DESED-FL, and URBANSED. More information about these datasets can be found in Appendix B. Audio Set (Gemmeke et al., 2017) is a large-scale dataset... Environmental Sound Classification (ESC-50) (Piczak, 2015) is a collection... Speech Commands (KS1, KS2) (Warden, 2018) are datasets... IDMT-DESED-FL (Johnson et al., 2021) dataset was created... URBAN-SED (Salamon et al., 2017) introduced the Scaper library...
Dataset Splits Yes Environmental Sound Classification (ESC-50) (Piczak, 2015)... we employ a 5-fold cross-validation setting and report the classification accuracy as the evaluation metric. Speech Commands (KS1, KS2) (Warden, 2018)... we train models on the training split, select the best-performing model based on validation, and report test results. SPASS... Each soundscape contains approximately 3,750 samples in the training split and 1,250 samples in the evaluation split. IDMT-DESED-FL (Johnson et al., 2021)... The training split contains 10,000 audio files, while the evaluation split includes 2,000 files. URBAN-SED (Salamon et al., 2017)... the training set consists of 5,268 audio files, and the evaluation set contains 1,739 files.
Hardware Specification Yes All pre-training experiments were conducted on 4 Nvidia 3090 GPUs, with each epoch taking 7 hours in Stage 1 and 7.5 hours in Stage 2. All downstream tasks, except for AS-2M, were trained using 1 Nvidia 3090 GPU, while AS-2M used 1 Nvidia A100 GPU.
Software Dependencies No The paper mentions the optimizer `AdamW (Loshchilov & Hutter, 2017)` and several data augmentation/regularization techniques such as `Dropout (Srivastava et al., 2014)`, `DropPath (Huang et al., 2016)`, `SpecAug (Park et al., 2019)`, and `Mixup (Zhang et al., 2017)`. These refer to algorithms or methods described in the cited papers, not specific software libraries with version numbers. There is no explicit mention of programming languages or deep learning frameworks with their version numbers.
Experiment Setup Yes Table 7 (SSLAM pre-training and audio SSL benchmark dataset fine-tuning hyper-parameters) and Table 8 (SSLAM polyphonic datasets linear evaluation and fine-tuning hyper-parameters) provide detailed hyperparameter settings for both pre-training and fine-tuning stages, including Optimizer, Optimizer Momentum, Weight Decay, Learning Rate Schedule, Peak Learning Rate, Minimum Learning Rate, Steps/Epochs, Warm-up Steps/Epochs, Batch Size, Dropout, DropPath, various augmentation parameters (Weighted Sampling, Roll Augmentation, Noise Augmentation, SpecAug, Mixup), Multilabel setting, Loss Function, and Dataset Mean/Std for Normalization.