Contrastive Learning from Synthetic Audio Doppelgängers
Authors: Manuel Cherep, Nikhil Singh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a comprehensive set of experiments, we show that models trained this way can yield strong performance on a wide range of downstream tasks, competitive with real audio. Notably, our approach is lightweight, requires no data storage, and has only a single hyperparameter, which we extensively analyze. We offer this method as a complement to existing strategies for contrastive learning in audio, using synthesized sounds to reduce the data burden on practitioners. |
| Researcher Affiliation | Academia | Manuel Cherep Massachusetts Institute of Technology EMAIL Nikhil Singh Dartmouth College EMAIL |
| Pseudocode | Yes | Algorithm 1 Our contrastive learning procedure with audio doppelgängers. In the training loop, we drop the batch index i for simplicity. We also show the pairwise distance in ℓunif, though the implementation (via torch.pdist) uses a condensed representation. |
| Open Source Code | No | We will release our code and models to enable the community to experiment with synthetic data sources for audio understanding, and hope this approach will help expand the machine learning toolkit for audio processing. |
| Open Datasets | Yes | To compare to real audio data, we use sounds from VGGSound (Chen et al., 2020a), a well-known dataset taken from YouTube videos (we only use audio). These tasks cover a wide range of capabilities including sound classification tasks like ESC-50 (Piczak, 2015), FSD-50k (Fonseca et al., 2021a), and UrbanSound8K (Salamon et al., 2014), vocal affect tasks with and without speech like VIVAE (Holz et al., 2022) and CREMA-D (Cao et al., 2014), musical pitch recognition via NSynth Pitch (5h) (Engel et al., 2017), vocal sound imitation recognition using Vocal Imitations (Kim et al., 2018), and LibriCount (Stöter et al., 2018) for a cocktail-party-style speaker count estimation task. |
| Dataset Splits | Yes | We train for 200 epochs, generating (or sampling) 100,000 sounds per epoch, with a 90%-10% train-validation split. We use a batch size of 768 per GPU with two V100s. The training uses the alignment and uniformity objectives (Wang & Isola, 2020) used in prior work on learning with synthetic data (Baradad Jurjo et al., 2021). We use either the validation sets or first multi-fold splits of the target task audio. |
| Hardware Specification | Yes | We use a batch size of 768 per GPU with two V100s. |
| Software Dependencies | No | Our data generation pipeline uses virtual modular synthesizers implemented by SYNTHAX (Cherep & Singh, 2023) in JAX. By default, we use the Voice synthesizer architecture (Turian et al., 2021), which can generate perceptually diverse sounds. In our experiments, we use VGGish frontend representations (Hershey et al., 2017). We resample audio to 16 kHz and obtain mel spectrograms with 64 mel bands and 96 time steps. We use a chain of effects as augmentations (implemented in torch-audiomentations): a high-pass filter (cutoff frequency range 20-800 Hz), a low-pass filter (1.2-8 kHz), pitch shift (-2 to 2 semitones), time shift (-25% to 25%, rollover enabled), and finally reverberation for which we sample randomly from a set of impulse responses. |
| Experiment Setup | Yes | We train for 200 epochs, generating (or sampling) 100,000 sounds per epoch, with a 90%-10% train-validation split. We use a batch size of 768 per GPU with two V100s. The training uses the alignment and uniformity objectives (Wang & Isola, 2020) used in prior work on learning with synthetic data (Baradad Jurjo et al., 2021). We adopt the default parameters for these: t = 2 for ℓunif, α = 2 for ℓalign, and equal weights λ1 = λ2 = 1 for both terms. Following this work, we use stochastic gradient descent for optimization, with a maximum learning rate of 0.72 (calculated as 0.12 × total batch size / 256) and weight decay 10⁻⁶. The learning rate follows a multi-step schedule with γ = 0.1, and milestones at 77.5%, 85%, and 92.5% of the total learning epochs. |
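For reference, the alignment and uniformity objectives quoted in the pseudocode and experiment-setup rows (Wang & Isola, 2020, with the defaults α = 2, t = 2) can be sketched as below. This is an illustrative NumPy reconstruction, not the authors' code: the function names and the dense pairwise-distance matrix (instead of the condensed `torch.pdist` form the paper mentions) are our own choices.

```python
import numpy as np

def align_loss(x, y, alpha=2):
    # Alignment: mean distance between positive-pair embeddings
    # x, y: (N, D) arrays of L2-normalized embeddings for the two views.
    return np.mean(np.linalg.norm(x - y, axis=1) ** alpha)

def uniform_loss(x, t=2):
    # Uniformity: log of the mean Gaussian potential over all distinct pairs.
    # Dense (N, N) squared-distance matrix; only the upper triangle is used,
    # mirroring the condensed pairwise representation of torch.pdist.
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(x), k=1)
    return np.log(np.mean(np.exp(-t * d2[iu])))
```

With equal weights λ1 = λ2 = 1 as in the paper, the total objective is simply `align_loss(x, y) + uniform_loss(x)` (optionally averaging the uniformity term over both views).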
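The multi-step learning-rate schedule from the experiment-setup row can be sketched as follows. This is a hypothetical reconstruction for clarity, not the authors' implementation: the peak rate of 0.72 is consistent with 0.12 × (total batch size of 2 × 768 = 1536) / 256, and the milestone fractions 77.5%, 85%, and 92.5% of 200 epochs fall at epochs 155, 170, and 185.

```python
def lr_at_epoch(epoch, total_epochs=200, base_lr=0.72, gamma=0.1,
                milestone_fracs=(0.775, 0.85, 0.925)):
    """Multi-step decay: multiply the rate by gamma at each milestone epoch."""
    milestones = [round(f * total_epochs) for f in milestone_fracs]  # 155, 170, 185
    passed = sum(epoch >= m for m in milestones)  # milestones already reached
    return base_lr * gamma ** passed
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[155, 170, 185], gamma=0.1)` on an SGD optimizer with the stated weight decay.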