Contrastive Learning from Synthetic Audio Doppelgängers
Authors: Manuel Cherep, Nikhil Singh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a comprehensive set of experiments, we show that models trained this way can yield strong performance on a wide range of downstream tasks, competitive with real audio. Notably, our approach is lightweight, requires no data storage, and has only a single hyperparameter, which we extensively analyze. We offer this method as a complement to existing strategies for contrastive learning in audio, using synthesized sounds to reduce the data burden on practitioners. |
| Researcher Affiliation | Academia | Manuel Cherep Massachusetts Institute of Technology EMAIL Nikhil Singh Dartmouth College EMAIL |
| Pseudocode | Yes | Algorithm 1 Our contrastive learning procedure with audio doppelgängers. In the training loop, we drop the batch index i for simplicity. We also show the pairwise distance in ℓunif, though the implementation (via torch.pdist) uses a condensed representation. |
| Open Source Code | No | We will release our code and models to enable the community to experiment with synthetic data sources for audio understanding, and hope this approach will help expand the machine learning toolkit for audio processing. |
| Open Datasets | Yes | To compare to real audio data, we use sounds from VGGSound (Chen et al., 2020a), a well-known dataset taken from YouTube videos (we only use audio). These tasks cover a wide range of capabilities including sound classification tasks like ESC-50 (Piczak, 2015), FSD-50k (Fonseca et al., 2021a), and UrbanSound8K (Salamon et al., 2014), vocal affect tasks with and without speech like VIVAE (Holz et al., 2022) and CREMA-D (Cao et al., 2014), musical pitch recognition via NSynth Pitch (5h) (Engel et al., 2017), vocal sound imitation recognition using Vocal Imitations (Kim et al., 2018), and LibriCount (Stöter et al., 2018) for a cocktail-party-style speaker count estimation task. |
| Dataset Splits | Yes | We train for 200 epochs, generating (or sampling) 100,000 sounds per epoch, with a 90%-10% train-validation split. We use a batch size of 768 per GPU with two V100s. The training uses the alignment and uniformity objectives (Wang & Isola, 2020) used in prior work on learning with synthetic data (Baradad Jurjo et al., 2021). We use either the validation sets or first multi-fold splits of the target task audio. |
| Hardware Specification | Yes | We use a batch size of 768 per GPU with two V100s. |
| Software Dependencies | No | Our data generation pipeline uses virtual modular synthesizers implemented by SYNTHAX (Cherep & Singh, 2023) in JAX. By default, we use the Voice synthesizer architecture (Turian et al., 2021), which can generate perceptually diverse sounds. In our experiments, we use VGGish frontend representations (Hershey et al., 2017). We resample audio to 16 kHz and obtain mel spectrograms with 64 mel bands and 96 time steps. We use a chain of effects as augmentations (implemented in torch-audiomentations): a high-pass filter (cutoff frequency range 20-800 Hz), a low-pass filter (1.2-8 kHz), pitch shift (-2 to 2 semitones), time shift (-25% to 25%, rollover enabled), and finally reverberation for which we sample randomly from a set of impulse responses. |
| Experiment Setup | Yes | We train for 200 epochs, generating (or sampling) 100,000 sounds per epoch, with a 90%-10% train-validation split. We use a batch size of 768 per GPU with two V100s. The training uses the alignment and uniformity objectives (Wang & Isola, 2020) used in prior work on learning with synthetic data (Baradad Jurjo et al., 2021). We adopt the default parameters for these: t = 2 for ℓunif, α = 2 for ℓalign, and equal weights λ1 = λ2 = 1 for both terms. Following this work, we use stochastic gradient descent for optimization, with a maximum learning rate of 0.72 (calculated as 0.12 × total batch size / 256) and weight decay 10⁻⁶. The learning rate follows a multi-step schedule with γ = 0.1, and milestones at 77.5%, 85%, and 92.5% of the total learning epochs. |
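For reference, the alignment and uniformity objectives quoted in the pseudocode and experiment-setup rows (Wang & Isola, 2020, with the defaults α = 2, t = 2) can be sketched as below. This is an illustrative NumPy reconstruction, not the authors' code: the function names and the dense pairwise-distance matrix (instead of the condensed `torch.pdist` form the paper mentions) are our own choices.

```python
import numpy as np

def align_loss(x, y, alpha=2):
    # Alignment: mean distance between positive-pair embeddings
    # x, y: (N, D) arrays of L2-normalized embeddings for the two views.
    return np.mean(np.linalg.norm(x - y, axis=1) ** alpha)

def uniform_loss(x, t=2):
    # Uniformity: log of the mean Gaussian potential over all distinct pairs.
    # Dense (N, N) squared-distance matrix; only the upper triangle is used,
    # mirroring the condensed pairwise representation of torch.pdist.
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(x), k=1)
    return np.log(np.mean(np.exp(-t * d2[iu])))
```

With equal weights λ1 = λ2 = 1 as in the paper, the total objective is simply `align_loss(x, y) + uniform_loss(x)` (optionally averaging the uniformity term over both views).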
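The multi-step learning-rate schedule from the experiment-setup row can be sketched as follows. This is a hypothetical reconstruction for clarity, not the authors' implementation: the peak rate of 0.72 is consistent with 0.12 × (total batch size of 2 × 768 = 1536) / 256, and the milestone fractions 77.5%, 85%, and 92.5% of 200 epochs fall at epochs 155, 170, and 185.

```python
def lr_at_epoch(epoch, total_epochs=200, base_lr=0.72, gamma=0.1,
                milestone_fracs=(0.775, 0.85, 0.925)):
    """Multi-step decay: multiply the rate by gamma at each milestone epoch."""
    milestones = [round(f * total_epochs) for f in milestone_fracs]  # 155, 170, 185
    passed = sum(epoch >= m for m in milestones)  # milestones already reached
    return base_lr * gamma ** passed
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[155, 170, 185], gamma=0.1)` on an SGD optimizer with the stated weight decay.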