Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data
Authors: Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned Audio Set. |
| Researcher Affiliation | Collaboration | NVIDIA, CA, USA; University of Maryland, College Park, USA |
| Pseudocode | Yes | Appendix A.6 (ALGORITHM): Algorithm 1 illustrates Synthio algorithmically. |
| Open Source Code | Yes | Our project page has all the codes and checkpoints to reproduce the results in the paper. All experimental details, including training parameters and hyper-parameters, are provided in Section 5. Project: https://sreyan88.github.io/Synthio/ |
| Open Datasets | Yes | Our selected datasets include a mix of music, everyday sounds, and acoustic scenes. For multi-class classification, we use NSynth Instruments, TUT Urban, ESC50 (Piczak), USD8K (Salamon et al., 2014), GTZAN (Tzanetakis et al., 2001), Medley-solos-DB (Lostanlen & Cella, 2017), MUSDB18 (Rafii et al., 2017), DCASE Task 4 (Mesaros et al., 2017), and Vocal Sounds (VS) (Mesaros et al., 2017), evaluating them for accuracy. For multi-label classification, we use the FSD50K (Fonseca et al., 2022) dataset and evaluate it using the macro-averaged F1 metric. We exclude Audio Set from evaluation as Sound-VECaps is derived from it. |
| Dataset Splits | Yes | Our experiments are conducted with n = {50, 100, 200, 500} samples, and we downsample the validation sets for training while evaluating all models on the original test splits. To ensure a downsampled dataset that has a label distribution similar to that of the original dataset, we employ stratified sampling based on categories. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It mentions the T2A model and classification model architectures but not the underlying hardware. |
| Software Dependencies | No | The paper mentions employing an AdamW optimizer, a learning rate, and a weight decay, along with using the Audio Spectrogram Transformer (AST) and Stable Audio architectures. However, it does not specify software versions for these tools or any other libraries (e.g., Python version, PyTorch version, CUDA version). |
| Experiment Setup | Yes | For training, we employ a batch size of 64, an AdamW optimizer, a learning rate of 5e-4, and a weight decay of 1e-3 for 40 epochs. For DPO-based alignment tuning, we generate j = 2 losers and fine-tune with a batch size of 32 and a learning rate of 5e-4 for 12 epochs. For our audio classification model, we employ the Audio Spectrogram Transformer (AST) (Gong et al., 2021) (pre-trained on the Audio Set dataset) and fine-tune it with a batch size of 24 and a learning rate of 1e-4 for 50 epochs. For CLAP filtering, we employ p = 0.85. For prompting our diffusion model we use Text CFG=7.0. In each experiment, we adjust the number of generated augmentations N (ranging from 1 to 5) based on performance on the validation set. All results are averaged across 3 runs. |
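The "Dataset Splits" row describes stratified downsampling to n = {50, 100, 200, 500} samples while preserving the original label distribution. A minimal sketch of that step is below; the function name, proportional-allocation rule, and seed handling are our assumptions, not the authors' code (the paper does not publish this routine in the quoted text).

```python
import random
from collections import defaultdict

def stratified_downsample(samples, labels, n, seed=0):
    """Downsample to ~n items while keeping per-class proportions.

    Assumed helper for illustration: allocates each class a share of n
    proportional to its frequency (at least 1 sample per class), then
    samples without replacement within each class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    total = len(samples)
    picked = []
    for label, items in by_class.items():
        # Proportional allocation, rounded, never exceeding the class size.
        k = max(1, round(n * len(items) / total))
        picked.extend((s, label) for s in rng.sample(items, min(k, len(items))))
    rng.shuffle(picked)
    return picked

# Example: 80/20 class split, downsampled to n = 50 keeps roughly 40/10.
data = stratified_downsample(list(range(100)), ["a"] * 80 + ["b"] * 20, n=50)
```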
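The "Experiment Setup" row mentions CLAP filtering with threshold p = 0.85, i.e., discarding synthetic clips whose CLAP audio-text similarity to the target label falls below p. A toy sketch of that filtering step, assuming precomputed embeddings and cosine similarity (the function names and embedding shapes are hypothetical; the paper's actual filtering code is on the project page):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clap_filter(audio_embs, text_emb, p=0.85):
    """Return indices of synthetic clips whose similarity to the
    label-prompt embedding meets the threshold p (0.85 in the paper)."""
    return [i for i, emb in enumerate(audio_embs) if cosine(emb, text_emb) >= p]

# Example: with a label embedding along the first axis, only clips 0 and 2 pass.
kept = clap_filter([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]], [1.0, 0.0], p=0.85)
```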