Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data
Authors: Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned Audio Set. |
| Researcher Affiliation | Collaboration | NVIDIA, CA, USA; University of Maryland, College Park, USA |
| Pseudocode | Yes | Appendix A.6 (ALGORITHM): Algorithm 1 illustrates Synthio algorithmically. |
| Open Source Code | Yes | Our project page has all the codes and checkpoints to reproduce the results in the paper. All experimental details, including training parameters and hyper-parameters, are provided in Section 5. Project: https://sreyan88.github.io/Synthio/ |
| Open Datasets | Yes | Our selected datasets include a mix of music, everyday sounds, and acoustic scenes. For multi-class classification, we use NSynth Instruments, TUT Urban, ESC50 (Piczak), USD8K (Salamon et al., 2014), GTZAN (Tzanetakis et al., 2001), Medley-solos-DB (Lostanlen & Cella, 2017), MUSDB18 (Rafii et al., 2017), DCASE Task 4 (Mesaros et al., 2017), and Vocal Sounds (VS) (Mesaros et al., 2017), evaluating them for accuracy. For multi-label classification, we use the FSD50K (Fonseca et al., 2022) dataset and evaluate it using the macro-averaged F1 metric. We exclude Audio Set from evaluation as Sound-VECaps is derived from it. |
| Dataset Splits | Yes | Our experiments are conducted with n = {50, 100, 200, 500} samples, and we downsample the validation sets for training while evaluating all models on the original test splits. To ensure a downsampled dataset that has a label distribution similar to that of the original dataset, we employ stratified sampling based on categories. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It mentions the T2A model and classification model architectures but not the underlying hardware. |
| Software Dependencies | No | The paper mentions employing an AdamW optimizer, a learning rate, and a weight decay, along with using the Audio Spectrogram Transformer (AST) and Stable Audio architectures. However, it does not specify software versions for these tools or any other libraries (e.g., Python version, PyTorch version, CUDA version). |
| Experiment Setup | Yes | For training, we employ a batch size of 64, an AdamW optimizer, a learning rate of 5e-4, and a weight decay of 1e-3 for 40 epochs. For DPO-based alignment tuning, we generate j = 2 losers and fine-tune with a batch size of 32 and a learning rate of 5e-4 for 12 epochs. For our audio classification model, we employ the Audio Spectrogram Transformer (AST) (Gong et al., 2021) (pre-trained on the Audio Set dataset) and fine-tune it with a batch size of 24 and a learning rate of 1e-4 for 50 epochs. For CLAP filtering, we employ p = 0.85. For prompting our diffusion model we use Text CFG=7.0. In each experiment, we adjust the number of generated augmentations N (ranging from 1 to 5) based on performance on the validation set. All results are averaged across 3 runs. |
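The "Dataset Splits" row describes stratified downsampling to n = {50, 100, 200, 500} samples while preserving the original label distribution. A minimal sketch of that step is below; the function name, proportional-allocation rule, and seed handling are our assumptions, not the authors' code (the paper does not publish this routine in the quoted text).

```python
import random
from collections import defaultdict

def stratified_downsample(samples, labels, n, seed=0):
    """Downsample to ~n items while keeping per-class proportions.

    Assumed helper for illustration: allocates each class a share of n
    proportional to its frequency (at least 1 sample per class), then
    samples without replacement within each class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    total = len(samples)
    picked = []
    for label, items in by_class.items():
        # Proportional allocation, rounded, never exceeding the class size.
        k = max(1, round(n * len(items) / total))
        picked.extend((s, label) for s in rng.sample(items, min(k, len(items))))
    rng.shuffle(picked)
    return picked

# Example: 80/20 class split, downsampled to n = 50 keeps roughly 40/10.
data = stratified_downsample(list(range(100)), ["a"] * 80 + ["b"] * 20, n=50)
```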
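The "Experiment Setup" row mentions CLAP filtering with threshold p = 0.85, i.e., discarding synthetic clips whose CLAP audio-text similarity to the target label falls below p. A toy sketch of that filtering step, assuming precomputed embeddings and cosine similarity (the function names and embedding shapes are hypothetical; the paper's actual filtering code is on the project page):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clap_filter(audio_embs, text_emb, p=0.85):
    """Return indices of synthetic clips whose similarity to the
    label-prompt embedding meets the threshold p (0.85 in the paper)."""
    return [i for i, emb in enumerate(audio_embs) if cosine(emb, text_emb) >= p]

# Example: with a label embedding along the first axis, only clips 0 and 2 pass.
kept = clap_filter([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]], [1.0, 0.0], p=0.85)
```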