SONICS: Synthetic Or Not - Identifying Counterfeit Songs

Authors: Md Awsafur Rahman, Zaber Ibn Abdul Hakim, Najibul Haque Sarker, Bishmoy Paul, Shaikh Anowarul Fattah

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | The comparative analysis of the proposed SpecTTTra models against other existing models is presented in Table 4. The results reveal a significant performance gain (6% for ConvNeXt, 8% for EfficientViT, 10% for ViT, and 17% for SpecTTTra-α) in the overall F1 score when using long songs. This finding substantiates our claim that leveraging long-context information is crucial for enhancing fake song detection. [...] We conduct an ablation study to highlight the importance of both temporal and spectral tokens, with the findings summarized in Table 7.
Researcher Affiliation | Academia | Md Awsafur Rahman, UC Santa Barbara, USA, EMAIL; Zaber Ibn Abdul Hakim, Najibul Haque Sarker, Virginia Tech, USA, EMAIL; Bishmoy Paul, Santa Clara University, USA, EMAIL; Shaikh Anowarul Fattah, BUET, Bangladesh, EMAIL
Pseudocode | Yes | Second, the pseudo-code for the Spectro-Temporal Tokenizer of the SpecTTTra model is presented in the Appendix.
Open Source Code | Yes | Code & Data available at https://github.com/awsaf49/sonics
Open Datasets | Yes | Code & Data available at https://github.com/awsaf49/sonics [...] Finally, as these fake songs are generated through paid subscriptions that allow for the use and sharing of content, our dataset will be made publicly available under a CC BY-NC 4.0 license.
Dataset Splits | Yes | We conduct all experiments using the proposed SONICS dataset, which is divided into train, valid, and test sets. To ensure comprehensive evaluation, the valid and test sets include cases with unseen algorithms (e.g., Suno v2, Suno v3, Udio 32) and unseen singers. We also prevent data leakage by ensuring that song pairs from the same (lyrics, style) inputs are exclusively in either the training or valid-test sets, not in both. The distribution of the train, test, and valid sets is shown in Table 3.
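The leakage-prevention rule described above (songs generated from the same (lyrics, style) input never straddle the train/valid boundary) amounts to a group-aware split. A minimal sketch, assuming a hypothetical `group_split` helper not taken from the SONICS codebase:

```python
import random

def group_split(pairs, valid_frac=0.2, seed=42):
    """Split song ids so that all songs sharing a (lyrics, style)
    generation key land in the same partition, preventing leakage.

    pairs: list of (song_id, (lyrics, style)) tuples.
    Returns (train_ids, valid_ids).
    """
    # Group song ids by their generation key.
    groups = {}
    for song_id, key in pairs:
        groups.setdefault(key, []).append(song_id)

    keys = sorted(groups)            # deterministic order before shuffling
    rng = random.Random(seed)
    rng.shuffle(keys)

    # Hold out a fraction of *keys* (not songs) for validation.
    n_valid = max(1, int(len(keys) * valid_frac))
    valid_keys = set(keys[:n_valid])

    train_ids, valid_ids = [], []
    for key, ids in groups.items():
        (valid_ids if key in valid_keys else train_ids).extend(ids)
    return train_ids, valid_ids
```

Splitting at the key level rather than the song level is what guarantees that a real/fake pair produced from the same prompt cannot leak across sets.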
Hardware Specification | Yes | We conduct our training on an NVIDIA A6000 GPU with 48GB RAM, using WandB for tracking. [...] To comprehensively evaluate the efficiency of the proposed SpecTTTra model alongside other methods, we measure various metrics across different song lengths using a P100 16GB GPU.
Software Dependencies | No | We use ViT-small (patch size = 16) and ConvNeXt-tiny along with EfficientViT-B2 from the timm (Wightman, 2019) library. [...] For calculating FLOPs, we employed the fvcore (FAIR, 2023) library.
Experiment Setup | Yes | To train models, we resampled both real and fake songs to 16 kHz and generated spectrograms with n_fft = win_length = 2048, hop_length = 512, and n_mels = 128, yielding a 128×128 spectrogram for 5 sec and 128×3744 for 120 sec audio. Any song shorter than input length is zero-padded randomly, while for longer songs, a random crop is used. We also apply MixUp (Zhang, 2017) and SpecAugment (Park et al., 2019) augmentations during training to improve generalization. [...] We train all models for 50 epochs from scratch using Binary Cross Entropy loss with 0.02 label smoothing (Szegedy et al., 2016). Optimization is performed with AdamW (Loshchilov, 2017) and a cosine learning rate scheduler from timm, including a 5-epoch warm-up.
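The random zero-padding / random cropping step quoted above can be sketched as follows. This is an illustrative numpy implementation under stated assumptions (function name, RNG handling, and padding placement are choices of this sketch, not taken from the SONICS code):

```python
import numpy as np

def fix_length(audio, target_len, rng=None):
    """Bring a waveform to exactly target_len samples:
    shorter clips are zero-padded at a random offset,
    longer clips get a random contiguous crop."""
    rng = rng or np.random.default_rng()
    n = len(audio)
    if n < target_len:
        # random zero-padding: place the clip at a random position
        pad = target_len - n
        left = int(rng.integers(0, pad + 1))
        return np.pad(audio, (left, pad - left))
    if n > target_len:
        # random crop: pick a random start index
        start = int(rng.integers(0, n - target_len + 1))
        return audio[start:start + target_len]
    return audio

# e.g. a 5 s input window at 16 kHz corresponds to 5 * 16000 = 80000 samples
```

Randomizing the pad offset and crop position acts as a light augmentation, so the model sees the same song at different temporal alignments across epochs.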