Melody or Machine: Detecting Synthetic Music with Dual-Stream Contrastive Learning

Authors: Arnesh Batra, Dev Sharma, Krish Thukral, Ruhani Bhatia, Naman Batra, Aditya Gautam

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate through extensive experiments that CLAM significantly outperforms existing models, achieving a new state-of-the-art F1-score of 0.925 on our challenging MoM benchmark, and provide a comprehensive analysis of the generalization failures of current detectors.
Researcher Affiliation Academia Arnesh Batra, Dev Sharma, Ruhani Bhatia, and Aditya Gautam: Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), India. Krish Thukral: Manipal University Jaipur, Rajasthan, India. Naman Batra: Netaji Subhas University of Technology (NSUT), Delhi, India.
Pseudocode Yes Algorithm 1: In-Batch Triplet Loss Calculation for Alignment
Input: Batch of data containing instrumental embeddings E_I, vocal embeddings E_V, and labels L.
Output: Average triplet loss for the real samples in the batch.
  E_I^real ← {e_i ∈ E_I | L_i = 0}
  E_V^real ← {e_v ∈ E_V | L_v = 0}
  N ← |E_I^real|                         // number of real samples in the batch
  L_batch ← 0; count ← 0
  if N > 1 then
    for i ← 0 to N − 1 do               // indices start from 0 in code
      e_a ← E_I^real[i]                 // anchor: instrumental embedding of real sample i
      e_p ← E_V^real[i]                 // positive: vocal embedding of real sample i
      for j ← 0 to N − 1 do
        if i ≠ j then
          e_n ← E_V^real[j]             // negative: vocal embedding of real sample j
          d²_pos ← ‖e_a − e_p‖²₂        // squared L2 distance
          d²_neg ← ‖e_a − e_n‖²₂
          L_triplet_ij ← max(0, d²_pos − d²_neg + α)
          L_batch ← L_batch + L_triplet_ij
          count ← count + 1
        end if
      end for
    end for
  end if
  if count > 0 then
    L_batch ← L_batch / count           // average loss over valid triplets
  end if
  return L_batch
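The algorithm above can be sketched directly in plain Python. This is an illustrative transcription, not the authors' code: the function name, the list-of-floats embedding format, and the default margin alpha are assumptions for the example.

```python
def sq_l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def in_batch_triplet_loss(emb_inst, emb_voc, labels, alpha=0.2):
    """Average in-batch triplet loss over real samples (label == 0).

    Anchor:   instrumental embedding of real sample i
    Positive: vocal embedding of the same real sample i
    Negative: vocal embedding of a different real sample j
    """
    # Keep only the embeddings whose label marks them as real.
    real_i = [e for e, lab in zip(emb_inst, labels) if lab == 0]
    real_v = [e for e, lab in zip(emb_voc, labels) if lab == 0]
    n = len(real_i)
    total, count = 0.0, 0
    if n > 1:
        for i in range(n):
            e_a, e_p = real_i[i], real_v[i]
            d_pos = sq_l2(e_a, e_p)
            for j in range(n):
                if i != j:
                    d_neg = sq_l2(e_a, real_v[j])
                    # Hinge: pull the matched pair closer than any
                    # mismatched pair by at least the margin alpha.
                    total += max(0.0, d_pos - d_neg + alpha)
                    count += 1
    return total / count if count > 0 else 0.0
```

With fewer than two real samples in the batch, no valid triplet exists and the function returns 0, matching the N > 1 guard in the pseudocode.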
Open Source Code Yes For reproducibility, all experiments were run with 5 seeds and the reported results are averaged across them; all code is provided in the supplementary material.
Open Datasets Yes A New, Large-Scale Benchmark for Generalization (MoM): We release the most diverse synthetic music dataset to date, featuring a wide array of generative models and a dedicated out-of-distribution test set designed to measure real-world robustness. All AI-generated songs created for the MoM dataset will be released on Hugging Face under a Creative Commons CC BY-NC 4.0 license, permitting non-commercial research use.
Dataset Splits Yes MoM provides 130,435 audio tracks organized into three operational tiers: Real, Fully Fake, and Mostly Fake, which together enable a nuanced evaluation of detection models. Table 4: Number of audio samples used for training/validation and testing, organized by model. Models marked in red are closed-source (e.g., Suno, Udio, Riffusion, Voice Clone) and models marked in blue are open-source (e.g., Yue and Diffrythm).

Model         | Train / Validation | Test
Suno 3.5      | 23695              | -
Udio 1.5      | 19500              | -
Diffrythm     | 4606               | -
Suno 2        | 110                | -
Suno 1        | -                  | 48
Suno 3        | -                  | 3512
Riffusion     | -                  | 7057
Yue           | -                  | 5278
Voice Clones  | -                  | 1166
Total         | 47911              | 17061
Hardware Specification Yes All models were trained on an NVIDIA RTX 4060 Ti 16GB GPU.
Software Dependencies No The paper mentions using the AdamW optimizer (Loshchilov & Hutter, 2019) but does not provide specific version numbers for software dependencies such as programming languages, libraries, or other tools.
Experiment Setup Yes All models were trained on an NVIDIA RTX 4060 Ti 16GB GPU. We used the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 1e-4, a batch size of 128, and an embedding dimension of 512 for 50 epochs.
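The reported hyperparameters can be sketched as a minimal PyTorch training-loop configuration. Only AdamW, the learning rate of 1e-4, batch size 128, embedding dimension 512, and 50 epochs come from the paper; the toy two-layer model, the random stand-in data, and the BCE loss are placeholders, not the authors' dual-stream architecture.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters as reported in the paper.
LR, BATCH_SIZE, EMBED_DIM, EPOCHS = 1e-4, 128, 512, 50

# Placeholder encoder + binary classifier head; the paper's actual
# dual-stream model is not reproduced here.
model = nn.Sequential(nn.Linear(64, EMBED_DIM), nn.ReLU(), nn.Linear(EMBED_DIM, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
criterion = nn.BCEWithLogitsLoss()

# Random stand-in data: 256 samples of 64-d features, binary labels.
data = TensorDataset(torch.randn(256, 64), torch.randint(0, 2, (256, 1)).float())
loader = DataLoader(data, batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```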