Melody or Machine: Detecting Synthetic Music with Dual-Stream Contrastive Learning
Authors: Arnesh Batra, Dev Sharma, Krish Thukral, Ruhani Bhatia, Naman Batra, Aditya Gautam
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate through extensive experiments that CLAM significantly outperforms existing models, achieving a new state-of-the-art F1-score of 0.925 on our challenging MoM benchmark, and provide a comprehensive analysis of the generalization failures of current detectors. |
| Researcher Affiliation | Academia | Arnesh Batra, Dev Sharma, Ruhani Bhatia, Aditya Gautam — Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), India; Krish Thukral — Manipal University Jaipur, Rajasthan, India; Naman Batra — Netaji Subhas University of Technology (NSUT), Delhi, India |
| Pseudocode | Yes | Algorithm 1: In-Batch Triplet Loss Calculation for Alignment. Input: batch with instrumental embeddings E_I, vocal embeddings E_V, and labels L. Output: average triplet loss over the real samples in the batch. E_I^real ← {e_i ∈ E_I \| L_i = 0}; E_V^real ← {e_v ∈ E_V \| L_v = 0}; N ← \|E_I^real\| (number of real samples); L_batch ← 0; count ← 0. If N > 1: for i = 0 to N−1 (indices start from 0 in code): e_a ← E_I^real[i] (anchor: instrumental embedding of real sample i); e_p ← E_V^real[i] (positive: vocal embedding of real sample i); for j = 0 to N−1 with j ≠ i: e_n ← E_V^real[j] (negative: vocal embedding of real sample j); d²_pos ← ‖e_a − e_p‖²₂ (squared L2 distance); d²_neg ← ‖e_a − e_n‖²₂; L_triplet_ij ← max(0, d²_pos − d²_neg + α); L_batch ← L_batch + L_triplet_ij; count ← count + 1. If count > 0: L_batch ← L_batch / count (average loss over valid triplets). Return L_batch. |
| Open Source Code | Yes | For reproducibility, all experiments were run with 5 seeds and the reported results are averaged across them; all code is provided in the supplementary material. |
| Open Datasets | Yes | A New, Large-Scale Benchmark for Generalization (MoM): We release the most diverse synthetic music dataset to date, featuring a wide array of generative models and a dedicated out-of-distribution test set designed to measure real-world robustness. All AI-generated songs created for the MoM dataset will be released on Hugging Face under a Creative Commons CC BY-NC 4.0 license, permitting non-commercial research use. |
| Dataset Splits | Yes | MoM provides 130,435 audio tracks organized into three operational tiers: Real, Fully Fake, and Mostly Fake, which together enable a nuanced evaluation of detection models. Table 4 reports the number of audio samples used for training/validation and testing, organized by model; models marked in red are closed-source (e.g., Suno, Udio, Riffusion, Voice Clone) and models marked in blue are open-source (e.g., Yue and Diffrythm). Train/Validation: Suno 3.5 — 23,695; Udio 1.5 — 19,500; Diffrythm — 4,606; Suno 2 — 110 (total 47,911). Test: Suno 1 — 48; Suno 3 — 3,512; Riffusion — 7,057; Yue — 5,278; Voice Clones — 1,166 (total 17,061). |
| Hardware Specification | Yes | All models were trained on an NVIDIA RTX 4060 Ti 16GB GPU. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer (Loshchilov & Hutter, 2019) but does not provide specific version numbers for software dependencies such as programming languages, libraries, or other tools. |
| Experiment Setup | Yes | All models were trained on an NVIDIA RTX 4060 Ti 16GB GPU. We used the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 1e-4, a batch size of 128, and an embedding dimension of 512 for 50 epochs. |
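The in-batch triplet loss from the paper's Algorithm 1 can be sketched in plain NumPy as below. This is a minimal illustration, not the authors' implementation (which is in their supplementary code): the function name, array shapes, and default margin are my own assumptions; only the triplet construction (anchor = instrumental embedding of a real sample, positive = its own vocal embedding, negatives = vocal embeddings of all other real samples) follows the pseudocode.

```python
import numpy as np

def in_batch_triplet_loss(emb_inst, emb_vocal, labels, margin=1.0):
    """Average triplet loss over real samples (label == 0) in a batch.

    emb_inst, emb_vocal: (B, D) arrays of instrumental / vocal embeddings.
    labels: (B,) array, 0 for real samples.
    """
    real = labels == 0
    e_i = emb_inst[real]    # instrumental embeddings of real samples
    e_v = emb_vocal[real]   # vocal embeddings of real samples
    n = len(e_i)
    total, count = 0.0, 0
    if n > 1:
        for i in range(n):
            # squared L2 distance to the matching vocal embedding (positive)
            d_pos = np.sum((e_i[i] - e_v[i]) ** 2)
            for j in range(n):
                if i != j:
                    # squared L2 distance to a mismatched vocal embedding (negative)
                    d_neg = np.sum((e_i[i] - e_v[j]) ** 2)
                    total += max(0.0, d_pos - d_neg + margin)
                    count += 1
    return total / count if count else 0.0
```

With fewer than two real samples no valid triplet exists and the loss is 0, matching the `N > 1` guard in the pseudocode.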