I Can Hear You: Selective Robust Training for Deepfake Audio Detection
Authors: Zirui Zhang, Wei Hao, Aroon Sankoh, William Lin, Emanuel Mendiola-Ortiz, Junfeng Yang, Chengzhi Mao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate that our training dataset boosts baseline model performance (without robust training) by 33%, and our robust training further improves accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples, over the state-of-the-art RawNet3 model. We first compare our approach to existing state-of-the-art methods across three benchmarks and demonstrate improved accuracy. We then assess its robustness against corruption and adversarial attacks. We finally conduct an ablation study on enhancements to the detection system's robustness. |
| Researcher Affiliation | Academia | Zirui Zhang1, Wei Hao1, Aroon Sankoh2, William Lin3, Emanuel Mendiola-Ortiz4, Junfeng Yang1, Chengzhi Mao5 1Columbia University, 2Washington University in St. Louis, 3New York University, 4Pennsylvania State University, 5Rutgers University EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Figure 10: Python code for RandAugment for audio |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the methodology described is publicly available or open-source. |
| Open Datasets | Yes | To address concerns regarding YouTube's terms of service and ethical considerations, we will not directly distribute any content sourced from YouTube. Instead, we will provide only metadata (video ID, start time, and end time), ensuring that researchers can access the content through YouTube's official interface in compliance with the platform's policies. Our training dataset now includes deepfake audio samples generated using the top seven TTS models: MetaVoice-1B (Liu et al., 2021), StyleTTS-v2 (Li et al., 2024), VoiceCraft (Peng et al., 2024), WhisperSpeech (Radford et al., 2023), Vokan TTS, XTTS-v2 (Casanova et al., 2024), and ElevenLabs. We use four datasets, VCTK (Yamagishi, 2012), LibriSpeech (Panayotov et al., 2015), In-The-Wild (Müller et al., 2022), and AudioSet (Gemmeke et al., 2017), to generate deepfake audio. For real audio, we utilize portions from six public audio datasets: VCTK, LibriSpeech, AudioSet, ASVspoof2019 (Todisco et al., 2019), VoxCeleb1 (Nagrani et al., 2017), and ASVspoof2021 (Liu et al., 2023), with half consisting of clean audio and the other half of noisy audio. |
| Dataset Splits | Yes | In this section, we introduce a new training dataset and a rigorous test set. In contrast to prior datasets, our dataset is large, diversified, realistic, and up-to-date, as shown in Table 1. Prior detectors show poor generalization capabilities in realistic settings, as shown in Figure 8. Both our training and testing datasets integrate the latest advancements in AI voice synthesis technologies. Additionally, the testing dataset includes several new models not covered in the training dataset, specifically designed to test the generalization ability of our detection systems. Our test dataset comprises approximately 6,000 samples, with an equal balance between real and fake audio. Table 1 (Comparison of Deepfake Audio Datasets, excerpt): Our Train — 690k/640k samples, English, Clean & Noisy, 2024, 40 sources (TTS, VC); Our Test — 3k/3k samples, English, Clean & Noisy, 2024, 15 sources (TTS, VC). For the experiments described in this paper, we specifically utilize seven fake sources not present in the training set. However, we also include samples from seven other fake sources used during training to facilitate future research. |
| Hardware Specification | Yes | Table 10: Training time comparison. Hardware Used: A100 GPU |
| Software Dependencies | No | The paper mentions an "Optimizer: adam" but does not provide any specific version numbers for software dependencies such as programming languages, libraries, or other tools. |
| Experiment Setup | Yes | A.8.1 TRAINING HYPERPARAMETER. Here are the training hyperparameters of F-SAT for Table 3. Training: learning rate (lr): 1e-5; batch size (bs): 16; optimizer: Adam; augmentation number (aug num): 1 or 2; augmentation probability (aug prob): 0.9; LR scheduler: warmup cosine (warm-up epochs: 1, warm-up LR: 1e-6, minimum LR: 1e-7). Attack: type: l∞; epsilon: 0.005; alpha: 0.002; gamma (controls ratio of clean loss to robust loss): 0.1; attack iterations: 2; restarts: 1; frequency range: 4–8 kHz. Mixup: alpha: 0.5 |
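The paper's Figure 10 reportedly gives Python code for RandAugment applied to audio, with an augmentation count (`aug_num`) of 1 or 2 and an application probability (`aug_prob`) of 0.9. A minimal sketch of that recipe is below; the specific augmentation operations (`add_noise`, `time_shift`, `change_volume`) and their parameters are illustrative assumptions, not the authors' exact pool:

```python
import random
import numpy as np

def add_noise(wave, strength=0.005):
    # Mix in low-amplitude Gaussian noise (illustrative augmentation).
    return wave + strength * np.random.randn(len(wave))

def time_shift(wave, max_frac=0.1):
    # Circularly shift the waveform by up to max_frac of its length.
    max_shift = int(len(wave) * max_frac)
    return np.roll(wave, random.randint(-max_shift, max_shift))

def change_volume(wave, low=0.5, high=1.5):
    # Rescale amplitude by a random gain.
    return wave * random.uniform(low, high)

AUGMENTATIONS = [add_noise, time_shift, change_volume]

def rand_augment_audio(wave, aug_num=2, aug_prob=0.9):
    # RandAugment-style policy: pick aug_num distinct operations at random
    # and apply each with probability aug_prob.
    for op in random.sample(AUGMENTATIONS, k=aug_num):
        if random.random() < aug_prob:
            wave = op(wave)
    return wave
```

The key RandAugment idea preserved here is that only two scalars (`aug_num`, `aug_prob`) control the whole policy, which matches the hyperparameters quoted in the Experiment Setup row.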
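The Experiment Setup row names a "Warmup Cosine" LR scheduler with warm-up epochs 1, warm-up LR 1e-6, base LR 1e-5, and minimum LR 1e-7. The exact schedule shape is not quoted, so the sketch below assumes the common form: linear warm-up from the warm-up LR to the base LR, then cosine decay down to the minimum LR:

```python
import math

def warmup_cosine_lr(epoch, total_epochs, base_lr=1e-5,
                     warmup_epochs=1, warmup_lr=1e-6, min_lr=1e-7):
    # Linear warm-up from warmup_lr to base_lr over warmup_epochs,
    # then cosine decay from base_lr down to min_lr.
    if epoch < warmup_epochs:
        return warmup_lr + (base_lr - warmup_lr) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With the quoted hyperparameters, the rate starts at 1e-6, reaches 1e-5 when warm-up ends, and decays to 1e-7 by the final epoch.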