I Can Hear You: Selective Robust Training for Deepfake Audio Detection
Authors: Zirui Zhang, Wei Hao, Aroon Sankoh, William Lin, Emanuel Mendiola-Ortiz, Junfeng Yang, Chengzhi Mao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate that our training dataset boosts baseline model performance (without robust training) by 33%, and our robust training further improves accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples, over the state-of-the-art RawNet3 model. We first compare our approach to existing state-of-the-art methods across three benchmarks and demonstrate improved accuracy. We then assess its robustness against corruption and adversarial attacks. We finally conduct an ablation study on enhancements to the detection system's robustness. |
| Researcher Affiliation | Academia | Zirui Zhang1, Wei Hao1, Aroon Sankoh2, William Lin3, Emanuel Mendiola-Ortiz4, Junfeng Yang1, Chengzhi Mao5 1Columbia University, 2Washington University in St. Louis, 3New York University, 4Pennsylvania State University, 5Rutgers University EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Figure 10: Python code for RandAugment for audio |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the methodology described is publicly available or open-source. |
| Open Datasets | Yes | To address concerns regarding YouTube's terms of service and ethical considerations, we will not directly distribute any content sourced from YouTube. Instead, we will provide only metadata (video ID, start time, and end time), ensuring that researchers can access the content through YouTube's official interface in compliance with the platform's policies. Our training dataset now includes deepfake audio samples generated using the top seven TTS models: MetaVoice-1B (Liu et al., 2021), StyleTTS-v2 (Li et al., 2024), VoiceCraft (Peng et al., 2024), WhisperSpeech (Radford et al., 2023), Vokan TTS, XTTS-v2 (Casanova et al., 2024), and ElevenLabs. We use four datasets, VCTK (Yamagishi, 2012), LibriSpeech (Panayotov et al., 2015), In-The-Wild (Müller et al., 2022), and AudioSet (Gemmeke et al., 2017), to generate deepfake audio. For real audio, we utilize portions from six public audio datasets: VCTK, LibriSpeech, AudioSet, ASVspoof2019 (Todisco et al., 2019), VoxCeleb1 (Nagrani et al., 2017), and ASVspoof2021 (Liu et al., 2023), with half consisting of clean audio and the other half of noisy audio. |
| Dataset Splits | Yes | In this section, we introduce a new training dataset and a rigorous test set. In contrast to prior datasets, our dataset is large, diversified, realistic, and up-to-date, as shown in Table 1. Prior detectors show poor generalization capabilities in realistic settings, as shown in Figure 8. Both our training and testing datasets integrate the latest advancements in AI voice synthesis technologies. Additionally, the testing dataset includes several new models not covered in the training dataset, specifically designed to test the generalization ability of our detection systems. Our test dataset comprises approximately 6,000 samples, with an equal balance between real and fake audio. Table 1 (Comparison of Deepfake Audio Datasets, excerpt): Our Train — 690k/640k samples, English, Clean & Noisy, 2024, 40 sources (TTS, VC); Our Test — 3k/3k samples, English, Clean & Noisy, 2024, 15 sources (TTS, VC). For the experiments described in this paper, we specifically utilize seven fake sources not present in the training set. However, we also include samples from seven other fake sources used during training to facilitate future research. |
| Hardware Specification | Yes | Table 10: Training time comparison. Hardware Used: A100 GPU |
| Software Dependencies | No | The paper mentions an "Optimizer: adam" but does not provide any specific version numbers for software dependencies such as programming languages, libraries, or other tools. |
| Experiment Setup | Yes | A.8.1 TRAINING HYPERPARAMETER. Here are the training hyperparameters of F-SAT for Table 3. Training: learning rate (lr): 1e-5; batch size (bs): 16; optimizer: Adam; augmentation number (aug num): 1 or 2; augmentation probability (aug prob): 0.9; LR scheduler: warmup cosine (warm-up epochs: 1, warm-up LR: 1e-6, minimum LR: 1e-7). Attack: type: l∞; epsilon: 0.005; alpha: 0.002; gamma (controls ratio of clean loss to robust loss): 0.1; attack iterations: 2; restarts: 1; frequency range: 4–8 kHz. Mixup: alpha: 0.5 |
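The paper's Figure 10 reportedly gives Python code for RandAugment applied to audio, with an augmentation count (`aug_num`) of 1 or 2 and an application probability (`aug_prob`) of 0.9. A minimal sketch of that recipe is below; the specific augmentation operations (`add_noise`, `time_shift`, `change_volume`) and their parameters are illustrative assumptions, not the authors' exact pool:

```python
import random
import numpy as np

def add_noise(wave, strength=0.005):
    # Mix in low-amplitude Gaussian noise (illustrative augmentation).
    return wave + strength * np.random.randn(len(wave))

def time_shift(wave, max_frac=0.1):
    # Circularly shift the waveform by up to max_frac of its length.
    max_shift = int(len(wave) * max_frac)
    return np.roll(wave, random.randint(-max_shift, max_shift))

def change_volume(wave, low=0.5, high=1.5):
    # Rescale amplitude by a random gain.
    return wave * random.uniform(low, high)

AUGMENTATIONS = [add_noise, time_shift, change_volume]

def rand_augment_audio(wave, aug_num=2, aug_prob=0.9):
    # RandAugment-style policy: pick aug_num distinct operations at random
    # and apply each with probability aug_prob.
    for op in random.sample(AUGMENTATIONS, k=aug_num):
        if random.random() < aug_prob:
            wave = op(wave)
    return wave
```

The key RandAugment idea preserved here is that only two scalars (`aug_num`, `aug_prob`) control the whole policy, which matches the hyperparameters quoted in the Experiment Setup row.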
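The Experiment Setup row names a "Warmup Cosine" LR scheduler with warm-up epochs 1, warm-up LR 1e-6, base LR 1e-5, and minimum LR 1e-7. The exact schedule shape is not quoted, so the sketch below assumes the common form: linear warm-up from the warm-up LR to the base LR, then cosine decay down to the minimum LR:

```python
import math

def warmup_cosine_lr(epoch, total_epochs, base_lr=1e-5,
                     warmup_epochs=1, warmup_lr=1e-6, min_lr=1e-7):
    # Linear warm-up from warmup_lr to base_lr over warmup_epochs,
    # then cosine decay from base_lr down to min_lr.
    if epoch < warmup_epochs:
        return warmup_lr + (base_lr - warmup_lr) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With the quoted hyperparameters, the rate starts at 1e-6, reaches 1e-5 when warm-up ends, and decays to 1e-7 by the final epoch.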