Improving Generalization for AI-Synthesized Voice Detection
Authors: Hainan Ren, Li Lin, Chun-Hao Liu, Xin Wang, Shu Hu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on benchmarks show our approach outperforms state-of-the-art methods, achieving up to 5.12% improvement in the equal error rate metric in intra-domain and 7.59% in cross-domain evaluations. Our extensive experiments conducted on various prominent audio deepfake datasets demonstrate the effectiveness of our framework, which surpasses the performance of state-of-the-art methods in improving the generalization for cross-domain detection. |
| Researcher Affiliation | Collaboration | Hainan Ren\*, Li Lin¹, Chun-Hao Liu², Xin Wang³, Shu Hu¹ — ¹Purdue University, ²Amazon, ³University at Albany, SUNY. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The underlying idea is that perturbing the model in the direction of the gradient norm increases the loss value, thereby improving generalization. We optimize Eq. (3) using stochastic gradient descent, and the related algorithm is provided in the Appendix. |
| Open Source Code | Yes | Code: https://github.com/Purdue-M2/AI-Synthesized-Voice-Generalization |
| Open Datasets | Yes | To assess the generalization of our method, we tested it on various mainstream audio benchmarks, including LibriSeVoc (Sun et al. 2023), WaveFake (Frank and Schönherr 2021), ASVspoof 2019 (Lavrentyeva et al. 2019), and the audio segment of FakeAVCeleb (Khalid et al. 2021). |
| Dataset Splits | Yes | We divide the test sets into two categories: seen vocoders from the same domain and unseen vocoders for cross-domain evaluation, based on the vocoder categories present in the training set. More details of the dataset-vocoder partitions can be found in the Appendix. Ablation on the number of vocoders in the training set: to illustrate the impact of vocoder diversity in the training data on model generalization, we create subsets of the training data with different combinations of vocoder types, ranging from 1 to 6, sourced from LibriSeVoc. Trained models are evaluated in the same seen/unseen manner, and the aEERs and aEERu are reported. |
| Hardware Specification | Yes | Acknowledgments This work is supported by the U.S. National Science Foundation (NSF) under grant IIS-2434967 and the National Artificial Intelligence Research Resource (NAIRR) Pilot and TACC Lonestar6. |
| Software Dependencies | No | We employ the Adam (Kingma and Ba 2014) optimizer with a learning rate set to 0.0002 and a batch size of 16. |
| Experiment Setup | Yes | We employ the Adam (Kingma and Ba 2014) optimizer with a learning rate set to 0.0002 and a batch size of 16. Hyperparameters λ1, λ2, λ3, and λ4 are set to 0.1, 0.3, 0.05, and 0.03, respectively. The margin b in Lcon is set to 3. The γ in Eq. (2) is set to 0.07. We also use the original voice signal as input and apply the same data preprocessing as RawNet2 (Tak et al. 2021), padding all signals to the same size. |
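The Pseudocode row notes that the model is perturbed "in the direction of the gradient norm" to increase the loss before the update, with the exact algorithm deferred to the paper's Appendix. The sketch below is only an illustration of that general idea in the style of sharpness-aware minimization, applied to a toy quadratic loss; the loss function, the perturbation radius `rho`, and the use of plain SGD on a two-parameter vector are all illustrative assumptions, not details taken from the paper (which uses Adam with lr = 0.0002 on a full network).

```python
import math

def loss(w):
    # Toy quadratic loss 0.5 * ||w||^2 (illustrative stand-in for the paper's objective)
    return 0.5 * sum(x * x for x in w)

def grad(w):
    # Gradient of 0.5 * ||w||^2 is w itself
    return list(w)

def perturbed_sgd_step(w, lr=2e-4, rho=0.05):
    """One SGD step using the gradient at a loss-increasing perturbed point.

    rho is an assumed perturbation radius; the paper's actual algorithm
    is given in its Appendix.
    """
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g)) + 1e-12
    # Step along the normalized gradient, i.e. toward higher loss
    w_perturbed = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descend using the gradient evaluated at the perturbed weights
    g_perturbed = grad(w_perturbed)
    return [wi - lr * gi for wi, gi in zip(w, g_perturbed)]

w = [1.0, -2.0]
w_new = perturbed_sgd_step(w)
```

Because the update direction is computed at the worst-case nearby point, repeated steps bias the optimizer toward flatter minima, which is the generalization argument the quoted passage makes.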