Improving Generalization for AI-Synthesized Voice Detection
Authors: Hainan Ren, Li Lin, Chun-Hao Liu, Xin Wang, Shu Hu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on benchmarks show our approach outperforms state-of-the-art methods, achieving up to 5.12% improvement in the equal error rate metric in intra-domain and 7.59% in cross-domain evaluations. Our extensive experiments conducted on various prominent audio deepfake datasets demonstrate the effectiveness of our framework, which surpasses the performance of state-of-the-art methods in improving the generalization for cross-domain detection. |
| Researcher Affiliation | Collaboration | Hainan Ren\*, Li Lin¹, Chun-Hao Liu², Xin Wang³, Shu Hu¹ — ¹Purdue University, ²Amazon, ³University at Albany, SUNY. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The underlying idea is that perturbing the model in the direction of the gradient norm increases the loss value, thereby improving generalization. We optimize Eq. (3) using stochastic gradient descent, and the related algorithm is provided in the Appendix. |
| Open Source Code | Yes | Code: https://github.com/Purdue-M2/AI-Synthesized-Voice-Generalization |
| Open Datasets | Yes | To assess the generalization of our method, we tested it on various mainstream audio benchmarks, including LibriSeVoc (Sun et al. 2023), WaveFake (Frank and Schönherr 2021), ASVspoof 2019 (Lavrentyeva et al. 2019), and the audio segment of FakeAVCeleb (Khalid et al. 2021). |
| Dataset Splits | Yes | We divide the test sets into two categories: seen vocoders from the same domain and unseen vocoders for cross-domain evaluation, based on the vocoder categories present in the training set. More details of the dataset-vocoder partitions can be found in the Appendix. Ablation on the number of vocoders in the training set: to illustrate the impact of vocoder diversity in the training data on model generalization, we create subsets of the training data with different combinations of vocoder types, ranging from 1 to 6, sourced from LibriSeVoc. Trained models are evaluated in the same seen/unseen manner, and the aEERs and aEERu are reported. |
| Hardware Specification | Yes | Acknowledgments This work is supported by the U.S. National Science Foundation (NSF) under grant IIS-2434967 and the National Artificial Intelligence Research Resource (NAIRR) Pilot and TACC Lonestar6. |
| Software Dependencies | No | We employ the Adam (Kingma and Ba 2014) optimizer with a learning rate set to 0.0002 and a batch size of 16. |
| Experiment Setup | Yes | We employ the Adam (Kingma and Ba 2014) optimizer with a learning rate set to 0.0002 and a batch size of 16. Hyperparameters λ1, λ2, λ3, and λ4 are set to 0.1, 0.3, 0.05, and 0.03, respectively. The margin b in Lcon is set to 3. The γ in Eq. (2) is set to 0.07. We also use the original voice signal as input and apply the same data preprocessing as RawNet2 (Tak et al. 2021), padding all signals to the same size. |
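The Pseudocode row notes that the model is perturbed "in the direction of the gradient norm" to increase the loss before the update, with the exact algorithm deferred to the paper's Appendix. The sketch below is only an illustration of that general idea in the style of sharpness-aware minimization, applied to a toy quadratic loss; the loss function, the perturbation radius `rho`, and the use of plain SGD on a two-parameter vector are all illustrative assumptions, not details taken from the paper (which uses Adam with lr = 0.0002 on a full network).

```python
import math

def loss(w):
    # Toy quadratic loss 0.5 * ||w||^2 (illustrative stand-in for the paper's objective)
    return 0.5 * sum(x * x for x in w)

def grad(w):
    # Gradient of 0.5 * ||w||^2 is w itself
    return list(w)

def perturbed_sgd_step(w, lr=2e-4, rho=0.05):
    """One SGD step using the gradient at a loss-increasing perturbed point.

    rho is an assumed perturbation radius; the paper's actual algorithm
    is given in its Appendix.
    """
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g)) + 1e-12
    # Step along the normalized gradient, i.e. toward higher loss
    w_perturbed = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descend using the gradient evaluated at the perturbed weights
    g_perturbed = grad(w_perturbed)
    return [wi - lr * gi for wi, gi in zip(w, g_perturbed)]

w = [1.0, -2.0]
w_new = perturbed_sgd_step(w)
```

Because the update direction is computed at the worst-case nearby point, repeated steps bias the optimizer toward flatter minima, which is the generalization argument the quoted passage makes.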