Multi-View Collaborative Learning Network for Speech Deepfake Detection

Authors: Kuiyuan Zhang, Zhongyun Hua, Rushi Lan, Yifang Guo, Yushu Zhang, Guoai Xu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive experiments on four benchmark deepfake speech detection datasets, and the experimental results demonstrate that our method can achieve better detection performance than current state-of-the-art detection methods. We further validate the effectiveness of our approach through comprehensive ablation studies."
Researcher Affiliation | Collaboration | 1. School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; 2. School of Computer Science and Information Security, Guilin University of Electronic Technology, China; 3. Alibaba Group, China; 4. School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics, China
Pseudocode | No | The paper describes the methodology using text and network diagrams (Figure 2), but no structured pseudocode or algorithm blocks are provided.
Open Source Code | No | The paper states: "We train and test all baselines mentioned above, using their publicly available code or building the models exactly as described in their paper if no publicly available code." This refers to baseline models, not the authors' own method. There is no explicit statement about releasing the code for the proposed method.
Open Datasets | Yes | "We evaluate our proposed method using deepfake speech datasets: ASVspoof2019 logical access (LA) (Wang et al. 2020) dataset, MLAAD (Müller et al. 2024) dataset, ASVspoof2021 LA dataset and ASVspoof2021 deepfake (DF) dataset (Liu et al. 2023)."
Dataset Splits | Yes | Table 1 and Table 2 list the number of used synthesizer methods and the number of real and fake samples of these datasets. In the ASVspoof2021 DF dataset, the synthesizer methods are divided into five categories: neural vocoder autoregressive (AR), neural vocoder non-autoregressive (NAR), traditional vocoder (TRD), unknown (UNK), and waveform concatenation (CONC). Considering that there are too many fake samples in the testing split, the authors use all the real samples but only a subset of the fake samples for evaluation. Specifically, the number of fake samples per category is equal to the number of bonafide samples in the testing subset they used.
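The per-category balancing described above can be sketched as follows. This is a minimal illustration, not the paper's code: the function name, the ID-list data layout, and the fixed seed are all assumptions introduced here.

```python
import random

def balanced_eval_subset(real_ids, fake_ids_by_category, seed=0):
    """Illustrative sketch: keep all real (bonafide) samples, and for each
    synthesizer category subsample as many fake utterances as there are
    bonafide samples, mirroring the ASVspoof2021 DF evaluation subset
    construction described in the paper."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    n_real = len(real_ids)
    subset = {"real": list(real_ids), "fake": {}}
    for category, ids in fake_ids_by_category.items():
        # A category may contain fewer fakes than bonafides; cap at its size.
        k = min(n_real, len(ids))
        subset["fake"][category] = rng.sample(ids, k)
    return subset
```

Sampling without replacement per category keeps the class balance exact while preserving every bonafide sample, which is the stated goal of the subset.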
Hardware Specification | Yes | "All experiments are conducted using the PyTorch framework on a computer equipped with a GTX 4090 GPU device."
Software Dependencies | No | The paper states only that "All experiments are conducted using the PyTorch framework on a computer equipped with a GTX 4090 GPU device." This mentions PyTorch but gives no version number, nor any other libraries with versions.
Experiment Setup | Yes | During training, the batch size is set to 64, and the Adam optimizer is used with a weight decay of 0.01. The learning rate for the classifier parameters is set to 1e-4, while that of the other parameters is set to 5e-4. To enhance feature diversity, the following data augmentation strategies are applied to all detection methods: randomly adding noise with a signal-to-noise ratio (SNR) ranging from 10 to 120 dB, and randomly applying pitch shifts. An early stopping strategy halts model training when there is no further improvement in the area under the ROC curve (AUC) within three epochs, for all detection methods.
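The early stopping criterion above (halt when validation AUC does not improve for three epochs) can be sketched as a small stateful helper. This is an illustrative sketch, not the authors' implementation; the class name and method names are assumptions.

```python
class EarlyStopping:
    """Illustrative sketch: stop training once the validation AUC has
    failed to improve for `patience` consecutive epochs (three in the
    setup described by the paper)."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_auc = float("-inf")
        self.epochs_without_improvement = 0

    def step(self, auc):
        """Record one epoch's validation AUC; return True to stop."""
        if auc > self.best_auc:
            self.best_auc = auc
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

In a training loop, `step` would be called once per epoch after evaluation, and the loop would break when it returns True (typically after restoring the checkpoint saved at the best AUC).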