Multi-View Collaborative Learning Network for Speech Deepfake Detection
Authors: Kuiyuan Zhang, Zhongyun Hua, Rushi Lan, Yifang Guo, Yushu Zhang, Guoai Xu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on four benchmark deepfake speech detection datasets, and the experimental results demonstrate that our method can achieve better detection performance than current state-of-the-art detection methods. We further validate the effectiveness of our approach through comprehensive ablation studies. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China 2 School of Computer Science and Information Security, Guilin University of Electronic Technology, China 3 Alibaba Group, China 4 School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics, China |
| Pseudocode | No | The paper describes the methodology using text and network diagrams (Figure 2), but no structured pseudocode or algorithm blocks are provided. |
| Open Source Code | No | The paper states: "We train and test all baselines mentioned above, using their publicly available code or building the models exactly as described in their paper if no publicly available code." This refers to baseline models, not the authors' own method. There is no explicit statement about releasing the code for the proposed method. |
| Open Datasets | Yes | We evaluate our proposed method using deepfake speech datasets: ASVspoof2019 logical access (LA) (Wang et al. 2020) dataset, MLAAD (Müller et al. 2024) dataset, ASVspoof2021 LA dataset and ASVspoof2021 deepfake (DF) dataset (Liu et al. 2023). |
| Dataset Splits | Yes | Table 1 and Table 2 list the number of used synthesizer methods and the number of real and fake samples of these datasets. In the ASVspoof2021 DF dataset, the synthesizer methods are divided into five categories: neural vocoder autoregressive (AR), neural vocoder non-autoregressive (NAR), traditional vocoder (TRD), unknown (UNK), and waveform concatenation (CONC). Considering that there are too many fake samples in the testing split, we use all the real samples but only partial fake samples for evaluation. Specifically, the number of fake samples per category is equal to the number of bonafide samples in the testing subset we used. |
| Hardware Specification | Yes | All experiments are conducted using the PyTorch framework on a computer equipped with a GTX 4090 GPU device. |
| Software Dependencies | No | All experiments are conducted using the PyTorch framework on a computer equipped with a GTX 4090 GPU device. This mentions only PyTorch, with no version number and no other libraries or versions listed. |
| Experiment Setup | Yes | During training, the batch size is set to 64, and we use the Adam optimizer with a weight decay of 0.01. The learning rate of the parameters of the classifiers is set to 1e-4, while that of other parameters is set to 5e-4. To enhance feature diversity, we implement the following data augmentation strategies for all detection methods: randomly adding noise with signal-to-noise ratio (SNR) ranging from 10 to 120 dB and randomly applying pitch shifts. We employ the early stopping strategy to halt model training when there is no further improvement in the area under the ROC Curve (AUC) performance within three epochs for all detection methods. |
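The experiment-setup row above describes the paper's training configuration concretely enough to sketch in code. The following is a minimal, hypothetical PyTorch sketch (not the authors' released code, which is not available): `TinyDetector` is a stand-in model used only to show the two parameter groups at different learning rates, the Adam weight decay of 0.01, SNR-based noise augmentation over 10-120 dB, and early stopping with a three-epoch patience on a validation metric.

```python
import random
import torch

class TinyDetector(torch.nn.Module):
    """Stand-in for the paper's network: a backbone plus a classifier head."""
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(16, 8)
        self.classifier = torch.nn.Linear(8, 2)

    def forward(self, x):
        return self.classifier(torch.relu(self.backbone(x)))

model = TinyDetector()

# Two parameter groups, per the reported setup: classifier params at 1e-4,
# all other parameters at 5e-4, Adam with weight decay 0.01.
optimizer = torch.optim.Adam(
    [
        {"params": model.classifier.parameters(), "lr": 1e-4},
        {"params": model.backbone.parameters(), "lr": 5e-4},
    ],
    weight_decay=0.01,
)

def add_noise(wave: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Add white noise scaled to a target SNR in dB (the augmentation step)."""
    sig_power = wave.pow(2).mean()
    noise = torch.randn_like(wave)
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return wave + scale * noise

# Early stopping: halt when validation AUC fails to improve for 3 epochs.
best_auc, patience, bad_epochs = 0.0, 3, 0
for epoch in range(100):
    x = torch.randn(64, 16)                 # batch size 64, dummy features
    x = add_noise(x, random.uniform(10, 120))
    loss = torch.nn.functional.cross_entropy(
        model(x), torch.randint(0, 2, (64,))
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    val_auc = 0.5  # placeholder; compute the real validation AUC here
    if val_auc > best_auc:
        best_auc, bad_epochs = val_auc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```

The pitch-shift augmentation is omitted here since the paper does not specify its range; in practice it could be added alongside `add_noise` with a library such as torchaudio.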