Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards generalizing deep-audio fake detection networks

Authors: Konstantin Gasenzer, Moritz Wolter

TMLR 2024

Reproducibility assessment: for each variable, the classified result and the supporting LLM response are listed below.
Research Type: Experimental
LLM Response: Our study reveals stable frequency-domain artifacts for many modern speech synthesis networks. We visualize generator artifacts for all generators in the WaveFake dataset (Frank & Schönherr, 2021) and for the Avocodo (Bak et al., 2022) and BigVGAN (Lee et al., 2023a) networks. We reproduce and improve upon synthetic-media-recognition results published for the WaveFake dataset (Frank & Schönherr, 2021). ... We ran our experiments on a four-node compute cluster with two AMD EPYC 7402 2.8 GHz host CPUs and four NVIDIA A100 Tensor Core graphics cards per node, with 40 GB of memory each. All experiments require four GPUs. Our experimental work builds upon PyTorch (Paszke et al., 2019) and the PyTorch Wavelet Toolbox (Moritz Wolter, 2021).
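As context for the artifact visualizations this response describes, frequency-domain artifacts of this kind are commonly exposed by averaging log-magnitude spectra over short frames. Below is a minimal NumPy sketch of that averaging; the frame length, hop size, sample rate, and synthetic test tone are illustrative assumptions, not values from the paper:

```python
import numpy as np

def mean_log_spectrum(audio, frame_len=512, hop=256):
    """Average log-magnitude spectrum over windowed frames.

    Stable generator artifacts tend to appear as persistent peaks or
    notches in this per-frequency average. frame_len and hop are
    illustrative choices, not values from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([
        audio[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    mags = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(mags + 1e-8).mean(axis=0)  # shape: (frame_len // 2 + 1,)

# Synthetic stand-in for real speech: a 440 Hz tone plus noise,
# one second at an assumed 16 kHz sample rate.
rng = np.random.default_rng(0)
sr = 16_000
audio = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
audio += 0.01 * rng.standard_normal(sr)
spectrum = mean_log_spectrum(audio)
print(spectrum.shape)  # (257,)
```

Comparing such averages between real audio and each generator's output is one simple way to make the per-generator fingerprints visible.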
Researcher Affiliation: Academia
LLM Response: Konstantin Gasenzer, Moritz Wolter (EMAIL), High Performance Computing and Analytics Lab, University of Bonn, Germany
Pseudocode: No
LLM Response: The paper includes architectural diagrams (Figures 3 and 21) and mathematical equations for its transformations, but no structured pseudocode or algorithm blocks are explicitly labeled or presented in a code-like format.
Open Source Code: Yes
LLM Response: Project source code and the dataset extension are available online: https://github.com/gan-police/audiodeepfake-detection, https://zenodo.org/records/10512541
Open Datasets: Yes
LLM Response: Project source code and the dataset extension are available online: https://github.com/gan-police/audiodeepfake-detection, https://zenodo.org/records/10512541 ... We consider the WaveFake dataset (Frank & Schönherr, 2021) with all its generators, including the conformer-TTS samples and the Japanese-language examples from the JSUT dataset. We extend the WaveFake dataset by adding samples drawn from the Avocodo (Bak et al., 2022) and BigVGAN (Lee et al., 2023a) architectures.
Dataset Splits: No
LLM Response: The paper states: "All samples are cut into one-second segments. All sets contain an equal amount of real and fake samples." and "Test accuracies and average Equal Error Rates (aEERs) are computed for test samples from all eight generators and the original audio, where we measure our detector's ability to separate real and fake." It also mentions: "Our models are trained exclusively on samples drawn from a full-band MelGAN." While it describes how data was prepared and used, it does not provide specific split percentages, sample counts for train/validation/test, or a citation to a standard split methodology.
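The quoted one-second segmentation can be sketched as follows. The 22,050 Hz sample rate is an assumed value for illustration (the response does not quote one), and trailing partial segments are simply dropped:

```python
import numpy as np

def cut_into_segments(audio, sample_rate=22_050):
    """Cut a waveform into non-overlapping one-second segments.

    sample_rate is an assumed value for illustration; any trailing
    samples that do not fill a full second are discarded.
    """
    n_segments = len(audio) // sample_rate
    trimmed = audio[: n_segments * sample_rate]
    return trimmed.reshape(n_segments, sample_rate)

# A 3.5-second dummy waveform yields three one-second segments.
audio = np.zeros(int(3.5 * 22_050), dtype=np.float32)
segments = cut_into_segments(audio)
print(segments.shape)  # (3, 22050)
```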
Hardware Specification: Yes
LLM Response: We ran our experiments on a four-node compute cluster with two AMD EPYC 7402 2.8 GHz host CPUs and four NVIDIA A100 Tensor Core graphics cards per node, with 40 GB of memory each.
Software Dependencies: Yes
LLM Response: Our experimental work builds upon PyTorch (Paszke et al., 2019) and the PyTorch Wavelet Toolbox (Moritz Wolter, 2021). ... Due to a lack of pre-trained weights, we retrained Avocodo using the publicly available implementation from Bak et al. (2023), commit 2999557.
Experiment Setup: Yes
LLM Response: Adam (Kingma & Ba, 2015) optimizes almost all networks with a learning rate of 0.0004. We follow Gong et al. (2021) and set the step size to 0.00004 for the AST. Each training step used 128 audio samples per batch. Finally, we employ weight decay and dropout. The L2 penalty is set to 0.001 unless stated otherwise. ... We trained for 346 epochs, or 563,528 steps. Hyperparameters were chosen according to Bak et al. (2022), with a learning rate of 0.0002 for the discriminator and generator.
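For context, a generic Adam update (Kingma & Ba, 2015) with the quoted learning rate of 0.0004 and L2 penalty of 0.001 can be sketched in NumPy. This illustrates the optimizer's mechanics, not code from the paper; the beta and epsilon values are Adam's standard defaults:

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=4e-4, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=1e-3):
    """One Adam update with an L2 penalty folded into the gradient.

    lr and weight_decay match the values quoted above; betas and eps
    are Adam's standard defaults, not values stated in the paper.
    """
    grads = grads + weight_decay * params      # L2 penalty
    m = beta1 * m + (1 - beta1) * grads        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grads ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Sanity check: minimize f(x) = ||x||^2, whose gradient is 2x.
x = np.array([1.0, -2.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
for t in range(1, 10_001):
    x, m, v = adam_step(x, 2 * x, m, v, t)
print(float(np.linalg.norm(x)))
```

With a learning rate this small, each coordinate moves roughly lr per step while the gradient sign is stable, which is why many thousands of steps are needed even on this toy quadratic.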