Complex-Cycle-Consistent Diffusion Model for Monaural Speech Enhancement

Authors: Yi Li, Yang Sun, Plamen P Angelov

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on public datasets to demonstrate the effectiveness of our method, highlighting the significant benefits of exploiting the intrinsic relationship between phase and magnitude information to enhance speech. The comparison to conventional diffusion models demonstrates the superiority of SEDM. We extensively perform experiments on several public speech datasets, including IEEE (IEEE Audio and Electroacoustics Group 1969), TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) (Garofolo et al. 1993), VOICE BANK (VCTK) (Veaux, Yamagishi, and King 2013), and Deep Noise Suppression (DNS) challenge (Reddy et al. 2021).
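The magnitude-phase relationship the paper exploits rests on the standard decomposition of a complex STFT bin; a minimal sketch of that generic arithmetic (illustrative helper names, not the authors' code):

```python
import cmath

def split_bin(x: complex):
    """Decompose one complex STFT bin into magnitude and phase."""
    return abs(x), cmath.phase(x)

def merge_bin(mag: float, phase: float) -> complex:
    """Reconstruct the complex bin; merge_bin(*split_bin(x)) recovers x."""
    return mag * cmath.exp(1j * phase)
```

Enhancing magnitude alone (as many conventional systems do) discards the phase half of this decomposition, which is the coupling the proposed complex-cycle-consistent learning targets.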
Researcher Affiliation | Academia | Yi Li¹, Yang Sun², Plamen P Angelov¹ — ¹School of Computing and Communications, Lancaster University, UK; ²Big Data Institute, University of Oxford, UK
Pseudocode | Yes | The pseudocode of the proposed CCC module is summarized as Algorithm 1. (Algorithm 1: Proposed complex-cycle-consistent learning)
Open Source Code | No | The paper does not contain any explicit statements about providing source code, nor does it include links to a code repository or mention code in supplementary materials.
Open Datasets | Yes | We extensively perform experiments on several public speech datasets, including IEEE (IEEE Audio and Electroacoustics Group 1969), TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) (Garofolo et al. 1993), VOICE BANK (VCTK) (Veaux, Yamagishi, and King 2013), and Deep Noise Suppression (DNS) challenge (Reddy et al. 2021). To generate noisy speech signals in training and test, we randomly collect and use 10 of 15 noise types ... from Diverse Environments Multichannel Acoustic Noise Database (DEMAND) (Thiemann, Ito, and Vincent 2013).
Dataset Splits | Yes | Evaluations on the IEEE and TIMIT Datasets: The first experiment is conducted on IEEE and TIMIT (IEEE Audio and Electroacoustics Group 1969; Garofolo et al. 1993). In the training and development stages, 600 recordings from 60 speakers and 60 recordings from 6 speakers are randomly selected in each dataset, respectively. ... We randomly generate 11572 noisy mixtures with 10 background noises at one of 4 SNR levels (15, 10, 5, and 0 dB) in the training stage. The test set with 2 speakers, unseen during training, consists of a total of 20 different noise conditions: 5 types of noise sourced from the DEMAND dataset at one of 4 SNRs each (17.5, 12.5, 7.5, and 2.5 dB). This yields 824 test items, with approximately 20 different sentences in each condition per test speaker. ... In the training stage, 75% of the clean speeches are mixed with the background noise but without reverberation at a random SNR in between -5 and 20 dB as (Hao et al. 2021). In the test stage, 150 noisy clips are randomly selected from the blind test dataset without reverberations.
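Generating a noisy mixture at a chosen SNR, as described above, amounts to scaling the noise clip before adding it to the clean speech. A minimal sketch of that standard procedure (`mix_at_snr` is a hypothetical helper name, not the authors' code):

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean-to-noise power ratio equals `snr_db`, then add.

    Solves p_clean / (g**2 * p_noise) == 10**(snr_db / 10) for the gain g.
    """
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    g = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + g * n for c, n in zip(clean, noise)]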
Hardware Specification | Yes | All the experiments are run on Tesla V100 GPUs.
Software Dependencies | No | The paper mentions 'Adam optimizer' but does not specify its version, nor does it list other software dependencies (e.g., programming languages, libraries, or frameworks) with version numbers.
Experiment Setup | Yes | Model Configuration: We set the number of diffusion blocks and channels [N, C] as [30, 63], [40, 128], [50, 128] for the small, medium, and large SEDM models (SEDM-S, SEDM-M, SEDM-L), respectively. The number of reverse blocks is equal to the number of diffusion blocks, i.e., M = N. The kernel size of Bi-Dil Conv is 3, and the dilation is doubled at each layer within each block as [1, 2, 4, ..., 2^(n-1)]. Each LSTM in CCC consists of three hidden layers and 30 features in the hidden state. ... The proposed model is trained by using the Adam optimizer with a weight decay of 0.0001, a momentum of 0.9, and a batch size of 64. We train the networks for 200 epochs, warming up the network in the first 20 epochs by training without the CCC losses. The initial learning rate is 0.03, and is multiplied by 0.1 at 120 and 160 epochs.
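The stated step schedule (initial learning rate 0.03, multiplied by 0.1 at epochs 120 and 160) and the per-block dilation doubling can be sketched as follows (hypothetical helper names, assumed from the quoted configuration rather than released code):

```python
def lr_at_epoch(epoch, base_lr=0.03, milestones=(120, 160), gamma=0.1):
    """Step schedule: multiply the learning rate by `gamma` at each milestone."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

def block_dilations(n_layers):
    """Dilation doubled at each layer within a block: [1, 2, 4, ..., 2^(n-1)]."""
    return [2 ** i for i in range(n_layers)]
```

In a PyTorch setup the same schedule would typically be expressed with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[120, 160], gamma=0.1)`, with the paper's "momentum of 0.9" mapping to Adam's first moment coefficient.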