Complex-Cycle-Consistent Diffusion Model for Monaural Speech Enhancement
Authors: Yi Li, Yang Sun, Plamen P Angelov
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on public datasets to demonstrate the effectiveness of our method, highlighting the significant benefits of exploiting the intrinsic relationship between phase and magnitude information to enhance speech. The comparison to conventional diffusion models demonstrates the superiority of SEDM. |
| Researcher Affiliation | Academia | Yi Li1, Yang Sun2, Plamen P Angelov1 1School of Computing and Communications, Lancaster University, UK 2Big Data Institute, University of Oxford, UK |
| Pseudocode | Yes | The pseudocode of the proposed CCC module is summarized as Algorithm 1. Algorithm 1: Proposed complex-cycle-consistent learning |
| Open Source Code | No | The paper does not contain any explicit statements about providing source code, nor does it include links to a code repository or mention code in supplementary materials. |
| Open Datasets | Yes | We extensively perform experiments on several public speech datasets, including IEEE (IEEE Audio and Electroacoustics Group 1969), TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) (Garofolo et al. 1993), VOICE BANK (VCTK) (Veaux, Yamagishi, and King 2013), and Deep Noise Suppression (DNS) challenge (Reddy et al. 2021). To generate noisy speech signals in training and test, we randomly collect and use 10 of 15 noise types ... from Diverse Environments Multichannel Acoustic Noise Database (DEMAND) (Thiemann, Ito, and Vincent 2013). |
| Dataset Splits | Yes | Evaluations on the IEEE and TIMIT Datasets: The first experiment is conducted on IEEE and TIMIT (IEEE Audio and Electroacoustics Group 1969; Garofolo et al. 1993). In the training and development stages, 600 recordings from 60 speakers and 60 recordings from 6 speakers are randomly selected in each dataset, respectively. ... We randomly generate 11572 noisy mixtures with 10 background noises at one of 4 SNR levels (15, 10, 5, and 0 dB) in the training stage. The test set with 2 speakers, unseen during training, consists of a total of 20 different noise conditions: 5 types of noise sourced from the DEMAND dataset at one of 4 SNRs each (17.5, 12.5, 7.5, and 2.5 dB). This yields 824 test items, with approximately 20 different sentences in each condition per test speaker. ... In the training stage, 75% of the clean speech clips are mixed with the background noise but without reverberation at a random SNR between -5 and 20 dB, as in (Hao et al. 2021). In the test stage, 150 noisy clips are randomly selected from the blind test dataset without reverberations. |
| Hardware Specification | Yes | All the experiments are run on Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions the Adam optimizer but does not list any software dependencies (e.g., programming languages, libraries, or frameworks) with version numbers. |
| Experiment Setup | Yes | Model Configuration: We set the number of diffusion blocks and channels as [N, C] = [30, 63], [40, 128], [50, 128] for the small, medium, and large SEDM models (SEDM-S, SEDM-M, SEDM-L), respectively. The number of reverse blocks equals the number of diffusion blocks, i.e., M = N. The kernel size of Bi-Dil Conv is 3, and the dilation is doubled at each layer within each block as [1, 2, 4, ..., 2^(n-1)]. Each LSTM in CCC consists of three hidden layers and 30 features in the hidden state. ... The proposed model is trained using the Adam optimizer with a weight decay of 0.0001, a momentum of 0.9, and a batch size of 64. We train the networks for 200 epochs, warming up the network in the first 20 epochs without the CCC losses. The initial learning rate is 0.03 and is multiplied by 0.1 at epochs 120 and 160. |
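The dataset-splits row describes mixing clean speech with DEMAND noise at fixed SNR levels. The paper does not release code, so the following is only a minimal NumPy sketch of the standard technique (the helper name `mix_at_snr` is hypothetical, not from the paper): scale the noise so that the clean-to-noise power ratio matches the target SNR, then add it to the clean signal.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, rng=None):
    """Hypothetical helper: add `noise` to `clean` at a target SNR (dB).

    The noise is tiled/cropped to the clean-signal length, then scaled
    so that 10*log10(P_clean / P_noise) equals `snr_db`.
    """
    rng = rng or np.random.default_rng()
    # Tile noise if it is shorter than the clean signal, then crop a
    # random segment of matching length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    start = rng.integers(0, len(noise) - len(clean) + 1)
    noise = noise[start:start + len(clean)]
    # Gain that yields the target clean/noise power ratio.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silence
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise
```

A training mixture at 0 dB would then be `mix_at_snr(clean, noise, 0.0)`, and the four training SNRs from the table (15, 10, 5, 0 dB) can be drawn uniformly per utterance.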
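The experiment-setup row fully specifies the learning-rate schedule and warm-up policy, so they can be sketched in plain Python. Both function names below are hypothetical illustrations of the stated schedule (initial LR 0.03, multiplied by 0.1 at epochs 120 and 160; CCC losses disabled for the first 20 epochs), not the authors' code.

```python
def learning_rate(epoch, base_lr=0.03, milestones=(120, 160), gamma=0.1):
    """Step schedule from the paper: start at 0.03, multiply by 0.1
    at each milestone epoch (120 and 160)."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

def use_ccc_loss(epoch, warmup_epochs=20):
    """Warm-up policy: the first 20 epochs train without the CCC losses."""
    return epoch >= warmup_epochs
```

In a framework such as PyTorch, the same schedule would typically be expressed with a multi-step LR scheduler attached to an Adam optimizer configured with weight decay 0.0001 and beta1 = 0.9 (the paper's "momentum of 0.9").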