Speech Synthesis By Unrolling Diffusion Process using Neural Network Layers

Authors: Peter Ochieng

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations on single- and multi-speaker datasets demonstrate that UDPNet consistently outperforms state-of-the-art methods in both quality and efficiency, while generalizing effectively to unseen speech. These results position UDPNet as a robust solution for real-time speech synthesis applications.
Researcher Affiliation | Academia | Peter Ochieng, po304@cam.ac.uk, Department of Computer Science and Technology, University of Cambridge
Pseudocode | Yes | Algorithm 1: Training Algorithm with τ, T, x_0, and Codebook Z. Algorithm 2: Sampling Algorithm with τ, x_t, x_θ^{l=t}(·), and T.
Open Source Code | No | The paper mentions 'Sample audio is available at https://onexpeters.github.io/UDPNet/' and provides links to baseline models' code. However, it does not provide an explicit statement or a direct link to the open-source code for UDPNet itself.
Open Datasets | Yes | To ensure comparability with existing tools and maintain alignment with trends in the speech synthesis domain, we evaluated UDPNet on two popular datasets: LJSpeech for single-speaker speech generation and VCTK for multi-speaker evaluation.
Dataset Splits | Yes | The LJSpeech dataset consists of 13,100 audio clips sampled at 22 kHz, totaling approximately 24 hours of single-speaker audio. ... Following (Chen et al., 2020), we used 12,764 utterances (23 hours) for training and 130 utterances for testing. For multi-speaker evaluation, we used the VCTK dataset, which includes recordings of 109 English speakers with diverse accents, originally sampled at 48 kHz and downsampled to 22 kHz for consistency. Following (Lam et al., 2022), we used a split where 100 speakers were used for training and 9 speakers were held out for evaluation.
Hardware Specification | Yes | UDPNet was trained on a single NVIDIA V100 GPU using the Adam optimizer.
Software Dependencies | No | The paper mentions software components and methods like 'Adam optimizer', 'cyclical learning rate (Smith, 2017)', and 'Tacotron 2 (Shen et al., 2018)', but it does not specify any version numbers for programming languages, libraries, or frameworks used for implementation.
Experiment Setup | Yes | UDPNet was trained on a single NVIDIA V100 GPU using the Adam optimizer. A cyclical learning rate (Smith, 2017) was employed, with the learning rate varying between 1e-4 and 1e-1. The batch size was set to 32, and training was performed over 1 million steps, taking approximately 8 days to complete. For conditional speech generation, Mel-spectrograms extracted from ground truth audio were used as conditioning features during training. During testing, Mel-spectrograms were generated by Tacotron 2 (Shen et al., 2018). UDPNet was evaluated using different forward diffusion steps (fsteps) while maintaining a fixed number of 8 reverse steps. The forward steps considered were 1200, 1000, 960, 720, and 240, corresponding to skip parameters τ = {150, 125, 120, 90, 30}, respectively. The forward noise schedule α_i was defined as a linear progression across all steps: α_i = Linear(α_1, α_N, N), where N represents the total number of forward steps. For example, with 1200 forward steps, the schedule was specified as Linear(1×10⁻⁴, 0.005, 1200). Each layer's contribution to the total loss L_{t-1} (Equation 20) was weighted using a layer-specific factor λ. The weights were initialized at λ = 0.001 for the first layer and incremented by 0.001 for each subsequent layer.
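The quoted setup can be made concrete with a small sketch. This is not the authors' code; the function names (linear_schedule, layer_weights) are illustrative, and it only reproduces the three numeric recipes stated above: the linear forward noise schedule Linear(α_1, α_N, N), the relationship between forward steps and the skip parameter τ (each reported pair yields the fixed 8 reverse steps), and the per-layer loss weights λ starting at 0.001 and growing by 0.001 per layer.

```python
import numpy as np

def linear_schedule(alpha_1, alpha_n, n_steps):
    # Linear(alpha_1, alpha_N, N): evenly spaced noise levels over all forward steps
    return np.linspace(alpha_1, alpha_n, n_steps)

# Example from the paper: 1200 forward steps with Linear(1e-4, 0.005, 1200)
alphas = linear_schedule(1e-4, 0.005, 1200)

# The skip parameter tau maps each forward-step setting onto 8 reverse steps,
# e.g. 1200 / 150 = 8; the same holds for every reported (fsteps, tau) pair.
for fsteps, tau in [(1200, 150), (1000, 125), (960, 120), (720, 90), (240, 30)]:
    assert fsteps // tau == 8

def layer_weights(n_layers, start=0.001, step=0.001):
    # Layer-specific loss factors: 0.001 for the first layer,
    # incremented by 0.001 for each subsequent layer.
    return [start + i * step for i in range(n_layers)]

# With 8 reverse steps (assuming one layer per step):
weights = layer_weights(8)  # approximately [0.001, 0.002, ..., 0.008]
```

The total loss would then be a λ-weighted sum of per-layer terms, though the paper's Equation 20 itself is not reproduced here.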