Speech Synthesis By Unrolling Diffusion Process using Neural Network Layers
Authors: Peter Ochieng
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on single- and multi-speaker datasets demonstrate that UDPNet consistently outperforms state-of-the-art methods in both quality and efficiency, while generalizing effectively to unseen speech. These results position UDPNet as a robust solution for real-time speech synthesis applications. |
| Researcher Affiliation | Academia | Peter Ochieng, po304@cam.ac.uk, Department of Computer Science and Technology, University of Cambridge |
| Pseudocode | Yes | Algorithm 1: Training Algorithm with τ, T, x₀, Codebook Z. Algorithm 2: Sampling Algorithm with τ, x_t, x_θ^{l=t}(·), and T. |
| Open Source Code | No | The paper mentions 'Sample audio is available at https://onexpeters.github.io/UDPNet/' and provides links to baseline models' code. However, it does not provide an explicit statement or a direct link to the open-source code for UDPNet itself. |
| Open Datasets | Yes | To ensure comparability with existing tools and maintain alignment with trends in the speech synthesis domain, we evaluated UDPNet on two popular datasets: LJSpeech for single-speaker speech generation and VCTK for multi-speaker evaluation. |
| Dataset Splits | Yes | The LJSpeech dataset consists of 13,100 audio clips sampled at 22 kHz, totaling approximately 24 hours of single-speaker audio. ... Following (Chen et al., 2020), we used 12,764 utterances (23 hours) for training and 130 utterances for testing. For multi-speaker evaluation, we used the VCTK dataset, which includes recordings of 109 English speakers with diverse accents, originally sampled at 48 kHz and downsampled to 22 kHz for consistency. Following (Lam et al., 2022), we used a split where 100 speakers were used for training and 9 speakers were held out for evaluation. |
| Hardware Specification | Yes | UDPNet was trained on a single NVIDIA V100 GPU using the Adam optimizer. |
| Software Dependencies | No | The paper mentions software components and methods like 'Adam optimizer', 'cyclical learning rate (Smith, 2017)', and 'Tacotron 2 (Shen et al., 2018)', but it does not specify any version numbers for programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | UDPNet was trained on a single NVIDIA V100 GPU using the Adam optimizer. A cyclical learning rate (Smith, 2017) was employed, with the learning rate varying between 1e-4 and 1e-1. The batch size was set to 32, and training was performed over 1 million steps, taking approximately 8 days to complete. For conditional speech generation, Mel-spectrograms extracted from ground truth audio were used as conditioning features during training. During testing, Mel-spectrograms were generated by Tacotron 2 (Shen et al., 2018). UDPNet was evaluated using different forward diffusion steps (fsteps) while maintaining a fixed number of 8 reverse steps. The forward steps considered were 1200, 1000, 960, 720, and 240, corresponding to skip parameters τ = {150, 125, 120, 90, 30}, respectively. The forward noise schedule αᵢ was defined as a linear progression across all steps: αᵢ = Linear(α₁, α_N, N), where N represents the total number of forward steps. For example, with 1200 forward steps, the schedule was specified as Linear(1×10⁻⁴, 0.005, 1200). Each layer's contribution to the total loss L_{t−1} (Equation 20) was weighted using a layer-specific factor λ. The weights were initialized at λ = 0.001 for the first layer and incremented by 0.001 for each subsequent layer. |
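The experiment-setup excerpt fully specifies three small computations: the linear forward noise schedule, the relation between the skip parameter τ and the forward step counts (8 fixed reverse steps), and the layer-specific loss weights. A minimal NumPy sketch of those pieces, assuming plain linear interpolation; the function names `linear_schedule` and `layer_weights` are illustrative, not from the paper:

```python
import numpy as np

def linear_schedule(alpha_1: float, alpha_n: float, n_steps: int) -> np.ndarray:
    """Forward noise schedule alpha_i = Linear(alpha_1, alpha_N, N)."""
    return np.linspace(alpha_1, alpha_n, n_steps)

def layer_weights(n_layers: int, start: float = 0.001, step: float = 0.001) -> np.ndarray:
    """Layer-loss weights: 0.001 for the first layer, +0.001 per subsequent layer."""
    return start + step * np.arange(n_layers)

# Example from the paper: 1200 forward steps, schedule Linear(1e-4, 0.005, 1200).
alphas = linear_schedule(1e-4, 0.005, 1200)

# With 8 fixed reverse steps, each skip parameter tau corresponds to
# N = 8 * tau forward steps, matching the reported step counts.
forward_steps = {tau: 8 * tau for tau in (150, 125, 120, 90, 30)}
# e.g. tau = 150 -> 1200 forward steps; tau = 30 -> 240 forward steps
```

The mapping reproduces the reported pairs (1200, 1000, 960, 720, 240) exactly, which is consistent with the reverse process visiting every τ-th forward step.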