DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors
Authors: Keon Lee, Dong Won Kim, Jaehyeon Kim, Seungjun Chung, Jaewoong Cho
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through rigorous analysis and empirical exploration, we find that (1) DiT with minimal modifications outperforms U-Net, (2) variable-length modeling with a speech length predictor significantly improves results over fixed-length approaches, and (3) conditions like semantic alignment in speech latent representations are key to further enhancement. By scaling our training data to 82K hours and the model size to 790M parameters, we achieve superior or comparable zero-shot performance to state-of-the-art TTS models in naturalness, intelligibility, and speaker similarity, all without relying on domain-specific factors. Speech samples are available at https://ditto-tts.github.io. |
| Researcher Affiliation | Industry | Keon Lee1, Dong Won Kim1, Jaehyeon Kim2, Seungjun Chung1, Jaewoong Cho1 1KRAFTON, 2NVIDIA |
| Pseudocode | No | The paper includes Figure 1, which provides an overview diagram of the DiTTo-TTS model. While it illustrates the components and their connections, it does not present structured pseudocode or an algorithm block with step-by-step instructions in a code-like format. |
| Open Source Code | No | Additionally, if legal concerns can be addressed, we plan to gradually release the inference code, pre-trained weights, and eventually the full training implementation, enabling the research community to further explore and validate our findings. |
| Open Datasets | Yes | We employ publicly available speech-transcript datasets totaling 82K hours from over 12K unique speakers across nine languages: English, Korean, German, Dutch, French, Spanish, Italian, Portuguese, and Polish. ... Details of each dataset are provided in Appendix A.2. ... MLS (Pratap et al., 2020), GigaSpeech (Chen et al., 2021), LibriTTS-R (Koizumi et al., 2023), VCTK (Veaux et al., 2016) and (5) LJSpeech (Ito & Johnson, 2017)... LibriSpeech (Panayotov et al., 2015)... AIHub (www.aihub.or.kr)... KsponSpeech (Bang et al., 2020)... Expresso Dataset (Nguyen et al., 2023)... |
| Dataset Splits | Yes | For the evaluation of DiTTo-en and baseline models, we use the test-clean subset of LibriSpeech, which consists of speech clips ranging from 4 to 10 seconds with transcripts. For DiTTo-multi, we randomly select 100 examples from the test set of each language dataset, with clip durations ranging from 4 to 20 seconds. ... We also conduct two experiments to evaluate dataset scalability using 0.5K, 5.5K, 10.5K, and 50.5K-hour subsets. |
| Hardware Specification | Yes | All models are trained on 4 NVIDIA A100 40GB GPUs, and use T = 1,000 discrete diffusion steps. |
| Software Dependencies | No | The paper mentions several software components and models used (e.g., SpeechT5, ByT5, AdamW optimizer, BigVGAN, CTC-based HuBERT-Large model, OpenAI's Whisper model, NVIDIA's NeMo-text-processing, WavLM-TDCNN) but does not provide specific version numbers for these dependencies, which are crucial for reproducibility. |
| Experiment Setup | Yes | All models are trained on 4 NVIDIA A100 40GB GPUs, and use T = 1,000 discrete diffusion steps. The S and B models of DiTTo-en are trained with a maximum token size of 5,120 and a gradient accumulation step of 2 over 1M steps. The L and XL models are trained with a maximum token size of 1,280 and a gradient accumulation step of 4 over 1M steps. The DiTTo-multi model is trained only in the XL configuration, with a maximum token size of 320 and a gradient accumulation step of 4 over 1M steps. ... We use the AdamW optimizer (Loshchilov & Hutter, 2019) with the learning rate of 1e-4, beta values of (0.9, 0.999), and a weight decay of 0.0. We use a cosine learning rate scheduler with a warmup of 1K steps. ... we determine the optimal noise and CFG scales to be 0.3 and 5.0, respectively. |
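The learning-rate schedule reported in the Experiment Setup row (base LR 1e-4, 1K warmup steps, cosine scheduler over 1M training steps) can be sketched as a small standalone function. This is a minimal illustration under stated assumptions, not the authors' code: the linear warmup shape and the decay-to-zero endpoint are assumptions, since the excerpt specifies only "cosine scheduler with a warmup of 1K steps".

```python
import math

# Values reported in the Experiment Setup row.
BASE_LR = 1e-4
WARMUP_STEPS = 1_000
TOTAL_STEPS = 1_000_000

def learning_rate(step: int) -> float:
    """LR at a given training step: linear warmup, then cosine decay.

    Assumed shape: warmup ramps linearly from 0 to BASE_LR, and the
    cosine segment decays from BASE_LR to 0 at TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS  # linear warmup
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

print(learning_rate(500))    # mid-warmup: 5e-5
print(learning_rate(1_000))  # peak: 1e-4
```

In PyTorch this schedule would typically be wrapped in a `LambdaLR` around the reported `AdamW(lr=1e-4, betas=(0.9, 0.999), weight_decay=0.0)` optimizer.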