DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
Authors: Dongya Jia, Zhuo Chen, Jiawei Chen, Chenpeng Du, Jian Wu, Jian Cong, Xiaobin Zhuang, Chumin Li, Zhen Wei, Yuping Wang, Yuxuan Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness. Audio samples are presented at https://spicyresearch.github.io/ditar. We apply DiTAR to zero-shot speech generation and achieve SOTA performance with a much lower computational load. In this subsection, we benchmark DiTAR against leading systems and demonstrate its state-of-the-art performance. We conduct a multi-dimensional comparison of DiTAR with other baseline works. For objective metrics, Table 1 presents the evaluation results on LibriSpeech test-clean. For subjective evaluation, we invite 10 English experts to rate the generated audio. |
| Researcher Affiliation | Industry | Dongya Jia 1 Zhuo Chen 1 Jiawei Chen 1 Chenpeng Du 1 Jian Wu 1 Jian Cong 1 Xiaobin Zhuang 1 Chumin Li 1 Zhen Wei 1 Yuping Wang 1 Yuxuan Wang 1 1ByteDance Seed. Correspondence to: Dongya Jia <EMAIL>, Zhuo Chen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Temperature sampling. Input: v-prediction model fθ(·, ·), discretized time points t1 < t2 < ... < tN−1 ∈ [0, 1) with tN = 1, ODE solver Ψ(·, ·, ·), transformation function F, temperature τ. η ← argmin_{n=1,2,...,N} \|tn − τ\| |
| Open Source Code | No | The paper provides a link for audio samples ("Audio samples are presented at https://spicyresearch.github.io/ditar."), but does not provide concrete access to the source code for the methodology described. |
| Open Datasets | Yes | To ensure a fair evaluation of zero-shot TTS, it is essential to consider prompt audio, texts, and tools. We standardize these variables to facilitate a more objective and fair comparison between systems. Training and Evaluation Dataset. We consider two open-source datasets as our training dataset. 1) Librilight (Kahn et al., 2020), containing 60K hours of English speech data from LibriVox audiobooks. 2) Emilia (He et al., 2024), a multilingual dataset containing around 100K hours of speech. We adopt three open-source datasets for evaluation: 1) LibriSpeech(PC) (Panayotov et al., 2015; Meister et al., 2023) test-clean, containing 40 distinct English speakers and 5.4 hours of speech. We employ two established subsets: subset A from NaturalSpeech 3, featuring 40 three-second speech prompts and 40 target samples, and subset B from F5-TTS, which includes 40 prompts and 1127 samples. 2) Seed-ZH: a subset of DiDiSpeech 2 (Guo et al., 2021), a Chinese speech dataset, containing 1088 prompts and targets. 3) Seed-EN: a subset of Common Voice (Ardila et al., 2019), a crowdsourced English speech dataset with diverse accents, containing 2020 prompts and targets. |
| Dataset Splits | No | The paper describes the composition and sizes of the evaluation datasets (LibriSpeech test-clean subsets A and B, Seed-ZH, Seed-EN) by specifying the number of prompts, samples, or speakers. However, it does not explicitly provide the training/validation/test splits for its primary training datasets (Librilight and Emilia) needed to reproduce the data partitioning process for model development. |
| Hardware Specification | Yes | We utilize 16 A100 GPUs, each processing a batch size of 15K tokens, and train DiTAR for 0.5M steps. For DiTAR with 1B parameters, we utilize 32 A100 GPUs with a batch size of 7.5K tokens per GPU. We evaluate all metrics by inferring 10 seconds of audio on an A100 GPU. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | We utilize 16 A100 GPUs, each processing a batch size of 15K tokens, and train DiTAR for 0.5M steps. The AdamW optimizer is employed with a constant learning rate of 1e-4, β1 = 0.9, and β2 = 0.99. We conduct comparisons using DiTAR with 0.6 billion parameters and a patch size of 4. During inference, DiTAR's LocDiT uses an NFE (Number of Function Evaluations) of 10. Specific details about the parameters of DiTAR are provided in Appendix A.2. Table 8: Configurations of DiTAR with different sizes. |
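The Algorithm 1 fragment quoted in the Pseudocode row can be sketched in runnable form. Only the selection η = argmin_n |tn − τ| over the discretized time points and the presence of a v-prediction model are grounded in the quoted text; the explicit-Euler step standing in for the ODE solver Ψ, the identity default for the transformation function F, and all function names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def euler_step(f_theta, x, t, t_next):
    # One explicit-Euler step of dx/dt = f_theta(x, t), where f_theta is
    # the v-prediction model. Stands in for the paper's ODE solver Psi.
    return x + (t_next - t) * f_theta(x, t)

def temperature_sample(f_theta, time_points, tau, x_init,
                       transform_F=lambda x, t: x):
    """Hedged sketch of Algorithm 1 (temperature sampling).

    time_points: discretized t_1 < ... < t_{N-1} in [0, 1), t_N = 1.
    tau: temperature; eta = argmin_n |t_n - tau| picks the start index,
    as in the quoted fragment. How F is applied and how the solver is
    driven from t_eta to 1 are assumptions for illustration.
    """
    time_points = np.asarray(time_points)
    eta = int(np.argmin(np.abs(time_points - tau)))   # eta = argmin_n |t_n - tau|
    x = transform_F(x_init, time_points[eta])          # assumed use of F
    for n in range(eta, len(time_points) - 1):         # integrate t_eta -> 1
        x = euler_step(f_theta, x, time_points[n], time_points[n + 1])
    return x
```

With a zero velocity field the sample is returned unchanged, and with a constant unit field the integral over [0, 1] adds exactly 1, which makes the sketch easy to sanity-check.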