DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation

Authors: Dongya Jia, Zhuo Chen, Jiawei Chen, Chenpeng Du, Jian Wu, Jian Cong, Xiaobin Zhuang, Chumin Li, Zhen Wei, Yuping Wang, Yuxuan Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness. Audio samples are presented in https://spicyresearch.github.io/ditar. We apply DiTAR to zero-shot speech generation and achieve SOTA performance with a much lower computational load. In this subsection, we benchmark DiTAR against leading systems and demonstrate its state-of-the-art performance. We conduct a multi-dimensional comparison of DiTAR with other baseline works. For objective metrics, Table 1 presents the evaluation results on LibriSpeech test-clean. For subjective evaluation, we invite 10 English experts to rate the generated audio.
Researcher Affiliation | Industry | Dongya Jia, Zhuo Chen, Jiawei Chen, Chenpeng Du, Jian Wu, Jian Cong, Xiaobin Zhuang, Chumin Li, Zhen Wei, Yuping Wang, and Yuxuan Wang are all affiliated with ByteDance Seed. Correspondence to: Dongya Jia <EMAIL>, Zhuo Chen <EMAIL>.
Pseudocode | Yes | Algorithm 1 (Temperature sampling). Input: v-prediction model fθ(·, ·), discretized time points t_1 < t_2 < … < t_{N−1} ∈ [0, 1) with t_N = 1, ODE solver Ψ(·, ·, ·), transformation function F, and temperature τ. The algorithm begins by selecting η ← argmin_{n=1,…,N} |t_n − τ|.
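The only pieces of Algorithm 1 quoted above are its inputs and the index-selection step η ← argmin_n |t_n − τ|. Those quoted pieces can be sketched as follows, using a plain Euler integrator as a stand-in for the unspecified ODE solver Ψ; the function names and the constant toy velocity model are illustrative, not from the paper:

```python
def nearest_time_index(ts, tau):
    """eta <- argmin_n |t_n - tau|: index of the discretized time point
    closest to the temperature tau (the quoted step of Algorithm 1)."""
    return min(range(len(ts)), key=lambda n: abs(ts[n] - tau))

def euler_ode_solve(f, x, ts):
    """A generic Euler instance of an ODE solver Psi(., ., .):
    integrate dx/dt = f(x, t) across the discretized time points."""
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * f(x, t_cur)
    return x

# Toy usage: a uniform time grid and a constant v-prediction stand-in.
ts = [0.0, 0.25, 0.5, 0.75, 1.0]
eta = nearest_time_index(ts, tau=0.6)                 # picks grid point 0.5
x_final = euler_ode_solve(lambda x, t: 1.0, 0.0, ts)  # integrates to 1.0
```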
Open Source Code | No | The paper provides a link for audio samples ('Audio samples are presented in https://spicyresearch.github.io/ditar.'), but does not provide concrete access to the source code for the methodology described.
Open Datasets | Yes | To ensure a fair evaluation of zero-shot TTS, it is essential to consider prompt audio, texts, and tools. We standardize these variables to facilitate a more objective and fair comparison between systems. Training and Evaluation Dataset. We consider two open-source datasets as our training dataset: 1) Librilight (Kahn et al., 2020), containing 60K hours of English speech data from LibriVox audiobooks; 2) Emilia (He et al., 2024), a multilingual dataset containing around 100K hours of speech. We adopt three open-source datasets for evaluation: 1) LibriSpeech(PC) (Panayotov et al., 2015; Meister et al., 2023) test-clean, containing 40 distinct English speakers and 5.4 hours of speech. We employ two established subsets: subset A from NaturalSpeech 3, featuring 40 three-second speech prompts and 40 target samples, and subset B from F5-TTS, which includes 40 prompts and 1127 samples. 2) Seed-ZH: a subset of DiDiSpeech 2 (Guo et al., 2021), a Chinese speech dataset, containing 1088 prompts and targets. 3) Seed-EN: a subset of Common Voice (Ardila et al., 2019), a crowdsourced English speech dataset with diverse accents, containing 2020 prompts and targets.
Dataset Splits | No | The paper describes the composition and sizes of the evaluation datasets (LibriSpeech test-clean subsets A and B, Seed-ZH, Seed-EN) by specifying the number of prompts, samples, or speakers. However, it does not explicitly provide the training/validation/test splits for its primary training datasets (Librilight and Emilia) needed to reproduce the data partitioning for model development.
Hardware Specification | Yes | We utilize 16 A100 GPUs, each processing a batch size of 15K tokens, and train DiTAR for 0.5M steps. For DiTAR with 1B parameters, we utilize 32 A100 GPUs with a batch size of 7.5K tokens per GPU. We evaluate all metrics by inferring 10 seconds of audio on an A100 GPU.
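The two quoted configurations imply the same global token batch per optimizer step, which can be checked with a line of arithmetic (variable names are illustrative):

```python
# Global tokens per optimizer step implied by the two quoted configurations.
tokens_small = 16 * 15_000  # 0.6B model: 16 A100s x 15K tokens per GPU
tokens_large = 32 * 7_500   # 1B model: 32 A100s x 7.5K tokens per GPU
print(tokens_small, tokens_large)  # both 240000: global batch held constant
```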
Software Dependencies | No | The paper does not provide specific software names with version numbers (e.g., Python, PyTorch, or CUDA versions) that would be needed to replicate the experiments.
Experiment Setup | Yes | We utilize 16 A100 GPUs, each processing a batch size of 15K tokens, and train DiTAR for 0.5M steps. The AdamW optimizer is employed with a constant learning rate of 1e-4, β1 = 0.9, and β2 = 0.99. We conduct comparisons using DiTAR with 0.6 billion parameters and a patch size of 4. During inference, DiTAR's LocDiT uses an NFE (Number of Function Evaluations) of 10. Specific details about the parameters of DiTAR are provided in Appendix A.2 and in Table 8 (configurations of DiTAR with different sizes).
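The quoted optimizer settings (constant lr 1e-4, β1 = 0.9, β2 = 0.99) can be made concrete with a single scalar AdamW update, sketched below; ε and the weight-decay coefficient are assumed defaults here, since the excerpt does not state them:

```python
def adamw_step(theta, grad, m, v, step, lr=1e-4, beta1=0.9, beta2=0.99,
               eps=1e-8, weight_decay=0.0):
    """One scalar AdamW update with the paper's quoted hyperparameters.
    eps and weight_decay are assumed defaults, not stated in the excerpt."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment EMA
    m_hat = m / (1 - beta1 ** step)             # bias correction
    v_hat = v / (1 - beta2 ** step)
    # Decoupled weight decay (the "W" in AdamW) plus the Adam update.
    theta = theta - lr * (m_hat / (v_hat ** 0.5 + eps) + weight_decay * theta)
    return theta, m, v

# First step on a unit gradient moves the parameter by roughly lr = 1e-4.
theta, m, v = adamw_step(theta=0.0, grad=1.0, m=0.0, v=0.0, step=1)
```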