DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis

Authors: Yinghao Aaron Li, Rithesh Kumar, Zeyu Jin

ICML 2025

Reproducibility assessment (each variable is listed with its result, followed by the LLM response):
Research Type: Experimental
    "Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude."
Researcher Affiliation: Collaboration
    "Yinghao Aaron Li 1, Rithesh Kumar 2, Zeyu Jin 2. *Equal contribution. 1 Columbia University; work done during an internship at Adobe. 2 Adobe Research. Correspondence to: Yinghao Aaron Li <EMAIL>."
Pseudocode: Yes
    "Our sampling algorithm of the student (DMOSpeech) is similar to that of the consistency model (Song et al., 2023). The sampling procedure is outlined in Algorithm 1."
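The paper's Algorithm 1 is not reproduced in this report, but consistency-model multistep sampling (Song et al., 2023) follows a well-known pattern: denoise in one generator call, then re-noise to the next noise level. The sketch below illustrates that pattern only; the function `multistep_sample` and the generator interface `g(x, sigma)` are assumptions for illustration, not the paper's actual API.

```python
import random

def multistep_sample(g, sigmas, dim, seed=0):
    """Consistency-model-style multistep sampling sketch.

    g(x, sigma) is a placeholder for the one-step student generator,
    mapping a noisy sample at noise level sigma to a clean estimate.
    sigmas is a decreasing sequence of noise levels.
    """
    rng = random.Random(seed)
    # Start from pure noise at the highest noise level.
    x = [rng.gauss(0.0, sigmas[0]) for _ in range(dim)]
    for i, sigma in enumerate(sigmas):
        x0 = g(x, sigma)  # one generator evaluation per step
        if i + 1 < len(sigmas):
            # Re-noise the clean estimate down to the next (smaller) level.
            s = sigmas[i + 1]
            x = [v + s * rng.gauss(0.0, 1.0) for v in x0]
        else:
            x = x0  # final step keeps the clean estimate
    return x
```

With a single noise level in `sigmas`, this loop degenerates to the one-step generation that gives the distilled student its speed advantage over the iterative teacher.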
Open Source Code: No
    "The audio samples are available at https://dmospeech.github.io."
Open Datasets: Yes
    "We conducted our experiments on the Libri-Light dataset (Kahn et al., 2020), which consists of 57,706.4 hours of audio from 7,439 speakers. We trained our ASR model on the Common Voice (Ardila et al., 2019) and Libri-Light (Kahn et al., 2020) datasets for 200k steps with the AdamW (Loshchilov & Hutter, 2018) optimizer. To further demonstrate the general applicability of our framework, we also conducted experiments training a DMOSpeech model using F5-TTS (Chen et al., 2024c) as the teacher model on the Emilia dataset (He et al., 2024) and compared it against other recent state-of-the-art models, including F5-TTS itself and MaskGCT (Wang et al., 2024). We followed the setup in (Chen et al., 2024c) to train the teacher model and used the same hyperparameters detailed in Section 4.1 to train the student model for 200k steps. Our model and the baseline models were evaluated on the Seed-TTS test set (Anastassiou et al., 2024)."
Dataset Splits: Yes
    "We conducted our experiments on the Libri-Light dataset (Kahn et al., 2020)... For both experiments, the samples were downsampled to 16 kHz for fairness, and prompts were transcribed using WhisperX for synthesis. For subjective evaluation, we selected 80 samples, ensuring that each speaker from the test-clean subset was represented by two samples."
Hardware Specification: Yes
    "All models were trained on 24 NVIDIA A100 40GB GPUs. The real-time factor (RTF) of the distilled model is 13.7 times lower than that of the teacher model, which is lower than all baseline methods by a large margin. The RTF was computed on an NVIDIA V100 GPU, except for DiTTo-TTS and CLaM-TTS, whose RTFs were obtained from their papers using the inference time needed to synthesize 10 s of speech divided by 10, on unknown devices."
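The RTF metric quoted above has a simple definition worth spelling out for reproduction attempts: synthesis wall-clock time divided by the duration of the audio produced. A minimal sketch (the function name is our own, not from the paper):

```python
def real_time_factor(inference_seconds, audio_seconds):
    """RTF = wall-clock synthesis time / duration of audio produced.

    Values below 1.0 mean faster-than-real-time synthesis; e.g. the
    "10 s of speech divided by 10" procedure quoted above is exactly
    real_time_factor(inference_seconds, 10.0).
    """
    return inference_seconds / audio_seconds
```

Comparing models by RTF is only meaningful on the same device, which is why the report flags that two baselines' RTFs came from their papers on unknown hardware.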
Software Dependencies: No
    The paper mentions specific tools like "Phonemizer" and "WhisperX" but does not provide version numbers for these or other key software components used in the methodology (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup: Yes
    "The teacher model fϕ was trained for 400,000 steps with a batch size of 384, using the AdamW optimizer (Loshchilov & Hutter, 2018) with β1 = 0.9, β2 = 0.999, weight decay of 10^-2, and an initial learning rate of 10^-4. The learning rate followed a cosine decay schedule with a 4,000-step warmup, gradually decreasing to 10^-5. Model weights were updated using an exponential moving average (EMA) with a decay factor of 0.99 every 100 steps. The teacher model consists of 450M parameters in total. For student training, we initialized both the student generator Gθ and the student score model gψ with the EMA-weighted teacher parameters. The initial learning rate was set to match the final learning rate of the teacher model (λ = 10^-5), while the batch size was reduced to 96 due to memory constraints. We set λadv = 10^-3 to ensure the gradient norm of Ladv is comparable to that of LDMD. During the early training stage, we observed that the gradient norms of LSV and LCTC were significantly higher than that of LDMD, likely because Gθ was still learning to generate intelligible speech in a single step. To address this, we set λCTC = 0 and λSV = 0 for the first 5,000 and 10,000 iterations, respectively. After that, both λCTC and λSV are set to 1."
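The quoted schedule is concrete enough to reimplement. The sketch below encodes the warmup-plus-cosine learning-rate schedule and the staged loss-weight gating exactly as described; the function names `lr_at` and `loss_weights`, and the assumption of linear warmup from zero, are ours, not the paper's.

```python
import math

def lr_at(step, total=400_000, warmup=4_000, lr_max=1e-4, lr_min=1e-5):
    """Teacher LR: linear warmup to 1e-4, then cosine decay to 1e-5.

    Step counts and rates are taken from the quoted setup; the warmup
    is assumed linear from zero, which the paper does not specify.
    """
    if step < warmup:
        return lr_max * step / warmup
    t = (step - warmup) / (total - warmup)  # fraction of decay phase elapsed
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

def loss_weights(step):
    """Staged gating of the CTC and speaker-verification loss weights:
    zero for the first 5,000 / 10,000 iterations respectively, then 1."""
    lam_ctc = 0.0 if step < 5_000 else 1.0
    lam_sv = 0.0 if step < 10_000 else 1.0
    return lam_ctc, lam_sv
```

The gating reflects the stated rationale: early in training the single-step generator is not yet intelligible, so the CTC and SV losses would dominate the distillation loss LDMD if enabled from step 0.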