Presto! Distilling Steps and Layers for Accelerating Music Generation

Authors: Zachary Novack, Ge Zhu, Jonah Casebeer, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan

ICLR 2025

Reproducibility: Variable | Result | LLM Response
Research Type | Experimental | We evaluate our step and layer distillation methods independently and show each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435 ms latency for 32-second mono/stereo 44.1 kHz audio; 15x faster than the comparable SOTA model), the fastest TTM to our knowledge. We show the efficacy of Presto via a number of experiments. We first ablate the design choices afforded by Presto-S, and separately show how Presto-L flatly improves standard diffusion sampling. We then show how Presto-L and Presto-S stack up against SOTA baselines, and how we can combine such approaches for further acceleration, with both quantitative and subjective metrics.
Researcher Affiliation | Collaboration | Zachary Novack (UC San Diego), Ge Zhu & Jonah Casebeer (Adobe Research), Julian McAuley & Taylor Berg-Kirkpatrick (UC San Diego), Nicholas J. Bryan (Adobe Research)
Pseudocode | Yes | We develop our score-based distribution matching step distillation, Presto-S, below and in Fig. 1, as well as the algorithm in Appendix A.3, a pseudo-code walkthrough in Appendix A.4, and an expanded visualization in Appendix A.5.
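The distribution-matching step distillation referenced above (Presto-S builds on Yin et al., 2024) can be illustrated with a toy 1-D sketch. Everything below is an illustrative assumption, not the paper's implementation: the "real" (teacher) and "fake" (generator) distributions are unit-variance Gaussians so both score functions are closed-form, and the one-step generator is a single shift parameter.

```python
import random

# Toy 1-D sketch of DMD-style distribution matching distillation (assumed
# setup, not the paper's code). The generator x = z + theta is pushed toward
# the "real" distribution by descending the gap between fake and real scores.

REAL_MU, REAL_STD = 2.0, 1.0  # hypothetical teacher distribution N(2, 1)

def real_score(x: float) -> float:
    """Score of the real distribution N(REAL_MU, REAL_STD^2) at x."""
    return (REAL_MU - x) / REAL_STD**2

def distill(steps: int = 2000, lr: float = 0.05, seed: int = 0) -> float:
    rng = random.Random(seed)
    theta = 0.0  # one-step generator parameter: x = z + theta, z ~ N(0, 1)
    for _ in range(steps):
        z = rng.gauss(0.0, 1.0)
        x = z + theta
        fake_score = theta - x  # score of the generator's output dist N(theta, 1)
        # DMD generator gradient direction: fake score minus real score
        grad = fake_score - real_score(x)
        theta -= lr * grad
    return theta

theta = distill()  # converges toward REAL_MU = 2.0
```

With matched variances the gradient reduces to `theta - REAL_MU`, so the generator mean converges to the teacher mean; the paper's method applies the same gradient direction in the latent space of a diffusion model, with learned real/fake score networks.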
Open Source Code | No | Following these concerns, we do not plan to release our model, but have done our best to compare against multiple open-source baselines and/or re-train alternative methods for comparison and in-depth understanding of the reproducible insights of our work.
Open Datasets | Yes | For evaluation, we use Song Describer (no vocals) (Manco et al., 2023) split into 32-second chunks.
Dataset Splits | No | We use a 3.6K-hour dataset of mono 44.1 kHz licensed instrumental music, augmented with pitch-shifting and time-stretching. Data includes musical metadata and synthetic captions. For evaluation, we use Song Describer (no vocals) (Manco et al., 2023) split into 32-second chunks.
Hardware Specification | Yes | The base model was trained for 5 days across 32 A100 GPUs with a batch size of 14 and a learning rate of 1e-4 with Adam. For all step distillation methods, we distill each model with a batch size of 80 across 16 NVIDIA A100 GPUs for 32K iterations. We train all layer distillation methods for 60K iterations with a batch size of 12 across 16 A100 GPUs with a learning rate of 8e-5. For Presto-L, we set ν = 0.1. On an Intel Xeon Platinum 8275CL CPU, we achieve a mono RTF of 0.74, generating 32 seconds of audio in 43.34 seconds.
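The quoted real-time figure can be checked directly. Note the convention: the quoted numbers (RTF 0.74 for 32 s of audio generated in 43.34 s) are consistent with RTF defined as audio duration divided by wall-clock generation time; some communities use the inverse ratio, so this definition is an inference from the figures, not stated in the excerpt.

```python
# Sketch of the RTF arithmetic implied by the quoted CPU benchmark.
# Convention assumed here: RTF = audio_seconds / generation_seconds,
# which matches 32 / 43.34 ≈ 0.74.

def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Ratio of audio produced to wall-clock time spent generating it."""
    return audio_seconds / generation_seconds

rtf = real_time_factor(32.0, 43.34)
print(round(rtf, 2))  # 0.74
```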
Software Dependencies | Yes | We use FlashAttention-2 (Dao, 2023) for the DiT and PyTorch 2.0's built-in graph compilation (Ansel et al., 2024) for the VAE decoder and MusicHiFi mono-to-stereo.
Experiment Setup | Yes | Specifically, we set σ_data = 0.5, P_mean = 0.4, P_std = 1.0, σ_max = 80, σ_min = 0.002. We train the base model with 10% condition dropout to enable CFG. The base model was trained for 5 days across 32 A100 GPUs with a batch size of 14 and a learning rate of 1e-4 with Adam. For all score model experiments, we use CFG++ (Chung et al., 2024) with w = 0.8. For Presto-S, following Yin et al. (2024), we use a fixed guidance scale of w = 4.5 throughout distillation for the teacher model, as CFG++ is not applicable for the distribution matching gradient. We use 5 fake score model (and discriminator) updates per generator update, following Yin et al. (2024), as we found little change in performance when varying the quantity around 5 (though using 3 updates resulted in large training instability). Additionally, we use a learning rate of 5e-7 with Adam for both the generator and the fake score model / discriminator. We set ν1 = 0.01 and ν2 = 0.005 following Yin et al. (2024).
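The σ and P hyperparameters above are the EDM-style (Karras et al., 2022) noise-schedule settings. A minimal sketch of how such parameters are typically used during training, assuming a log-normal draw of the noise level clipped to [σ_min, σ_max] and the standard EDM loss weighting (the clipping and weighting details are assumptions, not stated in the excerpt):

```python
import math
import random

# Hyperparameters from the quoted setup (EDM-style notation).
SIGMA_DATA = 0.5
P_MEAN, P_STD = 0.4, 1.0
SIGMA_MIN, SIGMA_MAX = 0.002, 80.0

def sample_sigma(rng: random.Random) -> float:
    """Draw ln(sigma) ~ N(P_MEAN, P_STD^2), then clip to [SIGMA_MIN, SIGMA_MAX]."""
    log_sigma = rng.gauss(P_MEAN, P_STD)
    return min(max(math.exp(log_sigma), SIGMA_MIN), SIGMA_MAX)

def edm_loss_weight(sigma: float) -> float:
    """Standard EDM weighting: (sigma^2 + sigma_data^2) / (sigma * sigma_data)^2."""
    return (sigma**2 + SIGMA_DATA**2) / (sigma * SIGMA_DATA) ** 2

rng = random.Random(0)
sigmas = [sample_sigma(rng) for _ in range(4)]  # per-example noise levels
```

During training, each example is perturbed with noise of standard deviation `sigma` and the denoising loss is scaled by `edm_loss_weight(sigma)`; the 10% condition dropout mentioned above would be applied independently to the conditioning inputs.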