Presto! Distilling Steps and Layers for Accelerating Music Generation
Authors: Zachary Novack, Ge Zhu, Jonah Casebeer, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our step and layer distillation methods independently and show each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435 ms latency for 32 seconds of mono/stereo 44.1 kHz audio, 15x faster than the comparable SOTA model), the fastest TTM to our knowledge. We show the efficacy of Presto via a number of experiments. We first ablate the design choices afforded by Presto-S, and separately show how Presto-L flatly improves standard diffusion sampling. We then show how Presto-L and Presto-S stack up against SOTA baselines, and how we can combine such approaches for further acceleration, with both quantitative and subjective metrics. |
| Researcher Affiliation | Collaboration | Zachary Novack (UC San Diego), Ge Zhu & Jonah Casebeer (Adobe Research), Julian McAuley & Taylor Berg-Kirkpatrick (UC San Diego), Nicholas J. Bryan (Adobe Research) |
| Pseudocode | Yes | We develop our score-based distribution matching step distillation, Presto-S, below and in Fig. 1, as well as the algorithm in Appendix A.3, a pseudo-code walkthrough in Appendix A.4, and expanded visualization in Appendix A.5. |
| Open Source Code | No | Following these concerns, we do not plan to release our model, but have done our best to compare against multiple open source baselines and/or re-train alternative methods for comparison and in-depth understanding of the reproducible insights of our work. |
| Open Datasets | Yes | For evaluation, we use Song Describer (no vocals) (Manco et al., 2023) split into 32 second chunks. |
| Dataset Splits | No | We use a 3.6K hour dataset of mono 44.1 kHz licensed instrumental music, augmented with pitch-shifting and time-stretching. Data includes musical meta-data and synthetic captions. For evaluation, we use Song Describer (no vocals) (Manco et al., 2023) split into 32 second chunks. |
| Hardware Specification | Yes | The base model was trained for 5 days across 32 A100 GPUs with a batch size of 14 and learning rate of 1e-4 with Adam. For all step distillation methods, we distill each model with a batch size of 80 across 16 Nvidia A100 GPUs for 32K iterations. We train all layer distillation methods for 60K iterations with a batch size of 12 across 16 A100 GPUs with a learning rate of 8e-5. For Presto-L, we set ν = 0.1. On an Intel Xeon Platinum 8275CL CPU, we achieve a mono RTF of 0.74, generating 32 seconds of audio in 43.34 seconds. |
| Software Dependencies | Yes | We use FlashAttention-2 (Dao, 2023) for the DiT and PyTorch 2.0's built-in graph compilation (Ansel et al., 2024) for the VAE decoder and MusicHiFi mono-to-stereo. |
| Experiment Setup | Yes | Specifically, we set σdata = 0.5, Pmean = 0.4, Pstd = 1.0, σmax = 80, σmin = 0.002. We train the base model with 10% condition dropout to enable CFG. The base model was trained for 5 days across 32 A100 GPUs with a batch size of 14 and learning rate of 1e-4 with Adam. For all score model experiments, we use CFG++ (Chung et al., 2024) with w = 0.8. For Presto-S, following Yin et al. (2024) we use a fixed guidance scale of w = 4.5 throughout distillation for the teacher model as CFG++ is not applicable for the distribution matching gradient. We use 5 fake score model (and discriminator) updates per generator update, following Yin et al. (2024), as we found little change in performance when varying the quantity around 5 (though using 3 updates resulted in large training instability). Additionally, we use a learning rate of 5e-7 with Adam for both the generator and fake score model / discriminator. We set ν1 = 0.01 and ν2 = 0.005 following Yin et al. (2024). |
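The noise-schedule hyperparameters quoted in the Experiment Setup row (σdata = 0.5, Pmean = 0.4, Pstd = 1.0, σmax = 80, σmin = 0.002) follow the EDM convention of Karras et al. (2022), where training noise levels are drawn log-normally and each level gets an analytic loss weight. A minimal sketch of that sampling scheme, using only the reported values (the function names are illustrative, not from the paper):

```python
import math
import random

# Hyperparameters as reported in the Experiment Setup row.
SIGMA_DATA = 0.5   # assumed data standard deviation (EDM sigma_data)
P_MEAN, P_STD = 0.4, 1.0          # log-normal noise distribution
SIGMA_MIN, SIGMA_MAX = 0.002, 80.0  # noise-level clipping range

def sample_sigma(rng: random.Random) -> float:
    """Draw a training noise level: ln(sigma) ~ N(P_mean, P_std^2),
    clipped to [sigma_min, sigma_max] (standard EDM practice)."""
    sigma = math.exp(rng.gauss(P_MEAN, P_STD))
    return min(max(sigma, SIGMA_MIN), SIGMA_MAX)

def loss_weight(sigma: float) -> float:
    """EDM loss weighting lambda(sigma) = (sigma^2 + sigma_data^2)
    / (sigma * sigma_data)^2, which equalizes loss across noise levels."""
    return (sigma**2 + SIGMA_DATA**2) / (sigma * SIGMA_DATA) ** 2

rng = random.Random(0)
sigmas = [sample_sigma(rng) for _ in range(4)]
```

Note the distribution is centered slightly above σ = 1 (exp(0.4) ≈ 1.49), biasing training toward mid-to-high noise levels, which is consistent with the paper's focus on few-step generation.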