IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling

Authors: Kuan-Po Huang, Shu-Wen Yang, Huy Phan, Bo-Ru Lu, Byeonggeun Kim, Sashank Macha, Qingming Tang, Shalini Ghosh, Hung-Yi Lee, Chieh-Chi Kao, Chao Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. Results on AudioCaps demonstrate that IMPACT achieves state-of-the-art performance on key metrics including Fréchet Distance (FD) and Fréchet Audio Distance (FAD) while significantly reducing latency compared to prior models. The project website is available at https://audio-impact.github.io/.
Researcher Affiliation: Collaboration. 1National Taiwan University, Taipei, Taiwan; 2Amazon AGI, United States. Correspondence to: Kuan-Po Huang <EMAIL>, Chieh-Chi Kao <EMAIL>.
Pseudocode: No. The paper describes the training and inference phases with textual descriptions and diagrams (Figure 1), but no explicit pseudocode blocks or algorithms are provided.
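Although the paper itself gives no pseudocode, the iterative mask-based parallel decoding named in the title follows the general MaskGIT-style pattern: start fully masked, and over T iterations commit the model's most confident predictions while re-masking the rest on a cosine schedule. The sketch below illustrates only that generic pattern, not the authors' implementation; `parallel_decode`, `predict_fn`, and the toy model are all illustrative assumptions.

```python
import math
import random

def parallel_decode(seq_len, total_iters, predict_fn, seed=0):
    """Generic MaskGIT-style iterative parallel decoding sketch.

    Starts from a fully masked sequence. At each of `total_iters`
    iterations, `predict_fn(tokens, pos, rng) -> (value, confidence)`
    proposes a value for every masked position; a cosine schedule keeps
    the least-confident positions masked for the next round.
    """
    rng = random.Random(seed)
    MASK = None
    tokens = [MASK] * seq_len
    for t in range(total_iters):
        # Cosine schedule: fraction of positions still masked after this step.
        frac = math.cos(math.pi / 2 * (t + 1) / total_iters)
        n_keep_masked = int(seq_len * frac)
        # Query the (stand-in) model for all currently masked positions.
        proposals = {i: predict_fn(tokens, i, rng)
                     for i, v in enumerate(tokens) if v is MASK}
        # Commit the most confident proposals; leave the rest masked.
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (value, _conf) in ranked[: len(ranked) - n_keep_masked]:
            tokens[i] = value
    return tokens

# Toy stand-in model: deterministic value, random confidence.
def toy_predict(tokens, pos, rng):
    return pos % 4, rng.random()

out = parallel_decode(seq_len=8, total_iters=4, predict_fn=toy_predict)
# After the final iteration every position is committed; no masks remain.
```

The key design point is that each iteration predicts all masked positions in parallel, so the number of model calls is T rather than the sequence length, which is where the latency reduction over purely autoregressive decoding comes from.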
Open Source Code: No. The project website is available at https://audio-impact.github.io/. This is a project website, not a direct link to the source code for the methodology described in the paper. The text mentions 'Checkpoints can be accessed from the official third-party MAR GitHub repository https://github.com/LTH14/mar?tab=readme-ov-file#preparation', which is for a related work (MAR), not for IMPACT's code.
Open Datasets: Yes. Specifically, we employ the AudioCaps (AC) (Kim et al., 2019) training split, which contains 145 hours of audio, and a combined dataset of AudioCaps (AC) and WavCaps (WC) (Mei et al., 2024), totaling 1200 hours of audio. Although AudioSet (AS) (Gemmeke et al., 2017) is currently the largest audio dataset, with about 5500 hours of audio data, most of its audio samples lack text descriptions, so it is only used for unconditional pre-training.
Dataset Splits: Yes. Specifically, we employ the AudioCaps (AC) (Kim et al., 2019) training split, which contains 145 hours of audio, and a combined dataset of AudioCaps (AC) and WavCaps (WC) (Mei et al., 2024), totaling 1200 hours of audio. We evaluate our text-to-audio generation model on the AC evaluation set.
Hardware Specification: Yes. To evaluate inference speed, we measure latency, also referred to as inference time, reported in seconds for generating a batch of audio samples on a single Tesla V100 GPU with 32 GB VRAM.
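The quoted latency protocol (seconds per generated batch on one GPU) can be reproduced with a simple wall-clock harness. This is a minimal sketch, not the authors' benchmarking code; `measure_latency` and `generate_fn` are hypothetical names, and the sleeping stand-in replaces a real model call.

```python
import time

def measure_latency(generate_fn, batch, warmup=1, runs=3):
    """Average wall-clock seconds per call to generate one batch.

    `generate_fn` stands in for the model's sampling call. On a real
    GPU you would also synchronize the device (e.g. with
    torch.cuda.synchronize()) before reading the clock, so that queued
    kernels are included in the measurement.
    """
    for _ in range(warmup):          # warm-up run excludes one-time setup cost
        generate_fn(batch)
    start = time.perf_counter()
    for _ in range(runs):
        generate_fn(batch)
    return (time.perf_counter() - start) / runs

# Example with a stand-in generator that just sleeps for 10 ms.
latency = measure_latency(lambda batch: time.sleep(0.01), batch=[0] * 4)
```

Averaging over several runs after a warm-up pass is standard practice; a single cold run would fold model loading and allocator setup into the reported number.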
Software Dependencies: No. The paper mentions software components like the AdamW optimizer, CLAP, Flan-T5, and evaluation packages such as audioldm_eval and clap-htsat-fused, but does not provide specific version numbers for any of these.
Experiment Setup: Yes. During text-conditional training, we set the masking percentage factor q = 0.7. For all model training, we adopt the AdamW optimizer and set the learning rate to 5e-5. For inference, by default, the total number of decoding iterations T is set to 64 unless otherwise specified. For classifier-free guidance, we list the details in Appendix A. During training, the maximum number of diffusion steps T̂_max is set to 1000. During inference, the total number of diffusion sampling steps T̂ is set to 100 unless specified explicitly. The base configuration uses an embedding dimension D of 768 and incorporates 24 transformer layers in the latent encoder. In contrast, the large configuration increases the embedding dimension to D = 1024 and employs 32 transformer layers in the encoder. The maximum classifier-free-guidance scale β_cfg^max is set to 5.0 by default unless otherwise specified.
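The hyperparameters quoted above can be collected into a single configuration object, which makes a reproduction attempt easier to audit. The field names below are illustrative assumptions; only the values come from the paper's quoted text.

```python
from dataclasses import dataclass

@dataclass
class ImpactConfig:
    """Hyperparameters quoted in the paper; field names are illustrative."""
    mask_ratio_q: float = 0.7            # text-conditional masking percentage factor
    learning_rate: float = 5e-5          # AdamW, all model training
    decode_iters_T: int = 64             # parallel-decoding iterations at inference
    diffusion_steps_train_max: int = 1000  # T̂_max during training
    diffusion_steps_infer: int = 100       # T̂ sampling steps at inference
    cfg_scale_max: float = 5.0           # max classifier-free-guidance scale
    embed_dim: int = 768                 # 1024 for the large configuration
    num_encoder_layers: int = 24         # 32 for the large configuration

# Large configuration differs only in model width and depth.
large = ImpactConfig(embed_dim=1024, num_encoder_layers=32)
```

Grouping the base and large configurations this way also makes it explicit that everything except embedding dimension and encoder depth is shared between the two.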