Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction

Authors: Shu-Wen Yang, Byeonggeun Kim, Kuan-Po Huang, Qingming Tang, Huy Phan, Bo-Ru Lu, Harshavardhan Sundar, Shalini Ghosh, Hung-Yi Lee, Chieh-Chi Kao, Chao Wang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our approach delivers significant improvements over the previous discrete solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Fréchet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. ... On AudioCaps, the innovation yields 41% and 33% relative FAD improvements over AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with the state-of-the-art (SOTA) diffusion models. ... 5. Experiments ... Table 1. Main results. FD, FAD, KL, IS, and CLAP metrics on the AudioCaps evaluation set. ... 5.3. Ablation studies
Researcher Affiliation Collaboration 1Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan 2Amazon AGI, Bellevue, United States. Correspondence to: Shu-wen Yang <EMAIL>, Chieh-Chi Kao <EMAIL>.
Pseudocode No The paper describes the methodology using textual explanations and figures (Figure 2, Figure 3) but does not include any explicitly labeled 'Pseudocode', 'Algorithm', or structured algorithm blocks.
Open Source Code No The paper states: 'We mostly follow the implementation in MAR (Li et al., 2024b), including the training/inference details of the MLP diffusion head, and the architecture design of MLP and the Transformer decoder.' and references 'https://github.com/LTH14/mar' in footnote 9 as the source for image MAR initialization. However, it does not explicitly state that the authors are releasing their own code for the specific methodology described in this paper.
Open Datasets Yes We train our model on AudioCaps (AC) (Kim et al., 2019) and WavCaps (WC) (Mei et al., 2024).
Dataset Splits No We train our model on AudioCaps (AC) (Kim et al., 2019) and WavCaps (WC) (Mei et al., 2024). ... We evaluate our model on the AC evaluation set... Audios longer than 10 seconds are randomly cropped into 10-second clips. That is, the number of text-audio pairs is the same after the pre-processing. While an evaluation set is mentioned and audio cropping is described, specific train/validation/test split percentages or sample counts for the datasets are not provided.
Hardware Specification Yes We train the Base model with 40 NVIDIA V100 GPUs, and the Large model requires 104.
Software Dependencies No The paper mentions using the AdamW optimizer and references specific prior works for diffusion process details and Transformer implementation (e.g., 'Transformer (Vaswani, 2017) implementation in ViT (Wang et al., 2021)'), but it does not provide specific version numbers for any software libraries or dependencies like Python, PyTorch, or CUDA.
Experiment Setup Yes We use AdamW (Loshchilov & Hutter, 2017) optimizer with a fixed learning rate 1.0 × 10−4. We train the Base model with 40 NVIDIA V100 GPUs... Our effective batch size is 2048 10-second clips. We train the Base and the Large model for 1000 epochs, about 2 days and 5 days, respectively. ... We set ω0 = 7 as default ... we set τ = 1 as the default.
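The experiment setup quotes AdamW with a fixed learning rate of 1.0 × 10−4. As a reference for what that optimizer actually computes, the following is a minimal sketch of a single AdamW update on one scalar parameter. Only the learning rate comes from the paper; `beta1`, `beta2`, `eps`, and `weight_decay` are standard defaults assumed here, since the paper does not list them.

```python
import math

def adamw_step(param, grad, m, v, t, lr=1e-4,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter at step t (1-indexed).

    Only lr=1e-4 is taken from the paper; the remaining
    hyperparameters are assumed common defaults.
    """
    # Decoupled weight decay: shrink the parameter directly,
    # independently of the gradient moments (AdamW's change vs. Adam).
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialized moment estimates.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Gradient-based update with the fixed learning rate.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# One step from param=1.0 with gradient 0.5.
p, m, v = adamw_step(1.0, 0.5, 0.0, 0.0, t=1)
```

Because the learning rate is fixed (no warmup or decay schedule is quoted), every step applies this same rule with `lr=1e-4` unchanged throughout the 1000 training epochs.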