Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction
Authors: Shu-Wen Yang, Byeonggeun Kim, Kuan-Po Huang, Qingming Tang, Huy Phan, Bo-Ru Lu, Harshavardhan Sundar, Shalini Ghosh, Hung-Yi Lee, Chieh-Chi Kao, Chao Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach delivers significant improvements over the previous discrete solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Fréchet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. ... On AudioCaps, the innovation yields 41% and 33% relative FAD improvements over the AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with the state-of-the-art (SOTA) diffusion models. ... 5. Experiments ... Table 1. Main results. FD, FAD, KL, IS, and CLAP metrics on the AudioCaps evaluation set. ... 5.3. Ablation studies |
| Researcher Affiliation | Collaboration | 1Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan 2Amazon AGI, Bellevue, United States. Correspondence to: Shu-wen Yang <EMAIL>, Chieh-Chi Kao <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using textual explanations and figures (Figure 2, Figure 3) but does not include any explicitly labeled 'Pseudocode', 'Algorithm', or structured algorithm blocks. |
| Open Source Code | No | The paper states: 'We mostly follow the implementation in MAR (Li et al., 2024b), including the training/inference details of the MLP diffusion head, and the architecture design of MLP and the Transformer decoder.' and references 'https://github.com/LTH14/mar' in footnote 9 as the source for image MAR initialization. However, it does not explicitly state that the authors are releasing their own code for the specific methodology described in this paper. |
| Open Datasets | Yes | We train our model on AudioCaps (AC) (Kim et al., 2019) and WavCaps (WC) (Mei et al., 2024). |
| Dataset Splits | No | We train our model on AudioCaps (AC) (Kim et al., 2019) and WavCaps (WC) (Mei et al., 2024). ... We evaluate our model on the AC evaluation set... Audios longer than 10 seconds are randomly cropped to 10 seconds. That is, the number of text-audio pairs is the same after the pre-processing. While an evaluation set is mentioned and audio cropping is described, specific train/validation/test split percentages or sample counts for the datasets are not provided. |
| Hardware Specification | Yes | We train the Base model with 40 NVIDIA V100 GPUs, and the Large model requires 104. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and references specific prior works for diffusion process details and Transformer implementation (e.g., 'Transformer (Vaswani, 2017) implementation in ViT (Wang et al., 2021)'), but it does not provide specific version numbers for any software libraries or dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We use AdamW (Loshchilov & Hutter, 2017) optimizer with a fixed learning rate 1.0 × 10⁻⁴. We train the Base model with 40 NVIDIA V100 GPUs... Our effective batch size is 2048 10-second clips. We train the Base and the Large model for 1000 epochs, about 2 days and 5 days, respectively. ... We set ω0 = 7 as default ... we set τ = 1 as the default. |
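The KL divergence reported among the paper's metrics can be illustrated with a minimal, self-contained sketch. The toy distributions below are hypothetical placeholders, not the paper's evaluation pipeline (which computes KL over classifier label distributions for generated vs. reference audio):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as probability lists.

    Assumes p and q have equal length, each sums to 1, and q[i] > 0
    wherever p[i] > 0 (toy inputs for illustration only).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical label distributions for a generated clip vs. a reference clip.
p = [0.7, 0.2, 0.1]
q = [0.6, 0.3, 0.1]
print(kl_divergence(p, q))  # positive; KL is 0.0 only when p == q
```

Lower KL indicates that the generated audio's predicted label distribution more closely matches the reference, which is why the 40% relative KL reduction over AudioGen is reported as a gain.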