Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction

Authors: Shu-Wen Yang, Byeonggeun Kim, Kuan-Po Huang, Qingming Tang, Huy Phan, Bo-Ru Lu, Harshavardhan Sundar, Shalini Ghosh, Hung-Yi Lee, Chieh-Chi Kao, Chao Wang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our approach delivers significant improvements over the previous discrete solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Fréchet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. ... On AudioCaps, the innovation yields 41% and 33% relative FAD improvements over AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with the state-of-the-art (SOTA) diffusion models. ... 5. Experiments ... Table 1. Main results. FD, FAD, KL, IS, and CLAP metrics on the AudioCaps evaluation set. ... 5.3. Ablation studies
Researcher Affiliation Collaboration 1Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan 2Amazon AGI, Bellevue, United States. Correspondence to: Shu-wen Yang <EMAIL>, Chieh-Chi Kao <EMAIL>.
Pseudocode No The paper describes the methodology using textual explanations and figures (Figure 2, Figure 3) but does not include any explicitly labeled 'Pseudocode', 'Algorithm', or structured algorithm blocks.
Open Source Code No The paper states: 'We mostly follow the implementation in MAR (Li et al., 2024b), including the training/inference details of the MLP diffusion head, and the architecture design of MLP and the Transformer decoder.' and references 'https://github.com/LTH14/mar' in footnote 9 as the source for image MAR initialization. However, it does not explicitly state that the authors are releasing their own code for the specific methodology described in this paper.
Open Datasets Yes We train our model on AudioCaps (AC) (Kim et al., 2019) and WavCaps (WC) (Mei et al., 2024).
Dataset Splits No We train our model on AudioCaps (AC) (Kim et al., 2019) and WavCaps (WC) (Mei et al., 2024). ... We evaluate our model on the AC evaluation set... Audios longer than 10 seconds are randomly cropped into 10-second clips. That is, the number of text-audio pairs is the same after the pre-processing. While an evaluation set is mentioned and audio cropping is described, specific train/validation/test split percentages or sample counts for the datasets are not provided.
Hardware Specification Yes We train the Base model with 40 NVIDIA V100 GPUs, and the Large model requires 104.
Software Dependencies No The paper mentions using the AdamW optimizer and references specific prior works for diffusion process details and Transformer implementation (e.g., 'Transformer (Vaswani, 2017) implementation in ViT (Wang et al., 2021)'), but it does not provide specific version numbers for any software libraries or dependencies like Python, PyTorch, or CUDA.
Experiment Setup Yes We use AdamW (Loshchilov & Hutter, 2017) optimizer with a fixed learning rate 1.0 × 10−4. We train the Base model with 40 NVIDIA V100 GPUs... Our effective batch size is 2048 10-second clips. We train the Base and the Large model for 1000 epochs, about 2 days and 5 days, respectively. ... We set ω0 = 7 as default ... we set τ = 1 as the default.
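The experiment setup quotes AdamW with a fixed learning rate of 1.0 × 10−4. As a reference for what that optimizer actually computes, the following is a minimal sketch of a single AdamW update on one scalar parameter. Only the learning rate comes from the paper; `beta1`, `beta2`, `eps`, and `weight_decay` are standard defaults assumed here, since the paper does not list them.

```python
import math

def adamw_step(param, grad, m, v, t, lr=1e-4,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter at step t (1-indexed).

    Only lr=1e-4 is taken from the paper; the remaining
    hyperparameters are assumed common defaults.
    """
    # Decoupled weight decay: shrink the parameter directly,
    # independently of the gradient moments (AdamW's change vs. Adam).
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialized moment estimates.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Gradient-based update with the fixed learning rate.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# One step from param=1.0 with gradient 0.5.
p, m, v = adamw_step(1.0, 0.5, 0.0, 0.0, t=1)
```

Because the learning rate is fixed (no warmup or decay schedule is quoted), every step applies this same rule with `lr=1e-4` unchanged throughout the 1000 training epochs.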