MATS: An Audio Language Model under Text-only Supervision

Authors: Wen Wang, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that MATS, despite being trained exclusively on text data, achieves competitive performance compared to recent LALMs trained on large-scale audio-language pairs. We conduct a comprehensive evaluation of our model on both close-ended and open-ended tasks.
Researcher Affiliation | Academia | ¹Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, China; ²University of Chinese Academy of Sciences, China. Correspondence to: Ruibing Hou <EMAIL>.
Pseudocode | No | The paper describes methods using mathematical formulations and prose (e.g., Section 3.1 Formulations, Section 3.4 Santa Mechanism) but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is publicly available at https://github.com/wangwenbanban/MATS.
Open Datasets | Yes | MATS is designed to handle a wide range of audio tasks relying solely on text supervision, which makes existing audio datasets unsuitable for direct use in its training. In this work, we construct a new dataset, AudioTIA-5M... For the audio captioning task (CAP), we directly use audio caption annotations from public datasets, including Clotho-v2 (Drossos et al., 2020), AudioCaps (Kim et al., 2019), MusicCaps (Agostinelli et al., 2023), WavCaps (Mei et al., 2024) and MACS (Martín-Morató & Mesaros, 2021), as text and answer. To further enhance the model's understanding and reasoning capabilities for audio, we incorporate two open-ended datasets: OpenAQA (Gong et al., 2024) and MusicQA (Liu et al., 2024a).
Dataset Splits | Yes | Overall, AudioTIA-5M comprises two subsets: a 1.5M close-ended question subset and a 3.8M open-ended question subset. Our model is trained on the AudioTIA-5M dataset, which integrates both close-ended and open-ended tasks. Table S10 provides a detailed overview of the test benchmarks and corresponding evaluation metrics.
Hardware Specification | Yes | MATS-GPT2 underwent about 25 hours of training over 90,000 iterations on 2 A100 GPUs.
Software Dependencies | No | The paper mentions using models like GPT2 and LLaMA, and an AdamW optimizer, but does not specify software dependencies with version numbers (e.g., Python, PyTorch versions).
Experiment Setup | Yes | The audio is sampled at 44.1 kHz and converted into log-Mel spectrograms with 64 Mel bins, a hop size of 320 ms, and a window size of 1024 ms. All audio files are randomly truncated to 7 seconds in length for the CLAP audio encoder. The mapping module uses an 8-layer transformer with a prefix length of 40. Training is conducted using the AdamW optimizer with a learning rate of 5×10⁻⁵ and a linear learning rate scheduler with 2000 warmup steps. The batch size is set to 128. For the LoRA configuration, we set the rank to 8, the scaling factor to 4, and the dropout rate to 0.1. The hyperparameters are configured as follows: σ = 0.015, K = 100, L = 32, τ = 0.1 and λ = 0.3 (Equation 7).
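The training-schedule and LoRA numbers reported in the Experiment Setup row can be made concrete with a short sketch. This is not the authors' released code: the function names are ours, we assume the "scaling factor" of 4 plays the role of LoRA's alpha (so the effective update scale is alpha/rank, as in common LoRA implementations), and we assume the linear scheduler decays to zero over the ~90,000 iterations reported in the Hardware row.

```python
def lr_at_step(step, base_lr=5e-5, warmup=2000, total=90_000):
    """Linear warmup for `warmup` steps, then linear decay to zero.

    Mirrors the reported setup: AdamW with learning rate 5e-5 and a
    linear scheduler with 2000 warmup steps; the decay-to-zero tail
    over `total` steps is our assumption.
    """
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))


def lora_scale(rank=8, alpha=4):
    """Effective LoRA scaling alpha / rank applied to the low-rank
    update (rank 8, scaling factor 4, as reported)."""
    return alpha / rank
```

With these numbers the learning rate ramps to its 5×10⁻⁵ peak at step 2000 and the LoRA update is scaled by 0.5.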