MATS: An Audio Language Model under Text-only Supervision

Authors: Wen Wang, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that MATS, despite being trained exclusively on text data, achieves competitive performance compared to recent LALMs trained on large-scale audio-language pairs. We conduct a comprehensive evaluation of our model on both close-ended and open-ended tasks.
Researcher Affiliation | Academia | ¹Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, China; ²University of Chinese Academy of Sciences, China. Correspondence to: Ruibing Hou <EMAIL>.
Pseudocode | No | The paper describes methods using mathematical formulations and prose (e.g., Section 3.1 Formulations, Section 3.4 Santa Mechanism) but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is publicly available at https://github.com/wangwenbanban/MATS.
Open Datasets | Yes | MATS is designed to handle a wide range of audio tasks relying solely on text supervision, which makes existing audio datasets unsuitable for direct use in its training. In this work, we construct a new dataset, AudioTIA-5M... For the audio captioning task (CAP), we directly use audio caption annotations from public datasets, including Clotho-v2 (Drossos et al., 2020), AudioCaps (Kim et al., 2019), MusicCaps (Agostinelli et al., 2023), WavCaps (Mei et al., 2024) and MACS (Martín-Morató & Mesaros, 2021), as text and answer. To further enhance the model's understanding and reasoning capabilities for audio, we incorporate two open-ended datasets: OpenAQA (Gong et al., 2024) and MusicQA (Liu et al., 2024a).
Dataset Splits | Yes | Overall, AudioTIA-5M comprises two subsets: a 1.5M close-ended question subset and a 3.8M open-ended question subset. Our model is trained on the AudioTIA-5M dataset, which integrates both close-ended and open-ended tasks. Table S10 provides a detailed overview of the test benchmarks and corresponding evaluation metrics.
Hardware Specification | Yes | MATS-GPT2 underwent about 25 hours of training over 90,000 iterations on 2 A100 GPUs.
Software Dependencies | No | The paper mentions using models like GPT2 and LLaMA, and an AdamW optimizer, but does not specify software dependencies with version numbers (e.g., Python, PyTorch versions).
Experiment Setup | Yes | The audio is sampled at 44.1 kHz and converted into log-Mel spectrograms with 64 Mel bins, a hop size of 320 ms, and a window size of 1024 ms. All audio files are randomly truncated to 7 seconds in length for the CLAP audio encoder. The mapping module uses an 8-layer transformer with a prefix length of 40. Training is conducted using the AdamW optimizer with a learning rate of 5×10⁻⁵ and a linear learning rate scheduler with 2000 warmup steps. The batch size is set to 128. For the LoRA configuration, we set the rank to 8, the scaling factor to 4, and the dropout rate to 0.1. The hyperparameters are configured as follows: σ = 0.015, K = 100, L = 32, τ = 0.1 and λ = 0.3 (Equation 7).
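The training-schedule and LoRA numbers reported in the Experiment Setup row can be made concrete with a short sketch. This is not the authors' released code: the function names are ours, we assume the "scaling factor" of 4 plays the role of LoRA's alpha (so the effective update scale is alpha/rank, as in common LoRA implementations), and we assume the linear scheduler decays to zero over the ~90,000 iterations reported in the Hardware row.

```python
def lr_at_step(step, base_lr=5e-5, warmup=2000, total=90_000):
    """Linear warmup for `warmup` steps, then linear decay to zero.

    Mirrors the reported setup: AdamW with learning rate 5e-5 and a
    linear scheduler with 2000 warmup steps; the decay-to-zero tail
    over `total` steps is our assumption.
    """
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))


def lora_scale(rank=8, alpha=4):
    """Effective LoRA scaling alpha / rank applied to the low-rank
    update (rank 8, scaling factor 4, as reported)."""
    return alpha / rank
```

With these numbers the learning rate ramps to its 5×10⁻⁵ peak at step 2000 and the LoRA update is scaled by 0.5.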