ADIFF: Explaining audio difference using natural language
Authors: Soham Deshmukh, Shuo Han, Rita Singh, Bhiksha Raj
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper stands out as the first work to comprehensively study the task of explaining audio differences and to propose benchmarks and baselines for the task. First, we present two new datasets for audio difference explanation derived from the AudioCaps and Clotho audio captioning datasets. Using Large Language Models (LLMs), we generate three levels of difference explanations: (1) concise descriptions of audio events and objects, (2) brief sentences about audio events, acoustic scenes, and signal properties, and (3) comprehensive explanations that include semantics and listener emotions. For the baseline, we use prefix tuning where audio embeddings from two audio files are used to prompt a frozen language model. Our empirical analysis and ablation studies reveal that the naive baseline struggles to distinguish perceptually similar sounds and generate detailed tier 3 explanations. To address these limitations, we propose ADIFF, which introduces a cross-projection module, position captioning, and a three-step training process to enhance the model's ability to produce detailed explanations. We evaluate our model using objective metrics and human evaluation and show our model enhancements lead to significant improvements in performance over the naive baseline and the SoTA Audio-Language Model (ALM) Qwen-Audio. Lastly, we conduct multiple ablation studies to study the effects of cross-projection, language model parameters, position captioning, and third-stage fine-tuning, and present our findings. |
| Researcher Affiliation | Academia | Soham Deshmukh Shuo Han Rita Singh Bhiksha Raj Carnegie Mellon University EMAIL |
| Pseudocode | No | The paper describes the model architecture and training process in sections 3.1, 3.2, and 3.3, and includes mathematical equations (e.g., Equation 1, 2, 3), but these descriptions are in prose and not presented as structured pseudocode or algorithm blocks with dedicated labels such as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | The checkpoint will be publicly released. Dataset and pretrained model are available at https://github.com/soham97/ADIFF |
| Open Datasets | Yes | First, we present two new datasets for audio difference explanation derived from the AudioCaps and Clotho audio captioning datasets. ... We source the audio recordings from the AudioCaps and Clotho V2 datasets. ... The study utilizes publicly available datasets, namely AudioCaps (Kim et al. (2019)) and Clotho (Drossos et al. (2020)), which have been used in accordance with their respective licenses and ethical use guidelines. ... Dataset and pretrained model are available at https://github.com/soham97/ADIFF |
| Dataset Splits | Yes | The statistics of the AudioCaps Difference (ACD) and Clotho Difference (CLD) datasets across three tiers, with Train, Validation, and Test splits, are presented in Table 1. For example, in Tier 1, the ACD Train split has 48k examples whose explanations have a median length of 27 words, a maximum of 49, and a vocabulary size of 6528, while the CLD Train split has 19k examples whose explanations have a median length of 51 words, a maximum of 92, and a vocabulary size of 6462. |
| Hardware Specification | Yes | For stage 2, multimodal grounding, all models are trained for 30 epochs on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions using HTSAT and GPT2 models, and Adam Optimiser, but does not provide specific version numbers for any software libraries or dependencies (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | We use Adam Optimiser (Kingma & Ba (2015)) with warmup and step decay of the learning rate. For stage 2, multimodal grounding, all models are trained for 30 epochs... In the final stage of fine-tuning, we limit the training to 10 epochs... This training is conducted over a few epochs with a small learning rate of approximately 1e-6... In our experiments, we employed top-k and top-p decoding for better performance across all experiments. ... We set top-p to a high value (0.8) to limit the sampling of low-probability tokens. Additionally, we choose a top-k value of 3 |
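The baseline described in the Research Type row — prefix tuning, where embeddings from two audio files prompt a frozen language model — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the embedding sizes, the prefix length, and the single linear projection `audio_to_prefix` are all hypothetical stand-ins for the trained mapping network; in the actual system only such a projection would be trained while the language model stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
d_audio, d_model, prefix_len = 768, 1024, 8  # hypothetical dimensions

# Hypothetical learned projection: one audio embedding -> `prefix_len`
# soft-prompt vectors in the language model's embedding space.
W = rng.normal(scale=0.02, size=(d_audio, prefix_len * d_model))

def audio_to_prefix(audio_emb):
    return (audio_emb @ W).reshape(prefix_len, d_model)

# Two audio clips yield two prefixes; these are concatenated and prepended
# to the embedded text prompt before being fed to the frozen LM.
emb_a = rng.normal(size=d_audio)              # embedding of audio file A
emb_b = rng.normal(size=d_audio)              # embedding of audio file B
text_embs = rng.normal(size=(12, d_model))    # embedded prompt tokens

lm_input = np.concatenate(
    [audio_to_prefix(emb_a), audio_to_prefix(emb_b), text_embs], axis=0
)
# lm_input has shape (2 * prefix_len + 12, d_model) = (28, 1024)
```

The frozen LM then attends over both audio prefixes jointly, which is what lets it contrast the two recordings when generating the difference explanation.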
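The Experiment Setup row reports decoding with top-k = 3 and top-p = 0.8. A minimal sketch of how these two filters compose (top-k cut first, then the nucleus cut over the survivors) is below; the function name and the toy logits are illustrative, not from the paper.

```python
import numpy as np

def top_k_top_p_filter(logits, k=3, p=0.8):
    """Mask logits to the top-k tokens, then to the smallest subset of
    those whose cumulative probability exceeds p (nucleus sampling).
    Masked positions are set to -inf so they can never be sampled."""
    order = np.argsort(logits)[::-1]   # token indices, best first
    kept = order[:k]                   # top-k cut
    # softmax over the k survivors (numerically stable)
    probs = np.exp(logits[kept] - logits[kept].max())
    probs /= probs.sum()
    cum = np.cumsum(probs)
    # nucleus cut: keep tokens until cumulative mass exceeds p (>= 1 token)
    cutoff = np.searchsorted(cum, p) + 1
    kept = kept[:cutoff]
    masked = np.full_like(logits, -np.inf)
    masked[kept] = logits[kept]
    return masked

# Toy example: with k=3 the last two tokens are dropped immediately, and
# the p=0.8 threshold then prunes the third-best token as well.
filtered = top_k_top_p_filter(np.array([2.0, 1.0, 0.5, -1.0, -2.0]))
```

Setting p high (0.8) keeps most of the probability mass while still excluding low-probability tokens, matching the rationale quoted in the row above.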