ADIFF: Explaining audio difference using natural language
Authors: Soham Deshmukh, Shuo Han, Rita Singh, Bhiksha Raj
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper stands out as the first work to comprehensively study the task of explaining audio differences and to propose benchmarks and baselines for the task. First, we present two new datasets for audio difference explanation derived from the AudioCaps and Clotho audio captioning datasets. Using Large Language Models (LLMs), we generate three levels of difference explanations: (1) concise descriptions of audio events and objects, (2) brief sentences about audio events, acoustic scenes, and signal properties, and (3) comprehensive explanations that include semantics and listener emotions. For the baseline, we use prefix tuning where audio embeddings from two audio files are used to prompt a frozen language model. Our empirical analysis and ablation studies reveal that the naive baseline struggles to distinguish perceptually similar sounds and generate detailed tier 3 explanations. To address these limitations, we propose ADIFF, which introduces a cross-projection module, position captioning, and a three-step training process to enhance the model's ability to produce detailed explanations. We evaluate our model using objective metrics and human evaluation and show our model enhancements lead to significant improvements in performance over the naive baseline and the SoTA Audio-Language Model (ALM) Qwen-Audio. Lastly, we conduct multiple ablation studies to study the effects of cross-projection, language model parameters, position captioning, and third-stage fine-tuning, and present our findings. |
| Researcher Affiliation | Academia | Soham Deshmukh Shuo Han Rita Singh Bhiksha Raj Carnegie Mellon University EMAIL |
| Pseudocode | No | The paper describes the model architecture and training process in sections 3.1, 3.2, and 3.3, and includes mathematical equations (e.g., Equation 1, 2, 3), but these descriptions are in prose and not presented as structured pseudocode or algorithm blocks with dedicated labels such as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | The checkpoint will be publicly released. Dataset and pretrained model are available at https://github.com/soham97/ADIFF |
| Open Datasets | Yes | First, we present two new datasets for audio difference explanation derived from the AudioCaps and Clotho audio captioning datasets. ... We source the audio recordings from the AudioCaps and Clotho V2 datasets. ... The study utilizes publicly available datasets, namely AudioCaps (Kim et al. (2019)) and Clotho (Drossos et al. (2020)), which have been used in accordance with their respective licenses and ethical use guidelines. ... Dataset and pretrained model are available at https://github.com/soham97/ADIFF |
| Dataset Splits | Yes | The statistics of the AudioCaps Difference (ACD) and Clotho Difference (CLD) datasets across three tiers, with Train, Validation, and Test splits, are presented in Table 1. For example, in Tier 1, the ACD Train split has 48k examples whose explanations have a median length of 27 words, a maximum of 49, and a vocabulary size of 6528, while the CLD Train split has 19k examples whose explanations have a median length of 51 words, a maximum of 92, and a vocabulary size of 6462. |
| Hardware Specification | Yes | For stage 2, multimodal grounding, all models are trained for 30 epochs on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions using HTSAT and GPT2 models, and Adam Optimiser, but does not provide specific version numbers for any software libraries or dependencies (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | We use Adam Optimiser (Kingma & Ba (2015)) with warmup and step decay of the learning rate. For stage 2, multimodal grounding, all models are trained for 30 epochs... In the final stage of fine-tuning, we limit the training to 10 epochs... This training is conducted over a few epochs with a small learning rate of approximately 1e-6... In our experiments, we employed top-k and top-p decoding for better performance across all experiments. ... We set top-p to a high value (0.8) to limit the sampling of low-probability tokens. Additionally, we choose a top-k value of 3 |
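The baseline described in the Research Type row — prefix tuning, where embeddings from two audio files prompt a frozen language model — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the embedding sizes, the prefix length, and the single linear projection `audio_to_prefix` are all hypothetical stand-ins for the trained mapping network; in the actual system only such a projection would be trained while the language model stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
d_audio, d_model, prefix_len = 768, 1024, 8  # hypothetical dimensions

# Hypothetical learned projection: one audio embedding -> `prefix_len`
# soft-prompt vectors in the language model's embedding space.
W = rng.normal(scale=0.02, size=(d_audio, prefix_len * d_model))

def audio_to_prefix(audio_emb):
    return (audio_emb @ W).reshape(prefix_len, d_model)

# Two audio clips yield two prefixes; these are concatenated and prepended
# to the embedded text prompt before being fed to the frozen LM.
emb_a = rng.normal(size=d_audio)              # embedding of audio file A
emb_b = rng.normal(size=d_audio)              # embedding of audio file B
text_embs = rng.normal(size=(12, d_model))    # embedded prompt tokens

lm_input = np.concatenate(
    [audio_to_prefix(emb_a), audio_to_prefix(emb_b), text_embs], axis=0
)
# lm_input has shape (2 * prefix_len + 12, d_model) = (28, 1024)
```

The frozen LM then attends over both audio prefixes jointly, which is what lets it contrast the two recordings when generating the difference explanation.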
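The Experiment Setup row reports decoding with top-k = 3 and top-p = 0.8. A minimal sketch of how these two filters compose (top-k cut first, then the nucleus cut over the survivors) is below; the function name and the toy logits are illustrative, not from the paper.

```python
import numpy as np

def top_k_top_p_filter(logits, k=3, p=0.8):
    """Mask logits to the top-k tokens, then to the smallest subset of
    those whose cumulative probability exceeds p (nucleus sampling).
    Masked positions are set to -inf so they can never be sampled."""
    order = np.argsort(logits)[::-1]   # token indices, best first
    kept = order[:k]                   # top-k cut
    # softmax over the k survivors (numerically stable)
    probs = np.exp(logits[kept] - logits[kept].max())
    probs /= probs.sum()
    cum = np.cumsum(probs)
    # nucleus cut: keep tokens until cumulative mass exceeds p (>= 1 token)
    cutoff = np.searchsorted(cum, p) + 1
    kept = kept[:cutoff]
    masked = np.full_like(logits, -np.inf)
    masked[kept] = logits[kept]
    return masked

# Toy example: with k=3 the last two tokens are dropped immediately, and
# the p=0.8 threshold then prunes the third-best token as well.
filtered = top_k_top_p_filter(np.array([2.0, 1.0, 0.5, -1.0, -2.0]))
```

Setting p high (0.8) keeps most of the probability mass while still excluding low-probability tokens, matching the rationale quoted in the row above.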