FLAM: Frame-Wise Language-Audio Modeling
Authors: Yusong Wu, Christos Tsirigotis, Ke Chen, Cheng-Zhi Anna Huang, Aaron Courville, Oriol Nieto, Prem Seetharaman, Justin Salamon
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks. |
| Researcher Affiliation | Collaboration | 1Adobe Research 2Mila Quebec AI Institute, Université de Montréal 3Massachusetts Institute of Technology 4Canada CIFAR AI Chair. Correspondence to: Yusong Wu <EMAIL>, Justin Salamon <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in prose and mathematical formulations (e.g., L_SED and h(x, l, y) in Section 3.1), but does not present a distinct pseudocode or algorithm block. |
| Open Source Code | No | The paper mentions releasing a dataset for benchmarking: 'We publicly release the ASFX-SED dataset for future benchmarking: http://flam-model.github.io/asfx-sed.html.', but does not explicitly provide a link to the source code for the FLAM model or state that it will be released. |
| Open Datasets | Yes | We gather a large mix of licensed proprietary sound effect datasets and publicly available CC-licensed general audio datasets, consisting of approximately 1.1M audio samples with corresponding metadata... Additionally, we sample from synthetic SED data, AudioSet-strong, URBAN-SED, and DESED with weights (0.5, 0.5, 0.1, 0.1). We publicly release the ASFX-SED dataset for future benchmarking: http://flam-model.github.io/asfx-sed.html. |
| Dataset Splits | Yes | We generate 1 million mixtures for training using the augmentation procedure, where each mixture has a length of 10 seconds. We hold out 5k backgrounds and 10k events, and make 10k mixtures from the held-out events as our primary test set (Held-out). For additional evaluation of generalization, we create another 10k test mixtures, ASFX-SED, using sound effects from the Adobe Audition SFX Library (ASFX) (Wilkins et al., 2023) that were entirely unseen during training. |
| Hardware Specification | No | The paper mentions 'each of the N_GPU GPUs processes its local subset of audio frames and text prompts' in the context of memory-efficient training, but does not provide any specific details about the GPU models, CPU, or other hardware used for the experiments. |
| Software Dependencies | No | The paper mentions using specific models like HTSAT and RoBERTa, and optimizers like Adam, but it does not specify the version numbers of any software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions) required to replicate the experiment. |
| Experiment Setup | Yes | We use a batch size of 768, a learning rate of 10^-4, and an Adam optimizer with β1 = 0.9, β2 = 0.99. The learning rate schedule employs cosine warmup (3200 steps) followed by linear decay, for a total of 50,000 steps... We train FLAM with a batch size of 512 and a learning rate of 10^-4, using Adam (β1 = 0.9, β2 = 0.99) and the same warmup-then-decay schedule with 3200 steps of warmup and train 120,000 steps. |
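The learning-rate schedule quoted above (cosine warmup for 3,200 steps, then linear decay over the remaining steps) is only described in prose in the paper. A minimal sketch of one plausible reading, assuming the warmup ramps from 0 to the base rate along a cosine curve and the decay ends at 0 at the final step (the exact endpoints are not stated in the paper):

```python
import math

def flam_lr_schedule(step, base_lr=1e-4, warmup_steps=3200, total_steps=50_000):
    """Hypothetical reconstruction of the schedule described in the paper:
    cosine warmup to `base_lr`, then linear decay to zero."""
    if step < warmup_steps:
        # Cosine ramp from 0 up to base_lr over the warmup window.
        return base_lr * 0.5 * (1.0 - math.cos(math.pi * step / warmup_steps))
    # Linear decay from base_lr down to 0 over the remaining steps.
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * (1.0 - frac)
```

For the FLAM training run itself the quote gives the same shape with `total_steps=120_000`; swapping that argument covers both configurations.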