FLAM: Frame-Wise Language-Audio Modeling
Authors: Yusong Wu, Christos Tsirigotis, Ke Chen, Cheng-Zhi Anna Huang, Aaron Courville, Oriol Nieto, Prem Seetharaman, Justin Salamon
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks. |
| Researcher Affiliation | Collaboration | 1Adobe Research 2Mila Quebec AI Institute, Université de Montréal 3Massachusetts Institute of Technology 4Canada CIFAR AI Chair. Correspondence to: Yusong Wu <EMAIL>, Justin Salamon <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in prose and mathematical formulations (e.g., L_SED and h(x, l, y) in Section 3.1), but does not present a distinct pseudocode or algorithm block. |
| Open Source Code | No | The paper mentions releasing a dataset for benchmarking: 'We publicly release the ASFX-SED dataset for future benchmarking: http://flam-model.github.io/asfx-sed.html.', but does not explicitly provide a link to the source code for the FLAM model or state that it will be released. |
| Open Datasets | Yes | We gather a large mix of licensed proprietary sound effect datasets and publicly available CC-licensed general audio datasets, consisting of approximately 1.1M audio samples with corresponding metadata... Additionally, we sample from synthetic SED data, AudioSet-strong, URBAN-SED, and DESED with weights (0.5, 0.5, 0.1, 0.1). We publicly release the ASFX-SED dataset for future benchmarking: http://flam-model.github.io/asfx-sed.html. |
| Dataset Splits | Yes | We generate 1 million mixtures for training using the augmentation procedure, where each mixture has a length of 10 seconds. We hold out 5k backgrounds and 10k events, and make 10k mixtures from the held-out events as our primary test set (Held-out). For additional evaluation of generalization, we create another 10k test mixtures, ASFX-SED, using sound effects from the Adobe Audition SFX Library (ASFX) (Wilkins et al., 2023) that were entirely unseen during training. |
| Hardware Specification | No | The paper mentions 'each of the N_GPU GPUs processes its local subset of audio frames and text prompts' in the context of memory-efficient training, but does not provide any specific details about the GPU models, CPU, or other hardware used for the experiments. |
| Software Dependencies | No | The paper mentions using specific models like HTSAT and RoBERTa, and optimizers like Adam, but it does not specify the version numbers of any software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions) required to replicate the experiment. |
| Experiment Setup | Yes | We use a batch size of 768, a learning rate of 10^-4, and an Adam optimizer with β1 = 0.9, β2 = 0.99. The learning rate schedule employs cosine warmup (3200 steps) followed by linear decay, for a total of 50,000 steps... We train FLAM with a batch size of 512 and a learning rate of 10^-4, using Adam (β1 = 0.9, β2 = 0.99) and the same warmup-then-decay schedule with 3200 steps of warmup and train 120,000 steps. |
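The learning-rate schedule quoted above (cosine warmup for 3,200 steps, then linear decay over the remaining steps) is only described in prose in the paper. A minimal sketch of one plausible reading, assuming the warmup ramps from 0 to the base rate along a cosine curve and the decay ends at 0 at the final step (the exact endpoints are not stated in the paper):

```python
import math

def flam_lr_schedule(step, base_lr=1e-4, warmup_steps=3200, total_steps=50_000):
    """Hypothetical reconstruction of the schedule described in the paper:
    cosine warmup to `base_lr`, then linear decay to zero."""
    if step < warmup_steps:
        # Cosine ramp from 0 up to base_lr over the warmup window.
        return base_lr * 0.5 * (1.0 - math.cos(math.pi * step / warmup_steps))
    # Linear decay from base_lr down to 0 over the remaining steps.
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * (1.0 - frac)
```

For the FLAM training run itself the quote gives the same shape with `total_steps=120_000`; swapping that argument covers both configurations.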