Music Foundation Model as Generic Booster for Music Downstream Tasks

Authors: Wei-Hsiang Liao, Yuhta Takida, Yukara Ikemiya, Zhi Zhong, Chieh-Hsin Lai, Giorgio Fabbro, Kazuki Shimada, Keisuke Toyama, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Shusuke Takahashi, Stefan Uhlich, Taketo Akama, Woosung Choi, Yuichiro Koyama, Yuki Mitsufuji

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity.
Researcher Affiliation | Industry | 1. Sony AI, Tokyo, Japan; 2. Sony Group Corporation, Tokyo, Japan; 3. Sony Europe B.V., Stuttgart, Germany; 4. Sony CSL, Tokyo, Japan
Pseudocode | No | The paper describes model architectures with figures (e.g., Figure 2: the two stages of SoniDo) and mathematical equations for its components (e.g., the training objective in Section 2.1). However, it does not include a clearly labeled pseudocode block or algorithm steps formatted like code.
Open Source Code | No | The paper mentions baselines built on open-source projects (e.g., "HTDemucs (default): Model with default settings (Dora5 signature 955717e8)"), but it does not provide an explicit statement about releasing the source code for the SoniDo model or methodology described in this paper, nor does it provide a direct link to its own code repository.
Open Datasets | Yes | Performance evaluation was done by benchmarking with representative tasks from understanding to generative tasks: music tagging, music transcription, music source separation, and music mixing... [various datasets mentioned and cited, including] MagnaTagATune (MTAT) (Law et al., 2009), NSynth (Engel et al., 2017), EmoMusic (Soleymani et al., 2013), GTZAN (Tzanetakis & Cook, 2002), GiantSteps (Knees et al., 2015; Korzeniowski & Widmer, 2017), VocalSet (Wilkins et al., 2018), MAPS (Emiya et al., 2010), MUSDB18 (Rafii et al., 2017), MDXDB21 (Mitsufuji et al., 2022; Fabbro et al., 2023), MusicCaps (Agostinelli et al., 2023), URMP (Li et al., 2019), Bach10 (Duan et al., 2010), GuitarSet (Xi et al., 2018), Su (Su & Yang, 2016), and TRIOS (Fritsch, 2012).
Dataset Splits | Yes | When the models were trained with scarce data, the performance of the models using the SoniDo, MusicGen, and Jukebox features was superior to that of the model using the spectrogram only. When the training data size was 50% or 25%, the performance of the models using the SoniDo features was still comparable to the baseline model trained with 100% of the data. ... From MUSDB18, 86 songs were used for training, and 14 and 50 for validation and testing, respectively.
Hardware Specification | Yes | We trained for 50 epochs on one A100 graphics processing unit (GPU)... We trained the models for 50 epochs on one A100 GPU.
Software Dependencies | No | The paper mentions using "scikit-learn (Pedregosa et al., 2011) and mir_eval (Raffel et al., 2014) for metric computation" and PyTorch's ReduceLROnPlateau for learning-rate scheduling. It also mentions the Adam (Kingma & Ba, 2015) and AdamW (Loshchilov & Hutter, 2017) optimizers. However, specific version numbers for these software libraries (e.g., the scikit-learn or PyTorch version) are not provided in the text.
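Because no library versions are pinned, reproducing the metric computation requires reimplementing it against current releases. A minimal sketch of tagging-metric computation with scikit-learn follows (mir_eval would be used analogously for the transcription and separation metrics); the label and score arrays here are hypothetical placeholders, not data from the paper:

```python
# Sketch of tagging-metric computation with scikit-learn, as referenced
# in the paper's evaluation setup. Arrays below are hypothetical.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Multi-label tagging: rows = clips, columns = tags (e.g., MTAT-style).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.8],
                    [0.1, 0.7, 0.3],
                    [0.8, 0.6, 0.2],
                    [0.2, 0.1, 0.9]])

# Macro-averaged ROC-AUC and mAP, the usual music-tagging metrics.
auc = roc_auc_score(y_true, y_score, average="macro")
m_ap = average_precision_score(y_true, y_score, average="macro")
print(f"ROC-AUC: {auc:.3f}, mAP: {m_ap:.3f}")
```

Without pinned versions, small metric differences across scikit-learn releases are possible, which is exactly the reproducibility gap this row flags.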
Experiment Setup | Yes | The batch size was set to 256 for MTAT and NSynth because of their large amounts of data, and 64 for the others. Unless specifically mentioned, the learning rate was 5e-5. ... trained for 50 epochs on one A100 graphics processing unit (GPU), using the Adam (Kingma & Ba, 2015) optimizer with a learning rate of 1e-4. PyTorch's ReduceLROnPlateau was used for learning-rate scheduling with default parameters. ... The default batch size is 32, which corresponds to 4 samples per GPU, as we trained in parallel on 8 GPUs. The default number of training epochs is 360. ... the number of training epochs was increased to 720. Additionally, to match the random remixing augmentation of the default HTDemucs model, we added 860 random mixes... For the default networks, we used the suggested initial learning rate of 1e-3. For models involving SoniDo, to ensure stability, we set the initial learning rate to 1e-4. ... The loss function corresponds to the stereo-invariant loss that Martínez-Ramírez et al. (2022) reported as the best-performing, which they referred to as Lb, and consists of A-weighting pre-emphasis and low-pass finite impulse response filters, the L2-norm on the spectral magnitude, and the L1-norm on the spectral log-magnitude.
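As a rough sketch, the optimizer and scheduler hyperparameters quoted above map onto the following PyTorch configuration. Only the hyperparameters (Adam, lr = 1e-4, ReduceLROnPlateau with defaults, 50 epochs) come from the text; the model and the validation loss are hypothetical stand-ins:

```python
# Config sketch: Adam at lr=1e-4 with PyTorch's ReduceLROnPlateau
# (default parameters), per the quoted setup. `model` is a placeholder
# stand-in for a downstream head, not the paper's architecture.
import torch

model = torch.nn.Linear(128, 10)  # hypothetical downstream model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)

for epoch in range(50):  # "trained for 50 epochs on one A100 GPU"
    # ... one epoch of training and validation would go here ...
    val_loss = 0.0  # placeholder validation loss
    scheduler.step(val_loss)  # reduce lr when validation plateaus
```

This is only a configuration fragment; the batch sizes (256/64/32), epoch counts (360/720), and the stereo-invariant Lb loss described above would each need their own task-specific implementation.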