Supervised Contrastive Learning from Weakly-Labeled Audio Segments for Musical Version Matching

Authors: Joan Serrà, R. Oguz Araz, Dmitry Bogdanov, Yuki Mitsufuji

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we propose a method to learn from weakly annotated segments, together with a contrastive loss variant that outperforms well-studied alternatives. The former is based on pairwise segment distance reductions, while the latter modifies an existing loss following decoupling, hyper-parameter, and geometric considerations. With these two elements, we not only achieve state-of-the-art results in the standard track-level evaluation, but also obtain a breakthrough performance in a segment-level evaluation. We also perform an extensive ablation study to empirically compare the proposed methods with several alternatives, including additional reduction strategies and common contrastive losses.
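The quoted abstract mentions "pairwise segment distance reductions" as the core of the weakly-labeled learning method. A minimal pure-Python sketch of the general idea (the exact definitions of the paper's Rmin and Rbpwr-k reductions are not given in this excerpt, so the best-pairs variant below is an illustrative assumption: averaging the k smallest pairwise segment distances):

```python
import math

def pairwise_segment_distances(a, b):
    """Euclidean distance between every segment embedding of track a
    and every segment embedding of track b."""
    return [[math.dist(x, y) for y in b] for x in a]

def reduce_min(dmat):
    """Min reduction: the track distance is the closest segment pair."""
    return min(min(row) for row in dmat)

def reduce_best_pairs(dmat, k):
    """Illustrative best-pairs reduction (a stand-in for the paper's
    Rbpwr-k): average the k smallest pairwise segment distances."""
    flat = sorted(d for row in dmat for d in row)
    return sum(flat[:k]) / k

# Two toy tracks, each represented by 2-D segment embeddings.
track_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
track_b = [(0.1, 0.0), (2.0, 2.0), (1.0, 1.0)]
dmat = pairwise_segment_distances(track_a, track_b)
print(reduce_min(dmat))            # 0.1 (closest pair only)
print(reduce_best_pairs(dmat, 2))  # 0.5 (average of two closest pairs)
```

Reducing over pairwise segment distances is what lets the method learn from weak (track-level) labels: it never needs to know which specific segments of two versions actually correspond.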
Researcher Affiliation Collaboration 1Sony AI, 2Music Technology Group, Universitat Pompeu Fabra, 3Sony Group Corporation. Correspondence to: Joan Serrà <EMAIL>.
Pseudocode No The paper describes the methodology using mathematical formulations (e.g., equations for distance reduction and loss functions) and textual descriptions, but does not include any structured pseudocode or algorithm blocks.
Open Source Code Yes To facilitate understanding and reproduction, we share our code and model checkpoints in https://github.com/sony/clews.
Open Datasets Yes We train and evaluate all models on the publicly-available data sets Discogs VI-YT (DVI; Araz et al., 2024a) and SHS100k-v2 (SHS; Yu et al., 2020), using the predefined partitions.
Dataset Splits Yes We train and evaluate all models on the publicly-available data sets Discogs VI-YT (DVI; Araz et al., 2024a) and SHS100k-v2 (SHS; Yu et al., 2020), using the predefined partitions. ... In every epoch, we group all tracks into batches of 25 anchors and, for each of them, we uniformly sample with replacement 3 positives from the corresponding version group (excluding the anchor). Thus, we get an initial (track-based) batch size of 100. For every track in the batch, we uniformly sample 2.5 min from the full-length music track and create the aforementioned 8 segments per track. Thus, we get a final (segment-based) batch size of 800.
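The batching scheme quoted above (25 anchors, 3 positives sampled with replacement per anchor, 8 segments per track) can be sketched in pure Python. Names and the toy group structure below are illustrative, not from the paper's code:

```python
import random

def build_batch(version_groups, n_anchors=25, n_positives=3,
                segments_per_track=8, seed=0):
    """Sketch of the described batching: pick anchors from distinct
    version groups, then sample positives with replacement from each
    anchor's group, excluding the anchor itself.
    `version_groups` maps a group id to its list of track ids."""
    rng = random.Random(seed)
    eligible = [g for g in version_groups.values() if len(g) >= 2]
    batch = []
    for group in rng.sample(eligible, n_anchors):
        anchor = rng.choice(group)
        candidates = [t for t in group if t != anchor]
        positives = rng.choices(candidates, k=n_positives)  # with replacement
        batch.extend([anchor] + positives)
    return batch, len(batch) * segments_per_track

# Toy corpus: 40 version groups of 4 tracks each.
groups = {i: [f"g{i}_t{j}" for j in range(4)] for i in range(40)}
tracks, n_segments = build_batch(groups)
print(len(tracks))   # 100 tracks (25 anchors x 4)
print(n_segments)    # 800 segments
```

This reproduces the arithmetic in the quote: 25 anchors x (1 anchor + 3 positives) = 100 tracks, and 100 tracks x 8 segments = 800 segments per batch.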
Hardware Specification Yes Using this strategy, training CLEWS on SHS and DVI takes approximately 2 and 9 days, respectively, using two NVIDIA H100-80GB GPUs.
Software Dependencies Yes We apply a constant-Q transform (CQT) with 20 ms hop size, spanning 7 octaves (from a minimum frequency of 32.7 Hz), and with 12 bins per octave (we use the nnAudio library in non-trainable mode, with the rest of the parameters set as default). ... Unless stated otherwise, we use the default PyTorch parameters from version 2.3.1.
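The quoted CQT settings imply concrete parameter values. A small sketch deriving them (the sample rate is an assumption, since the excerpt does not state it; the nnAudio call in the comment is indicative only):

```python
def cqt_config(sr=22050, hop_ms=20.0, n_octaves=7,
               bins_per_octave=12, fmin=32.7):
    """Derive concrete CQT parameters from the settings quoted above.
    The sample rate `sr` is an assumption, not stated in the excerpt."""
    hop_length = round(sr * hop_ms / 1000.0)  # hop size in samples
    n_bins = n_octaves * bins_per_octave      # 7 octaves x 12 bins = 84
    fmax = fmin * 2 ** n_octaves              # top of the analyzed range
    return {"sr": sr, "hop_length": hop_length,
            "fmin": fmin, "n_bins": n_bins, "fmax": fmax}

cfg = cqt_config()
print(cfg["hop_length"], cfg["n_bins"])  # 441 84
# With nnAudio (not imported here), this would map to roughly:
# nnAudio.features.CQT(sr=cfg["sr"], hop_length=cfg["hop_length"],
#                      fmin=cfg["fmin"], n_bins=cfg["n_bins"],
#                      bins_per_octave=12, trainable=False)
```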
Experiment Setup Yes We train CLEWS with R+ = Rbpwr-5, R- = Rmin, γ = 5, and ε = 10^-6 as defaults, and study the effect of such choices in Sec. 5. Since test sets also contain tracks longer than the 2.5 min used for training, in CLEWS we use our proposed Rbpwr-10 for track matching, together with a segment hop size of 5 s. We train all models with Adam using a learning rate of 2·10^-4, following a reduce-on-plateau schedule with a 10-epoch patience and an annealing factor of 0.2. The only exception is in ablation experiments, where we train for 20 epochs featuring a final 5-epoch polynomial learning rate annealing. In every epoch, we group all tracks into batches of 25 anchors and, for each of them, we uniformly sample with replacement 3 positives from the corresponding version group (excluding the anchor). Thus, we get an initial (track-based) batch size of 100. ... We only use time stretch, pitch roll, and SpecAugment augmentations (Liu et al., 2023).
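The quoted learning-rate schedule (reduce-on-plateau, 10-epoch patience, annealing factor 0.2) is available as torch.optim.lr_scheduler.ReduceLROnPlateau in PyTorch; a minimal pure-Python sketch of its logic, under the assumption that a reduction fires once the monitored loss has failed to improve for more than `patience` consecutive epochs:

```python
class ReduceOnPlateau:
    """Minimal sketch of a reduce-on-plateau schedule: cut the learning
    rate by `factor` when the monitored loss has not improved for more
    than `patience` consecutive epochs."""
    def __init__(self, lr=2e-4, patience=10, factor=0.2):
        self.lr, self.patience, self.factor = lr, patience, factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor  # anneal
                self.bad_epochs = 0
        return self.lr

sched = ReduceOnPlateau()
# Loss improves once, then stalls for 11 epochs -> one LR cut.
lrs = [sched.step(l) for l in [1.0] + [1.0] * 11]
print(lrs[-1])  # 2e-4 * 0.2 = 4e-5
```

In the actual setup one would pass the validation loss to the scheduler once per epoch; the sketch only illustrates when the 0.2 annealing factor kicks in.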