Compose with Me: Collaborative Music Inpainter for Symbolic Music Infilling

Authors: Zhejing Hu, Yan Liu, Gong Chen, Bruce X.B. Yu

AAAI 2025

Reproducibility Assessment

Variable | Result | LLM Response
Research Type | Experimental | Experimental results confirm CMI's superior performance in music infilling, demonstrating its efficiency in producing high-quality music. We utilize two datasets for training: Bread (Peng et al. 2023) and LMD (Raffel 2016). We also conducted a case study using the Pop909 dataset (Wang et al. 2020) to demonstrate the robustness of our model. The evaluation covers objective metrics, subjective metrics, comparison with SOTA models, and an ablation analysis.
Researcher Affiliation | Academia | 1 Department of Computing, The Hong Kong Polytechnic University; 2 Zhejiang University-University of Illinois Urbana-Champaign Institute. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology in detail using prose and mathematical equations but does not include any explicit pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Code: https://github.com/hu-music/compose_with_me
Open Datasets | Yes | We utilize two datasets for training: Bread (Peng et al. 2023) and LMD (Raffel 2016). The Bread dataset, containing 851,313 polyphonic MIDI files, stands as one of the largest MIDI datasets currently accessible online. Additionally, the LMD dataset contributes an extra 176,581 polyphonic MIDI files... We also conducted a case study using the Pop909 dataset (Wang et al. 2020).
Dataset Splits | No | The paper mentions training on the Bread and LMD datasets and conducting a case study on Pop909, but it does not provide specific details on how these datasets were split into training, validation, and test sets. It mentions a masking strategy for infilling in which the ratio varies between 0.1 and 0.8 per sample, but this is a per-sample masking ratio, not a dataset split.
Hardware Specification | No | "For model training, we use the RWKV-4 560M model as our generative architecture backbone on 4 GPUs." This statement gives the number of GPUs but lacks specific model names (e.g., NVIDIA A100, Tesla V100) or other hardware details such as CPU type or memory.
Software Dependencies | Yes | For model training, we use the RWKV-4 560M model as our generative architecture backbone... The context encoder is a Transformer-based model with a depth of 6 and an encoder embedding size of 512. The predictor is a Transformer-based model with a depth of 3 and an encoder embedding size of 256.
Experiment Setup | Yes | Each sequence of music is set to a length of 4096 tokens. For the masking strategy, we randomly mask tokens out of the 4096, with the ratio varying between 0.1 and 0.8 for each sample. In addition, we set the hyperparameter α to 0.1. Each model is trained with an initial learning rate of 1e-6 and a batch size of 4 for 100 epochs, following the guidelines from RWKV Music. The context encoder is a Transformer-based model with a depth of 6 and an encoder embedding size of 512. The predictor is a Transformer-based model with a depth of 3 and an encoder embedding size of 256.
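The per-sample masking strategy quoted above (a random 0.1-0.8 fraction of the 4096 tokens masked for infilling) can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the mask-token id (`MASK_ID`) and the function name `mask_sequence` are hypothetical, not taken from the paper's released code.

```python
import random

MASK_ID = 0      # assumed mask-token id, not from the paper
SEQ_LEN = 4096   # sequence length stated in the experiment setup

def mask_sequence(tokens, rng=random):
    """Mask a random fraction (0.1-0.8, drawn per sample) of the tokens."""
    ratio = rng.uniform(0.1, 0.8)
    n_mask = int(len(tokens) * ratio)
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK_ID
    return masked, sorted(positions)

# Example: mask one 4096-token sequence (token ids start at 1 here,
# so MASK_ID never collides with a real token in this toy setup).
tokens = list(range(1, SEQ_LEN + 1))
masked, positions = mask_sequence(tokens)
assert int(0.1 * SEQ_LEN) <= len(positions) <= int(0.8 * SEQ_LEN)
```

The model would then be trained to reconstruct the original tokens at the masked positions from the surrounding context, which is the standard formulation for infilling objectives.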