Compose with Me: Collaborative Music Inpainter for Symbolic Music Infilling
Authors: Zhejing Hu, Yan Liu, Gong Chen, Bruce X.B. Yu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results confirm CMI's superior performance in music infilling, demonstrating its efficiency in producing high-quality music. We utilize two datasets for training: Bread (Peng et al. 2023) and LMD (Raffel 2016). We also conducted a case study using the Pop909 dataset (Wang et al. 2020) to demonstrate the robustness of our model. The evaluation covers objective metrics, subjective metrics, comparison with SOTA models, and an ablation analysis. |
| Researcher Affiliation | Academia | 1 Department of Computing, The Hong Kong Polytechnic University; 2 Zhejiang University-University of Illinois Urbana-Champaign Institute |
| Pseudocode | No | The paper describes the methodology in detail using prose and mathematical equations but does not include any explicit pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Code: https://github.com/hu-music/compose_with_me |
| Open Datasets | Yes | We utilize two datasets for training: Bread (Peng et al. 2023) and LMD (Raffel 2016). The Bread dataset, containing 851,313 polyphonic MIDI files, stands as one of the largest MIDI datasets currently accessible online. Additionally, the LMD dataset contributes an extra 176,581 polyphonic MIDI files... We also conducted a case study using the Pop909 dataset (Wang et al. 2020). |
| Dataset Splits | No | The paper mentions training on the Bread and LMD datasets and conducting a case study on Pop909, but it does not provide specific details on how these datasets were split into training, validation, and testing sets. It mentions a masking strategy for infilling where the ratio varies between 0.1 and 0.8 for each sample, but this is not an overall dataset split. |
| Hardware Specification | No | For model training, we use the RWKV-4 560M model as our generative architecture backbone on 4 GPUs. This statement mentions the quantity of GPUs but lacks specific model numbers (e.g., NVIDIA A100, Tesla V100) or other hardware details like CPU type or memory. |
| Software Dependencies | Yes | For model training, we use the RWKV-4 560M model as our generative architecture backbone... The context encoder is a Transformer-based model with a depth of 6 and an encoder embedding size of 512. The predictor is a Transformer-based model with a depth of 3 and an encoder embedding size of 256. |
| Experiment Setup | Yes | Each sequence of music is set to a length of 4096 tokens. For the masking strategy, we randomly mask tokens out of the 4096, with the ratio varying between 0.1 and 0.8 for each sample. In addition, we set the hyperparameter α to 0.1. Each model is trained with an initial learning rate of 1e-6 and a batch size of 4 for 100 epochs, following the guidelines from RWKV Music. The context encoder is a Transformer-based model with a depth of 6 and an encoder embedding size of 512. The predictor is a Transformer-based model with a depth of 3 and an encoder embedding size of 256. |
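The masking strategy quoted above (4096-token sequences, a per-sample masking ratio drawn from 0.1 to 0.8) can be sketched as follows. This is a minimal illustration, not the authors' code: the function and parameter names (`mask_tokens`, `mask_id`) are hypothetical, and the paper does not specify whether masking is token-level or span-level, so a simple per-token scheme is assumed here.

```python
import random

def mask_tokens(tokens, mask_id, min_ratio=0.1, max_ratio=0.8, seed=None):
    """Replace a random subset of tokens with `mask_id`.

    The masking ratio itself is drawn uniformly per sample from
    [min_ratio, max_ratio], matching the setup described in the paper.
    Returns the masked sequence and the sorted masked positions.
    """
    rng = random.Random(seed)
    ratio = rng.uniform(min_ratio, max_ratio)
    n_mask = int(len(tokens) * ratio)
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    for i in positions:
        masked[i] = mask_id  # hypothetical special mask token id
    return masked, sorted(positions)

# Toy usage: a 4096-token sequence, mask token id 0 (assumed).
seq = list(range(1, 4097))
masked, positions = mask_tokens(seq, mask_id=0, seed=7)
```

With this per-sample ratio, the number of masked tokens varies between roughly 410 and 3,276 out of 4096, so the model sees both lightly and heavily masked contexts during training.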