Compose with Me: Collaborative Music Inpainter for Symbolic Music Infilling

Authors: Zhejing Hu, Yan Liu, Gong Chen, Bruce X.B. Yu

AAAI 2025

Reproducibility Assessment

Variable | Result | LLM Response
Research Type | Experimental | Experimental results confirm CMI's superior performance in music infilling, demonstrating its efficiency in producing high-quality music. We utilize two datasets for training: Bread (Peng et al. 2023) and LMD (Raffel 2016). We also conducted a case study using the Pop909 dataset (Wang et al. 2020) to demonstrate the robustness of our model. The evaluation covers objective metrics, subjective metrics, comparison with SOTA models, and an ablation analysis.
Researcher Affiliation | Academia | 1 Department of Computing, The Hong Kong Polytechnic University; 2 Zhejiang University-University of Illinois Urbana-Champaign Institute. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology in detail using prose and mathematical equations but does not include any explicit pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Code: https://github.com/hu-music/compose_with_me
Open Datasets | Yes | We utilize two datasets for training: Bread (Peng et al. 2023) and LMD (Raffel 2016). The Bread dataset, containing 851,313 polyphonic MIDI files, stands as one of the largest MIDI datasets currently accessible online. Additionally, the LMD dataset contributes an extra 176,581 polyphonic MIDI files... We also conducted a case study using the Pop909 dataset (Wang et al. 2020).
Dataset Splits | No | The paper mentions training on the Bread and LMD datasets and conducting a case study on Pop909, but it does not provide specific details on how these datasets were split into training, validation, and test sets. It mentions a masking strategy for infilling in which the ratio varies between 0.1 and 0.8 per sample, but this is a per-sample masking ratio, not a dataset split.
Hardware Specification | No | "For model training, we use the RWKV-4 560M model as our generative architecture backbone on 4 GPUs." This statement gives the number of GPUs but lacks specific model names (e.g., NVIDIA A100, Tesla V100) or other hardware details such as CPU type or memory.
Software Dependencies | Yes | For model training, we use the RWKV-4 560M model as our generative architecture backbone... The context encoder is a Transformer-based model with a depth of 6 and an encoder embedding size of 512. The predictor is a Transformer-based model with a depth of 3 and an encoder embedding size of 256.
Experiment Setup | Yes | Each sequence of music is set to a length of 4096 tokens. For the masking strategy, we randomly mask tokens out of the 4096, with the ratio varying between 0.1 and 0.8 for each sample. In addition, we set the hyperparameter α to 0.1. Each model is trained with an initial learning rate of 1e-6 and a batch size of 4 for 100 epochs, following the guidelines from RWKV Music. The context encoder is a Transformer-based model with a depth of 6 and an encoder embedding size of 512. The predictor is a Transformer-based model with a depth of 3 and an encoder embedding size of 256.
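The per-sample masking strategy quoted above (a random 0.1-0.8 fraction of the 4096 tokens masked for infilling) can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the mask-token id (`MASK_ID`) and the function name `mask_sequence` are hypothetical, not taken from the paper's released code.

```python
import random

MASK_ID = 0      # assumed mask-token id, not from the paper
SEQ_LEN = 4096   # sequence length stated in the experiment setup

def mask_sequence(tokens, rng=random):
    """Mask a random fraction (0.1-0.8, drawn per sample) of the tokens."""
    ratio = rng.uniform(0.1, 0.8)
    n_mask = int(len(tokens) * ratio)
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK_ID
    return masked, sorted(positions)

# Example: mask one 4096-token sequence (token ids start at 1 here,
# so MASK_ID never collides with a real token in this toy setup).
tokens = list(range(1, SEQ_LEN + 1))
masked, positions = mask_sequence(tokens)
assert int(0.1 * SEQ_LEN) <= len(positions) <= int(0.8 * SEQ_LEN)
```

The model would then be trained to reconstruct the original tokens at the masked positions from the surrounding context, which is the standard formulation for infilling objectives.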