MIDI-GPT: A Controllable Generative Model for Computer-Assisted Multitrack Music Composition

Authors: Philippe Pasquier, Jeff Ens, Nathan Fradet, Paul Triana, Davide Rizzotti, Jean-Baptiste Rolland, Maryam Safi

AAAI 2025

Reproducibility Variable — Result. LLM Response
Research Type — Experimental. "We present experimental results that demonstrate that MIDI-GPT is able to consistently avoid duplicating the musical material it was trained on, generate music that is stylistically similar to the training dataset, and that attribute controls allow enforcing various constraints on the generated material. We also outline several real-world applications of MIDI-GPT, including collaborations with industry partners that explore the integration and evaluation of MIDI-GPT into commercial products, as well as several artistic works produced using it."
Researcher Affiliation — Collaboration. Philippe Pasquier¹, Jeff Ens¹, Nathan Fradet¹, Paul Triana¹, Davide Rizzotti¹, Jean-Baptiste Rolland², Maryam Safi². ¹Metacreation Lab, Simon Fraser University, Vancouver, Canada; ²Steinberg Media Technologies GmbH, Hamburg, Germany.
Pseudocode — No. No explicit pseudocode or algorithm blocks are provided in the main text; Figures 1 and 2 illustrate tokenization schemes but are not pseudocode.
Open Source Code — Yes. "MIDI-GPT has been released and is seeing real-world usage in several contexts, which directly supports our assertion that MIDI-GPT is a practical model for computer-assisted composition." The release page (https://www.metacreation.net/projects/mmm) links to models and various examples of generations. "We present MIDI-GPT, a style-agnostic generative system released as an Open RAIL-M licensed MMM model (Ens and Pasquier 2020)."
Open Datasets — Yes. "We use the new GigaMIDI (Lee et al. 2024) dataset, which builds on the MetaMIDI dataset (Ens and Pasquier 2021), to train with a split of: p_train = 80%, p_valid = 10%, and p_test = 10%."
Dataset Splits — Yes. "We use the new GigaMIDI (Lee et al. 2024) dataset, which builds on the MetaMIDI dataset (Ens and Pasquier 2021), to train with a split of: p_train = 80%, p_valid = 10%, and p_test = 10%."
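The 80/10/10 partitioning could be reproduced with a sketch like the one below. The shuffling procedure, the seed, and the `split_dataset` helper name are assumptions for illustration; the paper does not describe how files are assigned to partitions.

```python
import random

def split_dataset(paths, p_train=0.8, p_valid=0.1, seed=0):
    """Shuffle a list of MIDI file paths and cut it into
    train/valid/test partitions (remainder goes to test).
    Hypothetical helper; the paper only states the 80/10/10 ratios."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * p_train)
    n_valid = int(n * p_valid)
    return (paths[:n_train],
            paths[n_train:n_train + n_valid],
            paths[n_train + n_valid:])

train, valid, test = split_dataset([f"file_{i}.mid" for i in range(100)])
print(len(train), len(valid), len(test))  # → 80 10 10
```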
Hardware Specification — Yes. "Training to convergence typically takes 2-3 days using 4 V100 GPUs."
Software Dependencies — No. "Our model is built on the GPT-2 architecture (Radford et al. 2019), implemented using the Hugging Face Transformers library (Wolf et al. 2020). This tokenization is implemented in MidiTok (Fradet et al. 2021) for ease of use." No specific version numbers are provided for these libraries.
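To make the tokenization row concrete, here is a toy sketch of an MMM-style multitrack token stream, where each track is serialized with explicit track and bar delimiters. The token names and the `track_to_tokens` helper are illustrative placeholders, not the actual vocabulary used by MIDI-GPT or MidiTok.

```python
# Toy illustration of an MMM-style multitrack token stream.
# Token names are placeholders, NOT MidiTok's real vocabulary.
def track_to_tokens(instrument, bars):
    """Serialize one track (a list of bars, each a list of
    (pitch, duration) pairs) into a flat token sequence."""
    tokens = ["TRACK_START", f"INSTRUMENT={instrument}"]
    for bar in bars:
        tokens.append("BAR_START")
        for pitch, duration in bar:
            tokens += [f"NOTE_ON={pitch}", f"DURATION={duration}"]
        tokens.append("BAR_END")
    tokens.append("TRACK_END")
    return tokens

# One piano track (program 0) with a single bar holding C4 and E4.
print(track_to_tokens(0, [[(60, 4), (64, 4)]]))
```

Because every track is a self-contained delimited span, a model trained on such streams can condition generation on any subset of existing tracks, which is the basis of MMM-style bar- and track-level infilling.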
Experiment Setup — Yes. "The configuration of this model includes 8 attention heads and 6 layers, utilizing an embedding size of 512 and an attention window encompassing 2048 tokens. This results in approximately 20 million parameters. For each batch, we pick 32 random MIDI files (batch size)... We train with the Adam optimizer, a learning rate of 10⁻⁴, without dropout."
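The reported ~20 million parameters can be sanity-checked with a back-of-envelope count for a 6-layer, 512-dimensional GPT-2-style decoder. Biases and layer norms are ignored, and the vocabulary size of 500 is an assumption, since the paper does not state the tokenizer vocabulary.

```python
def transformer_param_estimate(n_layers=6, d_model=512, n_ctx=2048, vocab=500):
    """Rough parameter count for a GPT-2-style decoder, ignoring
    biases and layer norms. vocab=500 is an assumed tokenizer size."""
    attn = 4 * d_model * d_model           # Q, K, V and output projections
    mlp = 2 * d_model * (4 * d_model)      # up- and down-projection (4x inner dim)
    per_layer = attn + mlp
    embeddings = vocab * d_model + n_ctx * d_model  # token + positional tables
    return n_layers * per_layer + embeddings

print(transformer_param_estimate())  # → 20178944, i.e. ~20M with these assumed sizes
```

The estimate lands within a few percent of the paper's "approximately 20 million parameters", so the stated architecture figures are internally consistent.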