Text2midi: Generating Symbolic Music from Captions

Authors: Keshav Bhandari, Abhinaba Roy, Kyra Wang, Geeta Puri, Simon Colton, Dorien Herremans

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct comprehensive empirical evaluations, incorporating both automated and human studies, that show our model generates MIDI files of high quality that are indeed controllable by text captions that may include music theory terms such as chords, keys, and tempo."
Researcher Affiliation | Academia | "¹Queen Mary University of London, ²Singapore University of Technology and Design; EMAIL, EMAIL, EMAIL"
Pseudocode | No | The paper describes the mathematical formulation and architecture of the model with figures, but it does not contain explicit pseudocode or algorithm blocks.
Open Source Code | Yes | "We release the code and music samples on our demo page for users to interact with text2midi." Code: https://github.com/AMAAI-Lab/Text2midi
Open Datasets | Yes | "MidiCaps is a dataset of 168,401 unique MIDI files with text captions (Melechovsky, Roy, and Herremans 2024). The MIDI files were originally provided in the Lakh MIDI dataset (Raffel 2016), released under the CC-BY 4.0 license. ... SymphonyNet (Liu et al. 2022) is a comprehensive dataset of symphonic music."
Dataset Splits | Yes | "We use the provided training set (~90% of the data) to train the model in our experiments. ... We consider 100 (5%) randomly selected samples from the MidiCaps test set (Melechovsky, Roy, and Herremans 2024)."
Hardware Specification | Yes | "Our models are trained on 6 NVIDIA L40S 48 GB GPUs."
Software Dependencies | No | The paper mentions the MidiTok library, the Music21 library, the FLAN-T5 model, and the Adam optimizer, along with the REMI+ tokenizer, but it does not specify version numbers for these components, nor for general dependencies such as the programming language or deep learning framework.
Experiment Setup | Yes | "For pretraining, we train for 100 epochs, with a batch size of 4 and gradient accumulation set to 4. For finetuning on MidiCaps, we trained for 30 epochs. For both runs, we use the Adam optimizer (Kingma and Ba 2014) coupled with a cosine learning rate schedule with a warm-up of 20,000 steps. For pretraining, our base learning rate is 1e-4 whereas for finetuning, we use a reduced base learning rate of 1e-6."
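The reported splits (~90% of MidiCaps' 168,401 files for training, with 100 randomly drawn test samples used for evaluation) can be sketched as follows. This is a minimal sketch, not the authors' code: the function name, seed, and shuffling strategy are illustrative assumptions, since the paper uses the splits provided with MidiCaps rather than re-splitting.

```python
import random

def make_splits(n_files, train_frac=0.90, eval_samples=100, seed=0):
    """Illustrative split: ~train_frac of files for training, the rest
    for test, then a fixed-size evaluation subset drawn from the test
    set. The exact file lists and seed are not given in the paper."""
    rng = random.Random(seed)
    indices = list(range(n_files))
    rng.shuffle(indices)
    n_train = int(n_files * train_frac)
    train, test = indices[:n_train], indices[n_train:]
    eval_subset = rng.sample(test, eval_samples)
    return train, test, eval_subset

train, test, eval_subset = make_splits(168_401)
```

Drawing the evaluation subset with a fixed seed keeps the 100-sample human/automated evaluation reproducible across runs.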
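The learning-rate schedule described in the experiment setup (a 20,000-step warm-up into a cosine schedule) can be sketched as a plain function. The linear warm-up shape and the decay-to-zero endpoint are assumptions; the paper only names the schedule type, warm-up length, and base rates.

```python
import math

def lr_at(step, total_steps, base_lr=1e-4, warmup=20_000):
    """Cosine learning-rate schedule with linear warm-up.

    base_lr=1e-4 matches the reported pretraining rate; pass
    base_lr=1e-6 for finetuning. Decaying to zero at total_steps
    is an assumption, not stated in the paper."""
    if step < warmup:
        # Linear warm-up from 0 to base_lr over `warmup` steps.
        return base_lr * step / warmup
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

In practice this per-step rate would be fed to the Adam optimizer (e.g. via a PyTorch `LambdaLR`-style scheduler), with the effective batch size of 16 coming from batch size 4 times gradient accumulation 4.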