Text2midi: Generating Symbolic Music from Captions
Authors: Keshav Bhandari, Abhinaba Roy, Kyra Wang, Geeta Puri, Simon Colton, Dorien Herremans
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive empirical evaluations, incorporating both automated and human studies, that show our model generates MIDI files of high quality that are indeed controllable by text captions that may include music theory terms such as chords, keys, and tempo. |
| Researcher Affiliation | Academia | ¹Queen Mary University of London, ²Singapore University of Technology and Design EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the mathematical formulation and architecture of the model with figures, but it does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the code and music samples on our demo page for users to interact with text2midi. Code https://github.com/AMAAI-Lab/Text2midi |
| Open Datasets | Yes | Midi Caps is a dataset of 168,401 unique MIDI files with text captions (Melechovsky, Roy, and Herremans 2024). The MIDI files were originally provided in the Lakh MIDI dataset (Raffel 2016), released under the CC-BY 4.0 license. ... Symphony Net (Liu et al. 2022) is a comprehensive dataset of symphonic music. |
| Dataset Splits | Yes | We use the provided training set (~90% of the data) to train the model in our experiments. ... We consider 100 (5%) randomly selected samples from the Midi Caps test set (Melechovsky, Roy, and Herremans 2024). |
| Hardware Specification | Yes | Our models are trained on 6 NVIDIA L40S 48 GB GPUs. |
| Software Dependencies | No | The paper mentions using the MidiTok library, the Music21 library, FLAN T5 model, and the Adam optimizer, along with specific tokenizer methods like REMI+, but does not specify version numbers for these software components or other general dependencies like programming languages or deep learning frameworks. |
| Experiment Setup | Yes | For pretraining, we train for 100 epochs, with a batch size of 4 and gradient accumulation set to 4. For finetuning on Midi Caps, we trained for 30 epochs. For both runs, we use the Adam optimizer (Kingma and Ba 2014) coupled with a cosine learning rate schedule with a warm-up of 20,000 steps. For pretraining, our base learning rate is 1e-4 whereas for finetuning, we use a reduced base learning rate of 1e-6. |
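The schedule quoted above (Adam with a cosine learning-rate schedule and a 20,000-step warm-up; base LR 1e-4 for pretraining, 1e-6 for finetuning) can be sketched as a plain learning-rate function. This is a minimal illustration, not the authors' code: the function name `lr_at_step` and the `total_steps` value are assumptions, since the paper does not report the total number of optimizer steps.

```python
import math

def lr_at_step(step, base_lr=1e-4, warmup_steps=20_000, total_steps=200_000):
    """Cosine LR schedule with linear warm-up (hypothetical sketch).

    base_lr: 1e-4 for pretraining, 1e-6 for finetuning (from the paper).
    total_steps: assumed value; not reported in the paper.
    """
    if step < warmup_steps:
        # Linear warm-up from 0 to base_lr over the first 20,000 steps.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

In practice this function would be passed to an optimizer wrapper (e.g. a per-step LR update around Adam); the sketch only shows the shape of the schedule the paper describes.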