Anticipatory Music Transformer

Authors: John Thickstun, David Leo Wright Hall, Chris Donahue, Percy Liang

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train anticipatory infilling models using the large and diverse Lakh MIDI music dataset. These models match the performance of autoregressive models for prompted generation, with the additional capability to perform infilling control tasks, including accompaniment. Human evaluators report that an anticipatory model produces accompaniments with similar musicality even to music composed by humans over a 20-second clip.
Researcher Affiliation | Collaboration | John Thickstun (EMAIL), Department of Computer Science, Stanford University; David Hall (EMAIL), Department of Computer Science, Stanford University; Chris Donahue (EMAIL), Google DeepMind and Carnegie Mellon University; Percy Liang (EMAIL), Department of Computer Science, Stanford University
Pseudocode | Yes | Algorithm 1: Anticipatory Autoregressive Sampling; Algorithm 2: Autoregressive Sampling (Baseline)
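The paper's Algorithm 2 is standard autoregressive sampling, which the anticipatory variant extends. A minimal sketch of that baseline loop, assuming a hypothetical `model(tokens)` interface that returns a next-token probability distribution (this is illustrative, not the released API):

```python
import random

def sample_autoregressive(model, prompt, max_len):
    """Sketch of baseline autoregressive sampling (cf. the paper's Algorithm 2).

    model(tokens) is assumed to return a list of probabilities over the
    vocabulary for the next token; each step samples from it and appends.
    """
    tokens = list(prompt)
    while len(tokens) < max_len:
        probs = model(tokens)  # next-token distribution given the context
        nxt = random.choices(range(len(probs)), weights=probs)[0]
        tokens.append(nxt)
    return tokens
```

Algorithm 1 differs by interleaving control tokens (e.g. a fixed accompaniment part) into the generated sequence according to their arrival times, rather than sampling every position from the model.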
Open Source Code | Yes | Contributions. We define an arrival-time encoding of events and controls that is amenable to expressive autoregressive sequence modeling and facilitates anticipation (Section 2). We describe the interleaved structure of an anticipatory autoregressive model, and how to train and sample from this model (Section 3). We apply anticipation to construct anticipatory infilling models for music, trained on the Lakh MIDI music dataset (Raffel, 2016). These models unlock new control capabilities for music generation without sacrificing the performance of unconditional generation (Section 4). We release all code for reproducing these models, along with pre-trained model weights.1
Open Datasets | Yes | We apply anticipation to construct anticipatory infilling models for music, trained on the Lakh MIDI music dataset (Raffel, 2016).
Dataset Splits | Yes | We split this dataset into train, validation, and test splits according to the leading hexadecimal digit of each file's MD5 hash. Train: hashes 0–d, 144,202 event sequences, 7,827 hours of music. Validation: hash e, 10,212 event sequences, 555 hours of music. Test: hash f, 10,333 event sequences, 561 hours of music.
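The split rule above is simple to reproduce: hash each file and route it by the first hex digit (0–d train, e validation, f test). A minimal sketch, assuming the hash is taken over the file's raw bytes:

```python
import hashlib

def assign_split(midi_bytes: bytes) -> str:
    """Assign a file to a split by the leading hex digit of its MD5 hash.

    Digits 0-d go to train, e to validation, f to test, matching the
    split rule described in the paper. Hashing the raw bytes here is an
    assumption; in Lakh MIDI the filenames are themselves MD5 hashes.
    """
    digit = hashlib.md5(midi_bytes).hexdigest()[0]
    if digit == "f":
        return "test"
    if digit == "e":
        return "validation"
    return "train"
```

Since MD5 digits are roughly uniform, this yields an expected 14/16 : 1/16 : 1/16 split, consistent with the reported sequence counts.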
Hardware Specification | Yes | The models are implemented in Jax (Bradbury et al., 2018) and trained on Google TPU v3 hardware. Most of these models were trained on TPU v3-32 pod slices, which in practice are approximately equivalent to a GPU machine with 8 NVIDIA A100s.
Software Dependencies | No | The paper mentions using Jax, the Levanter library, and the Mido library, but does not provide specific version numbers for any of these software components.
Experiment Setup | Yes | All models trained in this paper are parameterized using causal masked transformers (Vaswani et al., 2017) (decoder-only models) with a context length of M = 1024 tokens (defined in Section 3.3). We train models at three scales, following GPT-2 naming conventions (Radford et al., 2019): Small (128M parameters), Medium (360M parameters), and Large (780M parameters). Table 6: Model Configurations. Per-scale (Small / Medium / Large): Layers 12 / 24 / 36; Attention Heads 12 / 16 / 20; Hidden Dimension 768 / 1024 / 1280; Max Learning Rate 0.0006 / 0.0003 / 0.0002. Shared: Sequence Length 1024 tokens; Residual Dropout 0.1; Embedding Dropout 0.1; Attention Dropout 0.0; Weight Decay 0.1; Optimizer AdamW; Batch Size 512 sequences; Warmup 1000 steps; Learning Rate Schedule cosine decay; Gradient Clipping 1.0.
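The Table 6 hyperparameters can be transcribed as plain configuration dictionaries. A sketch, with the reported values; `approx_nonembedding_params` is a hypothetical helper using the standard rough estimate of 12·L·d² non-embedding parameters for a GPT-2-style decoder (embeddings account for the gap to the reported totals):

```python
# Per-scale hyperparameters from Table 6 (GPT-2-style decoder-only models).
CONFIGS = {
    "small":  {"params": "128M", "layers": 12, "heads": 12, "d_model": 768,  "max_lr": 6e-4},
    "medium": {"params": "360M", "layers": 24, "heads": 16, "d_model": 1024, "max_lr": 3e-4},
    "large":  {"params": "780M", "layers": 36, "heads": 20, "d_model": 1280, "max_lr": 2e-4},
}

# Hyperparameters shared across all three scales.
SHARED = {
    "seq_len": 1024,
    "resid_dropout": 0.1,
    "embd_dropout": 0.1,
    "attn_dropout": 0.0,
    "weight_decay": 0.1,
    "optimizer": "AdamW",
    "batch_size": 512,       # sequences per step
    "warmup_steps": 1000,
    "lr_schedule": "cosine",
    "grad_clip": 1.0,
}

def approx_nonembedding_params(layers: int, d_model: int) -> int:
    """Rough non-embedding parameter count for a GPT-2-style transformer."""
    return 12 * layers * d_model ** 2
```

For the Small config this estimate gives about 85M non-embedding parameters, in line with the reported 128M total once embedding matrices are included.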