Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling
Authors: Louis Bradshaw, Simon Colton
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce an extensive new dataset of MIDI files, created by transcribing audio recordings of piano performances into their constituent notes. The data pipeline we use is multi-stage... We provide an in-depth analysis of our techniques, offering statistical insights, and investigate the content by extracting metadata tags... We evaluate the effectiveness of the components in our data pipeline. Where applicable, we compare our methods to those used in previous work, in particular the GiantMIDI-Piano, ATEPP, and PiJAMA datasets. For baselines and to determine ground truth, we relied on human labels... Classification precision, recall, and F1 scores can be seen in Table 3. |
| Researcher Affiliation | Academia | Louis Bradshaw, Simon Colton Queen Mary University of London EMAIL |
| Pseudocode | No | The paper describes algorithms in text, such as the pseudo-labeling process (Figure 1 description) and the sliding-window technique for audio segmentation, but it does not contain any structured pseudocode blocks or figures explicitly labeled as algorithms. For example, it describes the audio segmentation as: 'To do this, we employ a sliding-window based technique adapted from standard approaches (Keogh et al., 2004), aimed at accurately removing non-piano content whilst being robust to short-lived classification mistakes. Given an audio recording, we score each five-second interval, sampled with a one-second stride, by passing the inputs through our model.' |
| Open Source Code | Yes | Dataset available at https://github.com/loubbrad/aria-midi. We outline a process for distilling an audio source-separation model to train a classifier capable of accurately identifying and segmenting diverse real-world piano recordings, which we open-source (https://github.com/loubbrad/aria-cl). We used a Whisper-based model, Aria-AMT (Bradshaw et al.), to transcribe the segmented audio recordings into MIDI files (https://github.com/EleutherAI/aria-amt). |
| Open Datasets | Yes | We introduce an extensive new dataset of MIDI files... Dataset available at https://github.com/loubbrad/aria-midi. Initial investigations revealed that relying on well-known datasets such as MAESTRO (Hawthorne et al., 2018) and Audio Set (Gemmeke et al., 2017) was insufficient... We also used the GiantMIDI-Piano audio files, the Jazz Trio Database (Cheston et al., 2024)... |
| Dataset Splits | No | The paper describes selecting a 'random sample of 250 videos' and 'a random sample of 250 audio recordings' for evaluating and analyzing its pipeline components, excluding those used during training. However, it does not provide explicit training, validation, or test splits, whether as percentages, counts, or references to predefined partitions, for either the released dataset or the models it trains (e.g., the audio classifier). Without fixed data partitions, the full experimental setup cannot be directly reproduced. |
| Hardware Specification | Yes | Transcription of the 100,629 hours of audio took 765 hours using an NVIDIA H100 GPU with a batch size of 128... In comparison, classification of 100,000 hours of audio using our model only took 20 A100 hours, I/O being the main bottleneck. |
| Software Dependencies | No | The paper mentions using 'the 70B parameter version of Llama 3.1 (Dubey et al., 2024)' for the language model, applying 'the MVSep Piano source-separation model (Uhlich et al., 2024; Fabbro, 2024; Solovyev et al., 2023)', and using a 'Whisper-based model, Aria-AMT (Bradshaw et al.)' for transcription. It also states the model was trained 'using the AdamW optimizer (Loshchilov and Hutter, 2019)' and results were calculated 'using the mir_eval library (Raffel et al., 2014)'. However, specific version numbers for software libraries, frameworks, or optimizers (like PyTorch version, TensorFlow version, or mir_eval version) are not provided. |
| Experiment Setup | Yes | For our solo-piano classifier, ... We trained the model for ten epochs using the AdamW optimizer (Loshchilov and Hutter, 2019) with β1, β2 = 0.9, 0.95, ϵ = 1e-6 and an L2 weight decay of 0.01. A linear learning rate scheduler was used, decaying to 10% of the initial learning rate after a warmup over the first 500 optimizer steps... The parameters d and λ control the sensitivity and minimum length of non-piano segments, which we set to 3 and 0.5 respectively. |
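The sliding-window segmentation quoted in the Pseudocode row (scoring each five-second window with a one-second stride) can be sketched as below. This is a minimal illustration, not the authors' implementation: `classify` is a stand-in for their solo-piano classifier, and the raw-sample representation of the audio is an assumption.

```python
from typing import Callable, List, Tuple

def score_windows(
    audio: List[float],
    sample_rate: int,
    classify: Callable[[List[float]], float],
    window_s: float = 5.0,   # 5-second windows, per the paper
    stride_s: float = 1.0,   # 1-second stride, per the paper
) -> List[Tuple[float, float]]:
    """Score each window of `audio` with `classify`.

    Returns (window_start_seconds, score) pairs. If the recording is
    shorter than one window, the single partial window is scored.
    """
    win = int(window_s * sample_rate)
    hop = int(stride_s * sample_rate)
    scores = []
    for start in range(0, max(len(audio) - win, 0) + 1, hop):
        score = classify(audio[start:start + win])
        scores.append((start / sample_rate, score))
    return scores
```

Thresholding these per-window scores, with the reported d and λ parameters controlling sensitivity and the minimum length of non-piano segments, would then yield the segment boundaries.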
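The learning-rate schedule described in the Experiment Setup row (linear warmup over the first 500 optimizer steps, then linear decay to 10% of the initial rate) can be written as a pure multiplier function, e.g. for use with a `LambdaLR`-style scheduler. The paper's excerpt does not state the total step count, so `total_steps` here is an illustrative assumption.

```python
def lr_scale(step: int, warmup_steps: int = 500,
             total_steps: int = 10_000, floor: float = 0.1) -> float:
    """Multiplier on the initial learning rate at optimizer step `step`.

    Linear warmup over `warmup_steps` (500, per the paper), then linear
    decay to `floor` (10% of the initial LR, per the paper) by
    `total_steps` (an assumed value; not reported in the excerpt).
    """
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 1.0 - (1.0 - floor) * min(progress, 1.0)
```

The reported AdamW settings (β1 = 0.9, β2 = 0.95, ϵ = 1e-6, weight decay 0.01) would then be passed to the optimizer alongside this schedule.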