Detecting Music Performance Errors with Transformers

Authors: Benjamin Shiue-Hal Chou, Purvish Jajal, Nicholas John Eliopoulos, Tim Nadolsky, Cheng-Yun Yang, Nikita Ravi, James C. Davis, Kristen Yeon-Ji Yun, Yung-Hsiang Lu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate Polytune and previous works on Coco Chorales-E and MAESTRO-E, which encompass 14 different instruments and a variety of performance errors. To evaluate error detection performance, we adapt the transcription F1 score commonly used in music transcription tasks (Raffel et al. 2014). We present a comparison of our method against the baseline across different categories for Error F1, precision, and recall. As shown in Tab. 3, our method generally outperforms the baseline derived from (Benetos, Klapuri, and Dixon 2012; Wang, Ewert, and Dixon 2017)."
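The Error F1 mentioned above is the standard harmonic mean of precision and recall over matched notes. A minimal sketch of that computation (the function name and count-based interface are assumptions, not the paper's code):

```python
def error_f1(n_correct, n_predicted, n_reference):
    """Note-level precision/recall/F1, in the style of transcription
    evaluation (Raffel et al. 2014): counts of correctly matched,
    predicted, and reference error notes."""
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_reference if n_reference else 0.0
    if precision + recall == 0:
        return 0.0, precision, recall
    f1 = 2 * precision * recall / (precision + recall)
    return f1, precision, recall

# e.g. 8 correctly flagged errors out of 10 predictions, 12 in the reference
f1, p, r = error_f1(8, 10, 12)
```

In practice the matching of predicted to reference notes (onset/pitch tolerances) is handled by a package such as mir_eval before these counts are taken.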
Researcher Affiliation | Academia | "Benjamin Shiue-Hal Chou, Purvish Jajal, Nicholas John Eliopoulos, Tim Nadolsky, Cheng-Yun Yang, Nikita Ravi, James C. Davis, Kristen Yeon-Ji Yun, Yung-Hsiang Lu. School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA, 47907."
Pseudocode | Yes | "Algorithm 1: MIDI Error Generation Algorithm. This algorithm introduces errors into MIDI files. Abbreviations: PC (pitch change), TS (timing shift), EN (extra note)."
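A minimal Python sketch of how such an error-generation step might look, using the three error types named above. The note representation, function name, and offset magnitudes here are illustrative assumptions, not the paper's implementation:

```python
import random

def inject_errors(notes, rate=0.2, seed=0):
    """Hypothetical sketch of Algorithm 1: randomly corrupt a fraction of
    notes. Each note is a dict with 'pitch' (MIDI number) and 'onset' (s)."""
    rng = random.Random(seed)
    corrupted = []
    for note in notes:
        note = dict(note)  # copy so the input list is untouched
        if rng.random() < rate:
            kind = rng.choice(["PC", "TS", "EN"])
            if kind == "PC":       # pitch change: shift by 1-2 semitones
                note["pitch"] += rng.choice([-2, -1, 1, 2])
            elif kind == "TS":     # timing shift: perturb the onset
                note["onset"] += rng.uniform(-0.05, 0.05)
            else:                  # extra note: insert a spurious neighbor
                corrupted.append(dict(note, pitch=note["pitch"] + 1))
        corrupted.append(note)
    return corrupted
```

With `rate=0.0` the output equals the input; with higher rates, PC/TS mutate existing notes while EN grows the list.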
Open Source Code | Yes | Code: https://github.com/ben2002chou/Polytune
Open Datasets | Yes | "Thus, we introduce an algorithm for synthetically generating errors in existing music datasets, Coco Chorales (Wu et al. 2022a) and MAESTRO (Hawthorne et al. 2018). We name the resulting augmented datasets Coco Chorales-E and MAESTRO-E, respectively."
Dataset Splits | No | "All results are based on a combined test set of 4401 tracks."
Hardware Specification | Yes | "All models were trained on an NVIDIA A100-80GB GPU running a Linux operating system. The datasets introduced in this work, MAESTRO-E and Coco Chorales-E, were generated using AMD EPYC 7713 3.0 GHz CPUs."
Software Dependencies | Yes | "We used PyTorch 2.3.0 and Hugging Face Transformers 4.40.1 for model design and training. The mir_eval package is used for evaluating Error Detection F1 scores."
Experiment Setup | Yes | "To address this imbalance, we use a weighted cross-entropy loss, as shown in Equation 1. Equation 1 defines the weighted cross-entropy loss L = (1/N) Σ_i α(y_i) · CE(y_i, ŷ_i), averaged over N tokens, where CE(y_i, ŷ_i) is the cross-entropy between the true label y_i and the prediction ŷ_i, weighted by α(y_i). For our training, α(y_i) is 10 when y_i is an error token and 1 otherwise. ... We introduce errors into each MIDI file by selecting notes according to a Poisson distribution with mean rate parameter λ, where λ is sampled from a uniform distribution U(0.1, 0.4). The selected notes are then assigned an error type, and their time and pitch are augmented accordingly. Offset magnitudes for time and pitch are sampled from two truncated normal distributions, P and Q, with mean 0 and standard deviations 1 and 0.02, respectively."
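The weighted loss in Equation 1 can be sketched in a few lines. This is a plain-Python illustration of the averaging and the α(y_i) weighting, assuming each prediction is summarized by the probability assigned to the true token; the token names and the is-error test are hypothetical:

```python
import math

def weighted_ce(true_tokens, true_token_probs, error_weight=10.0):
    """Sketch of Equation 1: per-token cross-entropy -log p(y_i),
    scaled by alpha(y_i) = 10 for error tokens and 1 otherwise,
    averaged over the N tokens in the sequence."""
    total = 0.0
    for y, p in zip(true_tokens, true_token_probs):
        alpha = error_weight if y.startswith("err") else 1.0
        total += alpha * -math.log(p)  # CE reduces to -log p of true label
    return total / len(true_tokens)
```

In a PyTorch training loop the same effect is usually achieved by multiplying the unreduced per-token cross-entropy by a weight mask before averaging.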