MLRegTest: A Benchmark for the Machine Learning of Regular Languages
Authors: Sam van der Poel, Dakotah Lambert, Kalina Kostyszyn, Tiantian Gao, Rahul Verma, Derek Andersen, Joanne Chau, Emily Peterson, Cody St. Clair, Paul Fodor, Chihiro Shibata, Jeffrey Heinz
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This article presents a new benchmark for machine learning systems on sequence classification called MLRegTest, which contains training, development, and test sets from 1,800 regular languages. ... Finally, the performance of different neural networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The main conclusion is that performance depends significantly on the kind of test set, the class of language, and the neural network architecture. Section 5 (Experiments): This section reports on the experiments that were conducted to assess the capabilities of generic neural networks to model the languages in MLRegTest. |
| Researcher Affiliation | Academia | Sam van der Poel EMAIL School of Mathematics Georgia Institute of Technology Dakotah Lambert EMAIL Department of Computer Science Haverford College Kalina Kostyszyn EMAIL Department of Linguistics & Institute of Advanced Computational Science Stony Brook University Tiantian Gao EMAIL Rahul Verma EMAIL Department of Computer Science Stony Brook University Derek Andersen EMAIL Joanne Chau EMAIL Emily Peterson EMAIL Cody St. Clair EMAIL Department of Linguistics Stony Brook University Paul Fodor EMAIL Department of Computer Science Stony Brook University Chihiro Shibata EMAIL Department of Advanced Sciences Graduate School of Science and Engineering Hosei University Jeffrey Heinz EMAIL Department of Linguistics & Institute of Advanced Computational Science Stony Brook University |
| Pseudocode | No | The paper describes methods in prose, for example, 'The Short Random Test sets generated positive strings as follows. For each length ℓ, the automaton A was constructed by intersecting the automaton for L with the automaton for Σ^ℓ, and removing the positive strings from both the training and development sets.' There are no explicitly labeled pseudocode blocks, algorithm blocks, or figures labeled as such. |
| Open Source Code | Yes | Software used to create and run the experiments in this paper is available in a GitHub repository at https://github.com/heinz-jeffrey/subregular-learning under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). |
| Open Datasets | Yes | MLRegTest is publicly available on Dryad at https://doi.org/10.5061/dryad.dncjsxm4h under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license (https://creativecommons.org/publicdomain/zero/1.0/). |
| Dataset Splits | Yes | For each language, the benchmark includes three nested training sizes with equal numbers of positive and negative examples, three nested development sizes with equal numbers of positive and negative examples, and three nested sizes of four distinct test sets with equal numbers of positive and negative examples. The above procedures produced 6 data sets (Train, Dev, SR, SA, LR, LA), each with 50,000 positive and 50,000 negative strings. We then made additional Train, Dev, SR, SA, LR, LA sets of 1/10th and 1/100th the size by downsampling. Consequently, for every language we prepared 3 training sets, 3 development sets, and 12 test sets. The sets with 100,000 words we call Large, those with 10,000 words we call Mid, and those with 1,000 words we call Small. |
| Hardware Specification | No | Language class verification, data set creation, and neural network training and evaluation were completed on the Stony Brook SeaWulf HPC cluster maintained by Research Computing and Cyberinfrastructure, and the Institute for Advanced Computational Science at Stony Brook University and made possible by NSF grant #1531492. This mentions an HPC cluster but does not provide specific hardware details such as CPU or GPU models. |
| Software Dependencies | No | The TensorFlow (Abadi et al., 2015) and Keras (Chollet et al., 2015) APIs were used throughout the experiments. To generate data sets, we used the software library Pynini (Gorman, 2016; Gorman and Sproat, 2021), which is a Python front-end to OpenFst (Allauzen et al., 2007). While software libraries are mentioned, no specific version numbers are provided for reproducibility. |
| Experiment Setup | Yes | The search was organized as follows. A representative selection of 32 languages from MLRegTest was chosen... For all architecture types and all languages in the selection, we ran an exhaustive search over all models in the following hypergrid: number of feed-forward layers (2 or 4); embedding dimension (32 or 256); learning rate (0.01 or 0.0001); dropout (0.0 or 0.1); number of epochs (32 or 64); loss function (binary cross-entropy or mean squared error); and optimizer (RMSProp, Adam, or SGD). The results of the grid search are listed in Table 5. Table 5 explicitly lists selected hyperparameters: 'Learning Rate 0.0001', 'Optimizer Adam', 'Number of Epochs 64', 'Loss Function BCE', 'Embedding Dimension 32', 'Number of Feed Forward Layers 4', 'Dropout 0.1' for different network types. It also states: 'All neural networks were trained with a batch size of 64 and used binary cross-entropy (BCE) loss.' |
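The nested Large/Mid/Small split construction quoted under "Dataset Splits" (100,000 → 10,000 → 1,000 words, class-balanced, with each smaller set contained in the larger) can be sketched as follows. This is a minimal illustration, not the paper's actual data-generation code; the function and variable names are our own.

```python
import random

def nested_downsample(pos, neg, fractions=(1.0, 0.1, 0.01), seed=0):
    """Build nested, class-balanced subsets of positive/negative strings.

    Shuffling once and taking prefixes guarantees the nesting property:
    the 1/100th set is contained in the 1/10th set, which is contained
    in the full set. A sketch of the benchmark's Large/Mid/Small idea.
    """
    rng = random.Random(seed)
    pos = rng.sample(pos, len(pos))  # shuffled copy
    neg = rng.sample(neg, len(neg))
    splits = []
    for frac in fractions:
        k = int(len(pos) * frac)
        splits.append((pos[:k], neg[:k]))  # prefixes of one shuffle nest
    return splits

# Toy example with 1,000 strings per class instead of 50,000.
large, mid, small = nested_downsample(
    [f"p{i}" for i in range(1000)],
    [f"n{i}" for i in range(1000)],
)
```

Taking prefixes of a single shuffle (rather than re-sampling independently at each size) is one simple way to get the nesting the paper describes.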