MLRegTest: A Benchmark for the Machine Learning of Regular Languages
Authors: Sam van der Poel, Dakotah Lambert, Kalina Kostyszyn, Tiantian Gao, Rahul Verma, Derek Andersen, Joanne Chau, Emily Peterson, Cody St. Clair, Paul Fodor, Chihiro Shibata, Jeffrey Heinz
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This article presents a new benchmark for machine learning systems on sequence classification called MLRegTest, which contains training, development, and test sets from 1,800 regular languages. ... Finally, the performance of different neural networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The main conclusion is that performance depends significantly on the kind of test set, the class of language, and the neural network architecture. Section 5 (Experiments): This section reports on the experiments that were conducted to assess the capabilities of generic neural networks to model the languages in MLRegTest. |
| Researcher Affiliation | Academia | Sam van der Poel EMAIL School of Mathematics Georgia Institute of Technology Dakotah Lambert EMAIL Department of Computer Science Haverford College Kalina Kostyszyn EMAIL Department of Linguistics & Institute of Advanced Computational Science Stony Brook University Tiantian Gao EMAIL Rahul Verma EMAIL Department of Computer Science Stony Brook University Derek Andersen EMAIL Joanne Chau EMAIL Emily Peterson EMAIL Cody St. Clair EMAIL Department of Linguistics Stony Brook University Paul Fodor EMAIL Department of Computer Science Stony Brook University Chihiro Shibata EMAIL Department of Advanced Sciences Graduate School of Science and Engineering Hosei University Jeffrey Heinz EMAIL Department of Linguistics & Institute of Advanced Computational Science Stony Brook University |
| Pseudocode | No | The paper describes methods in prose, for example, 'The Short Random Test sets generated positive strings as follows. For each length ℓ, the automaton A was constructed by intersecting the automaton for L with the automaton for Σ^ℓ, and removing the positive strings from both the training and development sets.' There are no explicitly labeled pseudocode blocks, algorithm blocks, or figures labeled as such. |
| Open Source Code | Yes | Software used to create and run the experiments in this paper is available in a GitHub repository at https://github.com/heinz-jeffrey/subregular-learning under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). |
| Open Datasets | Yes | MLRegTest is publicly available on Dryad at https://doi.org/10.5061/dryad.dncjsxm4h under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license (https://creativecommons.org/publicdomain/zero/1.0/). |
| Dataset Splits | Yes | For each language, the benchmark includes three nested training sizes with equal numbers of positive and negative examples, three nested development sizes with equal numbers of positive and negative examples, and three nested sizes of four distinct test sets with equal numbers of positive and negative examples. The above procedures produced 6 data sets (Train, Dev, SR, SA, LR, LA), each with 50,000 positive and 50,000 negative strings. We then made additional Train, Dev, SR, SA, LR, LA sets of 1/10th and 1/100th the size by downsampling. Consequently, for every language we prepared 3 training sets, 3 development sets, and 12 test sets. The sets with 100,000 words we call Large, those with 10,000 words we call Mid, and those with 1,000 words we call Small. |
| Hardware Specification | No | Language class verification, data set creation, and neural network training and evaluation were completed on the Stony Brook SeaWulf HPC cluster maintained by Research Computing and Cyberinfrastructure, and the Institute for Advanced Computational Science at Stony Brook University and made possible by NSF grant #1531492. This mentions an HPC cluster but does not provide specific hardware details such as CPU or GPU models. |
| Software Dependencies | No | The TensorFlow (Abadi et al., 2015) and Keras (Chollet et al., 2015) APIs were used throughout the experiments. To generate data sets, we used the software library Pynini (Gorman, 2016; Gorman and Sproat, 2021), which is a Python front-end to OpenFst (Allauzen et al., 2007). While software libraries are mentioned, no specific version numbers are provided for reproducibility. |
| Experiment Setup | Yes | The search was organized as follows. A representative selection of 32 languages from MLRegTest was chosen... For all architecture types and all languages in the selection, we ran an exhaustive search over all models in the following hypergrid: number of feed-forward layers (2 or 4); embedding dimension (32 or 256); learning rate (0.01 or 0.0001); dropout (0.0 or 0.1); number of epochs (32 or 64); loss function (binary cross-entropy or mean squared error); and optimizer (RMSProp, Adam, or SGD). The results of the grid search are listed in Table 5. Table 5 explicitly lists selected hyperparameters: 'Learning Rate 0.0001', 'Optimizer Adam', 'Number of Epochs 64', 'Loss Function BCE', 'Embedding Dimension 32', 'Number of Feed Forward Layers 4', 'Dropout 0.1' for different network types. It also states: 'All neural networks were trained with a batch size of 64 and used binary cross-entropy (BCE) loss.' |
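The nested Large/Mid/Small split construction quoted under "Dataset Splits" (100,000 → 10,000 → 1,000 words, class-balanced, with each smaller set contained in the larger) can be sketched as follows. This is a minimal illustration, not the paper's actual data-generation code; the function and variable names are our own.

```python
import random

def nested_downsample(pos, neg, fractions=(1.0, 0.1, 0.01), seed=0):
    """Build nested, class-balanced subsets of positive/negative strings.

    Shuffling once and taking prefixes guarantees the nesting property:
    the 1/100th set is contained in the 1/10th set, which is contained
    in the full set. A sketch of the benchmark's Large/Mid/Small idea.
    """
    rng = random.Random(seed)
    pos = rng.sample(pos, len(pos))  # shuffled copy
    neg = rng.sample(neg, len(neg))
    splits = []
    for frac in fractions:
        k = int(len(pos) * frac)
        splits.append((pos[:k], neg[:k]))  # prefixes of one shuffle nest
    return splits

# Toy example with 1,000 strings per class instead of 50,000.
large, mid, small = nested_downsample(
    [f"p{i}" for i in range(1000)],
    [f"n{i}" for i in range(1000)],
)
```

Taking prefixes of a single shuffle (rather than re-sampling independently at each size) is one simple way to get the nesting the paper describes.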