Training Neural Networks as Recognizers of Formal Languages
Authors: Alexandra Butoi, Ghazal Khalighinejad, Anej Svete, Josef Valvoda, Ryan Cotterell, Brian DuSell
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings, using a general method that can be applied to a wide variety of languages. As part of this, we extend an algorithm recently proposed by Snæbjarnarson et al. (2025) for efficient length-controlled sampling of strings from regular languages. We provide results on a variety of languages across the Chomsky hierarchy for three neural architectures: a simple RNN, an LSTM, and a causally-masked transformer. We find that the RNN and LSTM often outperform the transformer, and that auxiliary training objectives such as language modeling can help, although no single objective uniformly improves performance across languages and architectures. |
| Researcher Affiliation | Academia | 1ETH Zurich, 2Duke University, 3University of Copenhagen |
| Pseudocode | Yes | Algorithm 1 Convert a partial DFA A to a WDFA A^D over the n_max-th-order binning semiring with respect to the log semiring. This implicitly creates an intermediate PDFA A over the real semiring with uniform probabilities. Time complexity: O((|Q| + |δ|)·n_max). [...] Algorithm 7 Sample a string of length n from a WDFA A^D over the D-th-order binning semiring with respect to the real semiring, where D ≥ n. Also output a sequence of n + 1 next-symbol sets. The DFA A^D must be the first output of Algorithm 2, and NEXT must be the output of Algorithm 6. Time complexity: O(n log |δ|_out), where |δ|_out is the maximum out-degree of A^D. |
| Open Source Code | Yes | We have publicly released our datasets as a benchmark called FLaRe (Formal Language Recognition), along with our code.2 ... 2https://github.com/rycolab/neural-network-recognizers |
| Open Datasets | Yes | We have publicly released our datasets as a benchmark called FLaRe1 (Formal Language Recognition), along with our code.2 ... 1https://github.com/rycolab/flare |
| Dataset Splits | Yes | For each language, we sample a single, fixed training set of 10k examples with lengths in [0, 40]. We run separate experiments with two different validation sets that are designed to address different scientific questions, each with 1k examples: a short validation set with string lengths in [0, 40], and a long validation set with string lengths in [0, 80]. ... In Table 2, we show recognition accuracy on a test set with string lengths in [0, 500]. It has 5,010 examples, or an average of 10 examples per length. |
| Hardware Specification | No | Our experiments are small enough that we are able to run them in CPU mode, without GPU acceleration. |
| Software Dependencies | No | Our implementations for all three architectures are based on those provided by PyTorch (Paszke et al., 2019). |
| Experiment Setup | Yes | We use 5 layers for all models in all experiments. We automatically adjust the hidden vector size d so that the number of parameters is as close as possible to 64k, excluding extra parameters for language modeling and next-symbol prediction heads. ... Wherever dropout is applicable, we use a dropout rate of 0.1. ... We randomly sample the batch size B from a uniform distribution over [128, 4096]. We randomly sample the initial learning rate from a log-uniform distribution over [0.0001, 0.01]. We randomly sample the loss term coefficients λLM and λNS, when they are needed, from a log-uniform distribution over [0.01, 10]. We train each model by minimizing the loss function defined in §3.3 using Adam (Kingma & Ba, 2015). We clip gradients with a threshold of 5 using L2 norm rescaling. We multiply the learning rate by 0.5 after 5 checkpoints of no decrease in recognition cross-entropy on the validation set, and we stop early after 10 checkpoints of no decrease. We select the checkpoint with the lowest recognition cross-entropy on the validation set when reporting results. We train for a maximum of 1k epochs. |
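The length-controlled sampling quoted in the Pseudocode row rests on a standard idea: count, by dynamic programming, how many accepted strings of each length remain from each DFA state, then sample transitions in proportion to those counts. The sketch below illustrates that idea in plain Python; the dict-based DFA encoding and the function names are ours, and this is not the authors' semiring-based implementation, which additionally handles log-space weights and next-symbol sets.

```python
import random

def count_strings(dfa, n_max):
    """Dynamic programming over lengths:
    counts[k][q] = number of strings of length k accepted starting from state q."""
    counts = [{q: (1 if q in dfa["accept"] else 0) for q in dfa["states"]}]
    for _ in range(n_max):
        prev = counts[-1]
        # One more symbol: sum completion counts over all outgoing transitions.
        cur = {q: sum(prev[r] for (p, _), r in dfa["delta"].items() if p == q)
               for q in dfa["states"]}
        counts.append(cur)
    return counts

def sample_string(dfa, counts, n):
    """Sample uniformly at random among the accepted strings of exactly length n."""
    q = dfa["start"]
    out = []
    for k in range(n, 0, -1):
        # Weight each transition by the number of length-(k-1) completions it admits.
        options = [(a, r) for (p, a), r in dfa["delta"].items() if p == q]
        weights = [counts[k - 1][r] for _, r in options]
        total = sum(weights)
        if total == 0:
            raise ValueError(f"no accepted string of length {n}")
        pick = random.randrange(total)
        for (a, r), w in zip(options, weights):
            if pick < w:
                out.append(a)
                q = r
                break
            pick -= w
    assert q in dfa["accept"]
    return "".join(out)
```

For example, for the language (ab)* the only accepted string of length 4 is "abab", and the sampler returns it with probability 1; for richer languages the counts make the draw uniform over all accepted strings of the requested length.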
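The random hyperparameter search described in the Experiment Setup row can be sketched as a small sampler; the function names and dict layout below are ours, and this illustrates only the stated sampling distributions (uniform batch size, log-uniform learning rate and loss coefficients), not the authors' training harness.

```python
import math
import random

def log_uniform(lo, hi, rng=random):
    """Sample from a log-uniform distribution on [lo, hi]."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

def sample_hyperparams(rng=random):
    """Draw one configuration per the paper's stated ranges."""
    return {
        "batch_size": rng.randint(128, 4096),      # uniform over [128, 4096]
        "learning_rate": log_uniform(1e-4, 1e-2, rng),
        "lambda_lm": log_uniform(0.01, 10, rng),   # language-modeling loss coefficient
        "lambda_ns": log_uniform(0.01, 10, rng),   # next-symbol loss coefficient
    }
```

A log-uniform draw spreads samples evenly across orders of magnitude, which is the usual choice for learning rates and loss-term coefficients whose plausible values span two decades.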